
HP Vertica

I've been using HP Vertica since before HP bought the company.  I first wrote about Vertica in 2010 when I covered column-oriented databases.  I'm again using Vertica in my daily responsibilities and I decided to try to get my Vertica certification.  I'm not a Vertica expert, but I find taking certification exams helps you understand the gaps in your knowledge of a product.  For instance, I passed my SNIA (SAN) certification on the first try with a perfect score.  However, this was just dumb luck.  I guessed at every question on NAS and tape storage, and the right answers were fairly obvious.  Since I'm a data professional, not a full-time SAN guy, I don't have much need for tape and NAS, so I really didn't care to learn more about those topics.  But it was interesting learning what SNIA feels is important in the storage world.  

In the process of getting my Vertica certification I thought it would be wise to put together a blog series on Vertica for anyone else who wants to learn this technology rapidly in a hands-on fashion.  In these blog posts I'll cover what Vertica is and how to install it, we'll migrate AdventureWorks to it, and we'll do lots of fun labs along the way.  The posts are geared toward data folks who are familiar with MS SQL Server.  I specifically approach learning Vertica by calling out its differences from SQL Server. 

You'll find this is a lot of fun.  Vertica is pretty cool.  

Briefly, what is HP Vertica?

It is a column-oriented, compressed, relational DBMS.  Many people consider it a NoSQL solution, but it uses a dialect of SQL for its manipulations.  It is clustered across grids and nodes like many distributed NoSQL solutions, with built-in data redundancy (called K-safety), which leads to the typical marketing gimmick of "requiring no DBAs".  I can assure you that it performs QUITE NICELY for read-intensive workloads.  There is also an entire analytics platform that comes with Vertica...roughly equivalent to SSAS.  

What makes Vertica unique is that it persists data grouped by column rather than by row, as traditional RDBMS offerings do.  If you think of a table as a spreadsheet, then retrieving a single row is an ISAM-type of operation.  This works very well for OLTP applications.  But in some reporting applications it is possible that you care more about an entire column than about the individual rows.  If your query is something like "give me the total sales for this year" then querying a columnstore will result in far less IO and will run radically faster.  
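To make that concrete, here is a minimal sketch using a hypothetical fact_sales table (the table and column names are mine, not from any sample database).  A column-oriented engine only has to read the two columns this query touches, while a row store has to plow through every column of every row:

    -- Hypothetical table and column names, for illustration only.
    -- A column store reads just sale_date and sales_amount to answer this.
    SELECT SUM(sales_amount) AS total_sales
    FROM   fact_sales
    WHERE  sale_date >= '2014-01-01';

In SQL Server 2012 and later you can get a similar effect by adding a columnstore index to the table; in Vertica, column orientation is simply how every table (technically, every projection) is stored.  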

Even the amount of disk used to store the data will be less.  Columnstores compress much better because like data types are grouped together.  Basically, there are more rows than columns, and each row needs "offset" information to store the different datatypes together.  You'll need fewer offset markers if you organize your storage by column.  TANSTAAFL (there ain't no such thing as a free lunch), as economists say...the cost is that updates to existing rows require one write per column in the table.  So a columnstore is probably not best for your OLTP tables.  

That's a quick overview and is not all-inclusive.  I'll cover lots more in the next few posts.  

What's it cost?

The Community Edition is free for up to 1TB of data.  And if you consider that compressed columnstores generally take up a fraction of the space of a traditional row-oriented DBMS...that's actually a lot of data.  

HP Vertica Documentation

Vertica documentation in HTML and PDF format

Certification Process

Certification Exam Prep Guide

Getting Started

You need to register with HP to be able to download the Community Edition.  You can do this at my.vertica.com.  After you register you'll immediately be able to download the CE (or the drivers, Management Console, VMs, AWS documentation, etc).  

In the next post I'll cover how to install Vertica on Ubuntu.  You can of course download a VM instead but I believe the best way to learn a product is to start by actually installing the bits.  


You have just read HP Vertica on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

DevOps in the real world: Who should be responsible for index maintenance?

Larger ISVs usually employ at least two types of data professionals, the data developer and the database administrator.  In some situations, especially if DevOps is practiced, the lines may be a bit blurred regarding responsibilities.  I find it interesting in my consulting engagements which data professional is responsible for index maintenance.  Who is responsible for indexes at your company?  Who decides what indexes should exist?  Who creates them?  Who ensures proper index hygiene occurs (defragmentation, manual update stats)?  Which group should be responsible for these tasks?  

Informally I've found at my engagements that when both groups exist, it is the DBAs who are responsible for all aspects of indexing.  I think this is just flat-out wrong, though I'm probably in the minority on this.  Here's my reasoning:  

  • DBAs probably don't know all of the data access paths in the application, or which ones are the most important to optimize.  Unfortunately your developers may not either, but that's a different issue.  
  • If you practice some variant of "database change review", where any and all schema changes to a database go through a review system to ensure they follow standards and are modeled sensibly, then you should have a good feel for what indexes should exist before those tables hit your production systems.  
  • If you are doing performance testing then your developers should further understand the data access patterns.  Performance testing won't find every issue but the most egregious issues should be caught.  
  • Too many shops allow production DBAs to alter indexing schemes without consulting the data developers at all.  In EVERY case I've seen disaster because of this.  Here are a few examples:
    • We implemented queue tables as heaps.  The DBAs didn't like that, so they added clustered indexes, and our queues experienced odd locking and concurrency behavior.  Contrary to what Google will tell you, sometimes heaps are good.  
    • DBAs applied indexes based on recommendations in the missing index DMVs and the plan cache.  The net result was cases where the same column in a table was the leading index column for 5 or more indexes, differing only by INCLUDEd columns.  These DMVs must be tempered with common sense and knowledge of your data.  
    • DBAs removed indexes because the cardinality at the time they checked the index was LOW.  It's common for some tables to have a ProcessedStatusCode column that is binary (processed/not processed).  Throughout most of the day every row is in the processed state (1) until overnight batch processing starts.  Indexes on a ProcessedStatusCode column are critical, albeit counter-intuitive.  Many people believe that indexes on low-cardinality columns are useless, but in situations like this they are required (see the sketch just after this list).  
    • DBAs removed indexes because sys.dm_db_index_usage_stats showed that data access was heavily skewed to updates and very little to seeks/scans/lookups.  The problem is, the indexes were needed during month-end processing and were suddenly gone when they were required.  
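To illustrate the ProcessedStatusCode scenario above, here is a minimal sketch of one way to handle it in SQL Server: a filtered index that only contains the unprocessed rows.  The table and column names are hypothetical, not from any client system:

    -- Hypothetical queue-style table; names are illustrative only.
    -- Most rows sit at ProcessedStatusCode = 1 all day, so index only the
    -- unprocessed rows that the overnight batch jobs actually hunt for.
    CREATE NONCLUSTERED INDEX IX_WorkQueue_Unprocessed
        ON dbo.WorkQueue (ProcessedStatusCode, QueuedUtc)
        INCLUDE (WorkItemID)
        WHERE ProcessedStatusCode = 0;

An index like this will look nearly useless in sys.dm_db_index_usage_stats for most of the day (all writes, no reads), which is exactly why dropping indexes based on a casual glance at the DMVs leads to the batch-window surprises described above.  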

Contrary to what you are probably thinking, this blog post is not an "us vs them", anti-DBA rant.  "But you said above that it is flat-out wrong that DBAs are responsible for indexing."  

Right!

The resolution to all of the issues above is simple.  Collaboration.  Take the last example, where important indexes were missing during month-end processing.  Clearly the DBAs had some valid reasons to remove the indexes and the data devs had valid reasons for needing them.  After our month-end processing went from 12 hours to 7 days due to the missing indexes, and after everyone stopped playing the blame game and hootin' and hollerin', we collaborated and determined that we could actually build those indexes a day or so before month-end processing and then remove them afterward.  These indexes were huge and the chargeback for the SAN storage was not insignificant.  

This was a non-zero-sum event.  We managed to save on storage, our system was faster during critical month-end processing (those indexes were less fragmented), and the system was marginally faster during standard daily processing where the update overhead to those indexes could be eliminated. So, here's a case where a bit of collaboration led to one of those truly rare "win-win" situations.  And that's what we all want during conflict resolution.  We don't want to hurt anyone's feelings.  
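The mechanics of that compromise were simple enough to sketch.  The index and table names below are hypothetical and the real job covered several indexes, but the shape was just this: build shortly before the close, drop afterward:

    -- Hypothetical names; scheduled a day or so before month-end processing.
    CREATE NONCLUSTERED INDEX IX_GLDetail_MonthEnd
        ON dbo.GLDetail (PostingPeriod, AccountID)
        INCLUDE (Amount)
        WITH (ONLINE = ON);   -- assumes Enterprise Edition; omit the option otherwise

    -- ...month-end processing runs against a freshly built, unfragmented index...

    -- Scheduled after the close to reclaim storage and remove the write overhead.
    DROP INDEX IX_GLDetail_MonthEnd ON dbo.GLDetail;
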

(I found this graphic on an EXTREMELY popular DevOps site.  Maybe I'm being picky but I still see a separation of Dev and Ops.  Where is the collaboration?  I don't understand how this graphic couldn't be titled "Waterfall")

This is just another example where DevOps works.  In my last few blog posts I pointed out that there is a misconception that DevOps means combining developers and ops guys.  It doesn't.  It is about situations like this.  Nobody is perfect.  I've worked in just as many shops where EVERY column in a table had an index because the developer figured that if a few indexes are good then more must be better.  Indexing is one of those cases where it really is difficult to determine which group should have ultimate responsibility, the developers or the DBAs.  Simple collaboration and following DevOps is the key.  

You have just read another DevOps post on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

Open Source Licensing

I recently wrote Git Gotchas for the Savvy Svn'r where I mentioned in passing that some companies like monolithic software repos because it aids in Black Duck auditing.  I then received a couple of emails asking what this was and why it was important.  I also recently had a consulting engagement where the improper use of open source software had the potential to cost the client a large portion of its revenue.  At around the same time Heartbleed (the OpenSSL bug) came to light and suddenly there was an "open source vs closed source" debate.  All of this is important and you should understand it if you want to be a good architect.  

I don't want to get into a religious debate about open vs closed source.  Pick whatever works for you.  Rather, the underlying licensing of any software you are using in your projects is critical.  I consult for various ISVs and all of them scrutinize third-party licensing of closed source software if they are going to redistribute any portion of it under their umbrella.  I've mentioned in the past (How to save a *lot* of money with SQL Server Express) how a lawyer for an ISV I worked for really scrutinized how I was using SQL Express.  Eventually all parties (including M$) were satisfied that I was breaking no licensing terms, but my point is that when it comes to redistributing closed source software we, as an industry, tend to scrutinize everything.  

That is good.  But we don't tend to scrutinize open source software that we use.  And we should.  

Black Duck audits essentially look at your software portfolio for areas where your OSS (open source software) is out-of-compliance with the underlying licensing.  If you use OSS, you should be familiar with Black Duck, or something similar.  If you think Open Source = Free Software...you have a BIG problem.  There are tons of free and open source software licenses.  Frankly, I don't understand them all, and I'm not a lawyer so I won't attempt to explain the nuances you'll find in them.  But there are two basic types...these are my words..."free free" (example: Apache licensing) and "copyleft" ("free, if you're careful"...example: GPL).  

(screenshot of wikipedia's entry for comparison of free and open source licenses.  Note that this is just the licenses at the beginning of the alphabet.)

 

My Personal Feelings

I have personal reservations about Intellectual Property (IP) rights.  But even though I don't like IP, I have to respect it in my daily dealings.  Open source licensing, in general, disgusts me because if you are a bit lax, even accidentally, your OSS vendor can swoop in and claim large portions of your profits as its own, legally.  In many regards these licenses are far more punitive and Draconian than anything I've ever seen from a closed-source vendor like M$ or Oracle.  Why is this?

OSS vendors have to Make Money.  Choice-of-license is how they do it

Some people believe that OSS vendors choose the open source model because of some kind of altruism.  "We provide our software for free to make the world a better place."  Bullshit!  Every company has to make money.  How a vendor chooses to license its product is how it makes it.  There are open source licenses that foster "building a community" prior to building sustaining revenue (example: Apache licensing) and there are licenses meant to generate revenue first, often sneakily (example: GPL).  

Let's take two examples.  Company A develops some software and decides the best way it can make money is to build a community of avid users of the software.  "Build it and they will come", if you will.  The Apache License best meets that goal.  As a business owner and independent consultant for ISVs I look first for software with Apache licensing (or BSD, MIT, or similar).  These licenses are business-friendly.  I'm not an attorney and I bear no responsibility for any misinformation given in this post, but essentially an Apache License lets the licensee do almost anything it wants with the software, as long as the copyright notice is preserved.  It's about as close to free as it gets.  No royalties necessary.  In this case Company A believes the Apache License affords it the best method of gaining market momentum.  So, how does Company A make money if its software is "free"?  There are many ways, but in-app advertising and paid services (support) are the two most common.  This example is straightforward and is what most people think of with OSS.  You really have to do something stupid to get in trouble using this variety of OSS.  

GPL, "Copyleft", and Derivatives

Company B would rather sustain its business with licensing revenue (just like Oracle or M$) but it still wants to call itself OSS.  Just my opinion, but this is sneaky and bordering on dishonest.  Here's how it works.  Company B chooses a "copyleft" licensing scheme like GPL or AGPL.  Again, I'm not a lawyer, but copyleft works much like an Apache license except that derived works must also be released as open-sourced, copyleft works.  Said differently, the license must be moved forward.  

Huh?  Company C is an ISV that would like to use some open source software to augment its closed source software.  If Company C uses an Apache-licensed piece of software then it may continue to keep its software closed-source without paying royalties.  But if it uses GPL-licensed software then Company C *must* open source its own software or it will be in violation of the GPL.  That means the GPL "moves forward".  

So, how does Company C use GPL'd software in its closed-source product?  Carefully.  "Copyleft" ISVs like Company B will offer both a GPL version of their product and another version with a non-open-source license that the licensee can use in its closed-source product without having to open source everything or violate the GPL.  The GPL'd version is still "free"; the alternatively-licensed version can have whatever royalties and terms it desires, just like a closed source product.  By "version" I don't mean a separate set of binaries (like Windows Standard vs Enterprise), rather a different "version" of the license. 

A True Horror Story

(I find this logo less-than-truthful)
(Note that "free" means "freedom", which is nebulous.  "Free" does not mean "no cost".)

An ISV I worked for loved using OSS whenever it could.  The OSS always used Apache licensing (or equivalent) whenever we redistributed our code.  If it was software used in-house for administration or development only then nobody really cared about the licensing covenants.  There was a secret push to begin using a NoSQL alternative to SQL Server for some aspects of data persistence for a skunkworks project.  Cost was the reason.  SQL Server was deemed too expensive.  I was not involved in the evaluation or recommendation of the OSS product but a company-that-shall-remain-nameless gave a demo of their "free" distributed document store software to the skunkworks team.  I was not invited.  The product worked well and our developers coded a bunch of stuff against it that was not yet released.  

There are industry horror stories with this particular OSS vendor.  This vendor is known for touting its free OSS and then strong-arming ISVs into huge licensing fees later, once the ISV's software is coded and released.  You either pay this vendor for the closed-source license, or you open source your ENTIRE application.  Even if only a small corner of your application uses the vendor's product, they threaten that your WHOLE APPLICATION needs to be open sourced.  Most ISVs get scared and do not want to open source their product and rely solely on services revenue.  That would be a dramatic change for most ISV business models.  So these ISVs usually relent and pay the equivalent of shakedown money for a non-free, non-GPL license for the product.  It is either that, or rip out the GPL'd product and start over, which is equally unappealing.  

Soon the secret was out that there was a group developing against a new data persistence engine.  This is when I found out.  I knew this OSS package was GPL and its use could radically alter our business model.  The vendor would not give me a straight answer as to exactly how much of our product would need to be open-sourced if we took on their product.  At this point I could've gotten our legal department involved but that tends to be slow.  Much better to have the vendor speak to our management team directly.  

I organized another presentation with the NoSQL vendor and they started out touting the free, open source nature of their product, case studies of its success, etc.  This time the audience was the senior architects and the management team.  About 10 minutes into it I raised my hand and asked very pointedly, "How much will your product cost us in licensing if we choose NOT to open source our product under the terms of the GPL?"  The vendor attempted to deflect but I kept pushing.  Eventually the vendor opened a different Prezi that discussed alternative closed source licensing arrangements.  Clearly they had different presentations based on where the customer was in its implementation of their product.  The vendor did not realize that I did my homework and knew this was going to happen.  Neither did our management.  Or our architects.  "Deer in the headlights" does not do justice to the looks on everyone's faces.  

The alternative licensing was MORE EXPENSIVE per node than SQL Server, not to mention it was a distributed document store, which would have meant LOTS of little licenses for lots of little nodes.  It was eye-opening for management.  There would be no cost savings.  We quickly had our developers pick a document store that was Apache licensed and we delayed our release for a few months until we could rewrite.  

From then on there were policies on which OSS licenses we could use and rules regarding what could be redistributed.  Our Legal Department was now required to sign off on ANY third party tool where the code was checked in and integrated with our source code.  This event scared management so much that they brought in Black Duck Software to make sure our existing software portfolio wasn't hiding any other potential OSS timebombs.  (It was.)  

Since then, customers of our software have also performed the equivalent of Black Duck audits on our software to make sure we were being thorough and honest.  This is mission-critical software.  With GPL software it isn't just the licensee that can get caught up in a licensing battle, it is the "licensee or assigns", which means anyone who purchases our software.  The ramifications of this are really scary.  You could purchase a piece of software and, through no fault of your own, find yourself on the receiving end of a lawsuit because your vendor was improperly using GPL'd software.  This is another reason why IP rights are so dangerous.  

Choosing Licensing Carefully...Example: Postgresql vs MySQL

When I work with OSS in my personal life I pay ZERO attention to the licensing.  I use what is best for the task at hand.  In my professional life, working for ISVs, I'm much more careful.  Take Postgresql vs MySQL.  If you ask 10 experts you'll likely find that 50% prefer Postgres and 50% prefer MySQL.  No big difference regarding performance or feature-completeness.  They are both very awesome DBMSs.  Usually the choice comes down to comfort and prior experience or specific feature requirements.  That's totally anecdotal of course.  YMMV.  

However, if I need to propose which one to use for an ISV project I'm ALWAYS proposing Postgres.  Every time.  Why?  Licensing.  Postgres uses the permissive PostgreSQL License (similar to MIT, BSD, and Apache...basically just don't alter the copyright notice and you are free to do whatever you want with it).  MySQL is GPL'd and Oracle (who owns MySQL) has some goofy rules about when a commercial license is required for redistribution.  I'm hesitant to risk another GPL nightmare by choosing MySQL unless there is some compelling feature requirement.  

This is just my opinion but I believe that over time Postgres will have a higher installed base than MySQL, merely due to licensing differences.  Most OSS that needs data persistence will offer a choice in backends...usually MySQL, Postgres, or SQLite.  In these cases I choose Postgres because I believe the licensing will eventually cause it to pull away from MySQL and be the preferred data persistence engine.  SQLite is totally public domain (no license AT ALL) so it may even fare better.  

Is Apache licensing always preferred to GPL-style licensing?

Not so fast.  If you aren't redistributing or selling your software then it probably doesn't matter what you use.  Check with an expert or a lawyer though.  

Apache licensing has problems too, IMHO, that you should be aware of.  The business model for Apache-licensed software always brings its sustainability into question.  An example is OpenSSL.  It carries a permissive, Apache-style license and is supported by the equivalent of a handful of full-time developers.  It has almost ZERO revenue to sustain the software.  Something like Heartbleed was bound to happen under this situation.  OpenSSL has a large, avid following.  Cisco and M$ use it...heck, it's embedded in EVERYTHING.  And very few users have given ANY money to the project to sustain it because it is permissively licensed (IMHO) and therefore viewed as totally free (without cost).  "Let someone else pay for it."  The very backbone of security on the internet is in the hands of a few OpenSSL developers because no one wants to contribute money.  It is an 800 lb gorilla.  In this case maybe GPL software is a bit better, if the price isn't too steep.  And since Heartbleed there are some new GPL'd alternatives to OpenSSL that may be safer and more sustainable simply because the revenue-generation model is better.  Only time will tell of course.  Others have pointed out that something as critical as OpenSSL should be closed source anyway.  Again, time will tell.  Perhaps the best answer is a consortium of vendors that can take over OpenSSL and give it the care and feeding it needs, without resorting to GPL.  I'm certainly no expert in any of this.  But it is fascinating.  

Why is this so gd complicated?  Another story

Now you see why I think IP rights cause more grief than they solve and why EVERYTHING I do on this website is public domain with ZERO licensing, not even Apache licensing.  (I always appreciate an attribution, but I'm not going to sue you if you don't).  It's just not worth it to me to deal with all of this licensing.  I don't even have a copyright notice on this site, although I probably should.  

I do some side work as a Drupal consultant.  Drupal is GPL.  Yet there is a lot of Drupal code out there that is available only for a fee.  Especially themes.  People often wonder how that arrangement is possible.  If GPL forces derivative products to also be GPL'd, then how can a themer charge for a Drupal theme?  Very easily.  The themer must provide the source code for any theme, and therefore the customer can do anything with it thereafter, including give it away for free.  

Sounds goofy doesn't it?  

I once bought a theme for $50 because it had everything my customer wanted...cool graphics, the ability to skin minutiae, it was "responsive" (the ability to resize dynamically to the new form-factor devices coming to market daily), and it had lots of other features not available elsewhere.  After I got the code for the theme I realized I had been sold a lemon.  The theme did nothing advertised and had ZERO documentation.  I politely asked for my money back and was denied by the vendor, who spoke little English.  I then threatened to simply follow the terms of the GPL license, modify a few bits, and re-release the new theme as my own and give it away for free.  

I had his attention now.  He told me that was not legal, that his theme was his, not mine.  "Nope, sorry, you supplied me with a Drupal theme.  Drupal is GPL, therefore your theme is GPL too.  I can modify it however I choose and re-release it to the world.  I can't wait to give your hard work away under my name."  Needless to say my $50 was promptly refunded.  

Summary

There is no right or wrong answer to any of this.  The takeaway is that open source software is not, usually, cost-free software.  And these vendors aren't always altruistic just because they are OSS.  You must spend some time and understand what you are getting yourself into when you decide to use OSS.  In most cases you will be fine.  If you are an ISV, be a bit more careful.  If you are an ISV that produces OSS then I implore you to be a little more open and honest about how your licensing really works.  Be forthright that you have multiple licenses for different use cases and note the pricing clearly.  Not every small ISV has a cadre of lawyers that understand open source licensing, so it would be nice if GPL and GPL-like product vendors would stop trying to hide behind the "free software" label.  It's akin to bait-and-switch.  


You have just read "Open Source Licensing" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

PerformanceCollector and the Heisenberg Uncertainty Group

This is the last post in my series on my PerformanceCollector tool.  This post is a warning, with an anecdote, that you can over-monitor your server.  By default PerformanceCollector (PC) monitors quite a bit of information, but I've had the utility running on high-throughput production OLTP workloads and could not discern any negative performance impact from running it with the default configuration.  It has all of the necessary fail-safes in place (I hope) to ensure it doesn't run off the rails.  For instance, there is a configuration setting that will cap the size of the database and auto-purge data if needed.  

The Heisenberg Uncertainty Principle may eventually come into play, however, and you should be aware of that.  If you are not familiar, the HUP is popularly (if loosely) invoked to mean that the more you monitor something, the more the act of monitoring will adversely impact the system.  Strictly speaking that is the "observer effect" rather than Heisenberg's principle, but the name has stuck.  

If you find that PC is a bit too heavyweight for your environment then you can disable some of the AddIns or extend the interval between runs.  Or you can tweak the code to fit your needs.  The possibilities are endless.  In older versions of the tool I ran into some cases where the system was under such stress that PC could not accurately log the waiting and blocking of the system.  That kinda renders the tool useless when you are trying to diagnose severe performance problems.  But I haven't had that happen in a few years now so either my code is getting better, or M$ has improved the performance of their DMVs to work under harsh environments.  I assume the latter.  

But you can take performance monitoring too far.  Here's a quick story to illustrate the fact.  

I consulted at a company that had deployed 2 third party performance monitoring tools that shall remain nameless.  They worked OK but did not capture the wait and subsystem information like PC does, so the client decided to deploy my scripts as well.  

There were now 3 tools running.  

PC only collects waits and blocks every 15 seconds and it's likely it won't capture EVERY "performance event" that occurs during that interval.  But again, I'm worried about HUP.  

Primum non nocere.

Like any good doctor, "first, do no harm."  

Frankly, users are generally tolerant of an occasional performance blip, but anything over 15 seconds and rest assured some user will open a help desk ticket.  That is why the monitoring interval is 15 seconds.  But this interval is a bit like Prunes Analysis (when taking prunes...are 3 enough or are 4 too many?).  You can certainly make the case that every 10 seconds is better, or that every minute is optimal.  Decisions like this are very experiential.  

This particular client decided that with a 15 second interval too much valuable information was not being logged.  They decided that if a block or wait occurred, the 15 second monitoring interval would shrink to every second to get better metrics.  This required one of their developers to change the WHILE 1=1 loop in PC.  If, after 60 seconds of monitoring at the one-second interval, no blocking/waiting was observed, then the code reverted back to every 15 seconds.  
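For illustration only, the shape of their change looked something like the sketch below.  This is not the actual PerformanceCollector source and the snapshot table name is invented; it is just a stripped-down adaptive polling loop so you can see the problem coming:

    -- Simplified approximation; not the real PerformanceCollector code.
    DECLARE @Delay varchar(8) = '00:00:15';    -- normal interval

    WHILE 1 = 1
    BEGIN
        -- log current waiters and blockers (the real tool captures far more)
        INSERT dbo.WaitSnapshot (CaptureUtc, session_id, wait_type, wait_duration_ms, blocking_session_id)
        SELECT SYSUTCDATETIME(), session_id, wait_type, wait_duration_ms, blocking_session_id
        FROM   sys.dm_os_waiting_tasks
        WHERE  session_id > 50;

        -- the client's tweak: any observed waiting drops the loop to 1-second polling
        -- (their change only reverted to 15 seconds after 60 quiet seconds)
        IF @@ROWCOUNT > 0
            SET @Delay = '00:00:01';
        ELSE
            SET @Delay = '00:00:15';

        WAITFOR DELAY @Delay;
    END;

On a busy OLTP system that insert almost always captures something, so in practice a loop like this never leaves one-second mode.  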

IMHO, that's far too much logging.  It is quite normal for a busy system to constantly experience some variant of waiting.  That meant this new code was ALWAYS running in one-second interval mode.  This hyper-logging mode was implemented after I left the client, and they did it without consulting me, or really anyone who understands even a modicum of SQL Server performance monitoring.  

It didn't take long until performance of the system was worse than ever.  At one point the instance was crashing every few days due to tempdb log space issues.  Further, PerformanceCollector did not have all of the fail-safes at the time and the db was about 1TB in size, even with 30-day purging in place.  Another performance consultant was brought in (I was out-of-pocket) and he quickly determined that PC was the biggest consumer of tempdb and IO resources on the instance.  I looked thoroughly stupid when I was called back in to the client to explain myself.  I quickly determined that my code was not at fault; rather, the change to shorten the monitoring interval and increase the frequency of monitoring was the cause.  

Without finger-pointing I explained that we should put my PC code back to the defaults and re-evaluate why hyper-logging was needed at all.  The other consultant agreed and within a day or so everyone was calm and happy again.  Mistakes like this happen.  

As I was wrapping up this engagement I finally asked, "So, have you identified and corrected many issues that PC has found?  Clearly if you thought hyper-logging was needed you must've corrected quite a few performance problems with the tool."

The DBA responded, pensively, "Well, we keep the trend data but we haven't opened up any tickets to the dev team to fix specific stored procs yet.  We don't have any budget for a DBA to look at the missing indexes either.  But maybe we'll get to do all of that some day."  

The devs, architects, and I choked back the laughter.  All of that monitoring and nothing to show for it.  To this day the other architects at that client call the DBAs the HUGs (Heisenberg Uncertainty Group) behind their backs.  I have since returned to that client for other engagements and I have witnessed this for myself.  Those of us in the know chuckle at the joke.  But the lead HUG didn't get the joke apparently…he got angry once and asked not to be compared with Walter White. 

Moral of the story: monitor as much as you need to, and nothing more.  If you don't have the budget or resources to fix issues you find during your monitoring then it is best to consider NOT monitoring.  No need to make your performance worse with too much monitoring.  That will keep Heisenberg from rolling over in his grave.  


You have just read PerformanceCollector and the Heisenberg Uncertainty Group on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

Data Quality

I was consulting with a healthcare IT vendor and one of their attorneys told me a story about a woman in labor who showed up at the hospital without her identity card (driver's license).  The hospital receptionist pulled up her data on their admissions system and turned her away because, according to their records, she had already had the baby six months earlier.  Clearly this was a miscarriage (pun intended?) of justice/insurance/ObamaCare.  Turns out someone with some common sense stepped in and admitted her anyway.  I assume baby and mom were none the worse for wear.  Turns out her identity had been stolen and apparently some other pregnant woman had a baby using her insurance/identity.  
 
Bizarre?  Not as bizarre as you may think if you've ever worked in healthcare IT.  Healthcare data quality is notoriously horrendous and, surprisingly, there are often some very good reasons for that.  
 
My Story Is Almost as Funny/Disturbing
(names, locations, and events have been changed to protect the guilty.  Otherwise, this is a true story...you can't make this stuff up!!!)
 
I worked on a project a long time ago to remove duplicate patients/medical records in an FDA-regulated medical "device" used at hospitals in a large metropolitan area.  Each hospital/acute care facility/clinic/nursing home/etc had its own "device".  And each "device" had its own little database of patient information.  In the biz we call this the "medical record."  
 
This large metropolitan area has its own unique socialized medicine system that evolved from what was once a bunch of independent, mostly religious-affiliated hospitals (Presbyterian and Jewish hospitals, for instance).  The oldest hospital in the system is close to 300 years old.  As you can imagine, after that many years of mergers, acquisitions, and consolidations there were a lot of duplicate patient records in each of these little device databases.  This causes problems, from both clinical and billing perspectives.  If PatientA gets a prescription at Hospital X and then goes to Hospital Y, which has no knowledge of the prescription, possible drug interactions may occur.  You can imagine, I'm sure, what billing issues would arise from this as well.  Wouldn't it be nice if we could have a "best-of-breed" billing system that alerted each facility/device database/medical record of updates?  
 
Anyway, I was tasked with merging patient records.  I'm a firm believer in agile, so I wanted to deliver something quickly that met a subset of the requirements so that user-acceptance testing could begin.  There was no magic with this project.  Using a simple algorithm I was able to merge patient records for "John A Smith" and "John Smith Jr" if they had 11 data points that were similar other than name.  Data points might be SSN, address, DOB.  The hard part was merging the records together which, unfortunately, wasn't merely a case of changing MedicalRecordID 1234 to 6789.  (Bad normalization at play.)  Basically, pretty simple requirements, pretty simple code.  
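To give a flavor of the matching, here is a hedged sketch with invented table and column names.  It is not the FDA-validated algorithm and it checks only three data points instead of eleven, but it shows the basic shape of finding candidate pairs that agree on everything except the name:

    -- Illustrative only: candidate duplicate pairs that agree on SSN, DOB,
    -- and street address even though the recorded names differ slightly.
    SELECT a.MedicalRecordID AS KeepID,
           b.MedicalRecordID AS MergeID
    FROM   dbo.Patient AS a
    JOIN   dbo.Patient AS b
           ON  a.MedicalRecordID < b.MedicalRecordID    -- report each pair once
           AND a.SSN          = b.SSN
           AND a.DateOfBirth  = b.DateOfBirth
           AND a.AddressLine1 = b.AddressLine1;

The real rules used 11 data points and similarity rather than exact equality, but candidate generation was the easy part; repointing every child row to the surviving MedicalRecordID was the hard part.  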
 
Within a couple of sprints I was done, QA signed off, and the "customer" was thrilled.  The rollout went smoothly.  Since we had to merge MILLIONS of medical records the code took a "creepy/crawly" approach...it built some cursors (yes, there are good uses for cursors) and slowly started migrating data, ensuring we had no possibility of deadlocking or performance issues, migrating "active" data first and worrying about "archived" (read: dead people) data later.  We could never do this during downtime (no such thing in the clinical healthcare space).  With smaller data sets the "devices" could all now communicate with each other and share patient records in real time.  Performance was better, data quality was better, we even reclaimed disk space.  
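Whether you do the crawl with cursors, as the original code did, or with a batched UPDATE loop like the sketch below (table names invented, and far simpler than the real merge), the idea is the same: nibble at the backlog in small transactions so a system with no downtime window never feels it:

    -- Illustrative sketch, not the production code: repoint child rows to the
    -- surviving medical record in small batches to avoid long blocking transactions.
    DECLARE @rows int = 1;

    WHILE @rows > 0
    BEGIN
        UPDATE TOP (500) e
        SET    e.MedicalRecordID = m.KeepID
        FROM   dbo.Encounter AS e
        JOIN   dbo.MergeMap  AS m
               ON e.MedicalRecordID = m.MergeID;

        SET @rows = @@ROWCOUNT;

        WAITFOR DELAY '00:00:01';    -- let the OLTP workload breathe between batches
    END;
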
 
So, where's the problem?
I arrived at work one day a month later and was told by a frantic VP that I needed to immediately initiate a rollback of the code.  He was out-of-breath and clearly frazzled, saying my code was faulty, etc etc, all the while gasping for air.  I tried to remain calm, "What's wrong? Is it something we can fix before we take drastic measures?"
 
"Heroin addicts can't get their methadone!"  
 
I kid you not, that was the exact statement.  "Come again?  Heroin addicts can't get their what?"  My mind was trying to figure out how my change could affect heroin addicts.  
 
The Problem
We did lots of testing so I felt confident in my code.  But at times like this I tend to panic too.  When you deal with FDA regulations you learn not to cut corners with QA.  Some of us developers pay off the DBAs with beer; in healthcare you pay off the QA folks (but not in beer).  You don't just pass off your code to QA, you make sure QA understands what you did and why.  Pairing and scrum are necessities in regulated industries.  
 
Pretty soon conference calls were arranged and I learned about the true nature of the problem.  In the good ol' days before my code was in place, "John A Smith" would register at Hospital A and receive his methadone.  His patient record at Hospital A would be updated to reflect that his next methadone treatment was scheduled in x days.  Along with that he was given a subway token to get back home.  How thoughtful.  
 
Ah, but nothing inspires ingenuity like a drug addict trying to get his next fix.  Entire TV shows are based on this simple premise.  So, instead of going home, Mr Smith would use his subway token to go to Hospital B, where he was also registered, albeit under John Smith Jr and a slightly different address or DOB.  Our Mr Smith would get his allotted methadone again, plus another subway token, which he would use to get to Clinic C.  
 
Rinse and repeat, always repeat.  
 
After a month of my code being in place, with its "creepy/crawly" features, the formerly-over-medicated patients were now experiencing rapid detox.  Perhaps I should take a step back.  Heroin is a dangerous, addictive opiate.  Methadone is a synthetic opioid that helps wean addicts off heroin if dosed carefully.  But when taken in excess it is just as addictive as heroin and can lead to overdose.  (I hope you learned something about drug culture in this blog post...you can now wow your friends with your new street cred.)  
 
Now, I'm a very sheltered guy and it came as a complete surprise to me that rapidly detoxing drug addicts aren't a good thing to have running around a large metropolis.  Much better to have them over-medicated than under, I suppose.  
 
No one, from the customer (read: the government politicians who run this city's socialized medicine), to the BAs, to the QA people, to me, realized that de-duplicating data like this would be a problem...until it was a problem.  Only the hospital staff knew this would be a problem, having experienced these issues for years...but no one bothered to ask those users their opinion on a seemingly benign system change like purging duplicate medical records.  
 
In fact, months later we learned that it was a conscious decision by some of these clinics to ensure that patients were in their systems numerous times.  The system would limit the methadone dosing for patients that clearly needed more.  To override a dosing schedule would've put the prescriber on a "red list".  The easiest way to get a patient a bigger dose was to make a duplicate of the patient.  Once hospital staff had worked with our software for years, they knew the idiosyncrasies that would allow them to enter a duplicate patient record by, say, altering an address from "1357 W 98th St" to "1357 W 98th Av".  
 
In Summary
A reasonable person might give me plaudits for my beautiful code that stopped this fraud...even if it was by accident.  But we're dealing with the irrational.  The politicians were livid because, even though they now had more accurate medical records and properly-medicated patients, they had a whole bunch of wild heroin addicts who couldn't get their overdose of methadone.  I had the change backed out in 48 hours and we eventually coded a system that eliminated duplicate patients from the devices...unless they were on methadone.  Those dups stayed.  
 
So, basically, we turned a blind eye to data quality...not to mention proper health care.  
 
Again, this story is modified to protect the guilty.  I assure you the actual circumstances were just as ridiculous.
 

You have just read "Data Quality" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

The Sunk Cost Trap

"We want you to float on Project Automation and get back to us in a week with your thoughts.  We are a little concerned that this project has cost us $14 million and 14 months.  It is scheduled for code freeze in 4 months and we have yet to see a working proof-of-concept."  The worry was audible in the VP's voice.  
 
And this is how it starts.  I both relish and fear these moments.  On one hand I love coming in and saving a failing project, on the other hand, you never win and friends doing it.  You see, there was a customer perception that our enterprise software required too much manual intervention and Project Automation (PA) was meant to relieve some of that burden.  It was a contractual demand from our largest customers.  We either succeeded or we went out of business.  
 
I quickly began spending my days in the PA team room, at first absorbing what I was hearing, then asking questions, then offering suggestions.  When you get to that last stage the reticence is noticeable from your co-workers.  Within two weeks it was clear to me that most of the team viewed this as a death march project.  The team knew key aspects needed to change if it was to be a success but everyone was hesitant to make those changes knowing that there was only 4 months left.  Left alone, inertia guaranteed this project would fail.  
 
Within three weeks I had a working list of features that, as implemented, would NOT meet customer requirements regardless of time or money spent at this point.  The approach was wrong.  I also worked out a better approach that would meet the requirements, albeit scaled back for Version 1, and could be coded-up and unit tested within the remaining 3 months.  Granted, this would require some serious hunkering down and probably some OT, but we would show success.  
 
It took another week until I could convince the entire team that the only way they could show success was to ignore the sunk costs and change direction.  Within that week I had a somewhat-working proof-of-concept that I could demo and the team was won over.  It took another week to convince management to ignore the sunk cost and change direction.  Management was not happy about time and money being flushed away, but it was the only way to show success.  
 
We delivered Project Automation about a month late and with scaled-back requirements.  In the enterprise software development space, that's an unqualified success!  And we worked almost no overtime (I strongly believe in work/life balance).  By the time the release was GA'd we already had a service pack to address the missing requirements and even added a few new features that we realized were lacking after usability testing.  Customers were thrilled.  
"We want you to float on 'Project Automation' and get back to us in a week with your thoughts.  We are a little concerned that this project has cost us $14 million, 140 man-years, and 14 months of time.  It is scheduled for code freeze in 4 months and we have yet to see a working proof-of-concept."  If 'worry' has a sound it was audible in the VP's voice that day.  
 
And this is how it starts.  I both relish and fear these moments.  The moment that I am asked to work on an impossible project.  On one hand I love coming in late and saving a failing project, on the other hand, I never win friends doing it.  And I've failed at one of these opportunities.  That's not braggadocio.  There were times I went days without sleep to get to success.  It's just my nature, and I'm neither the only one nor the exception to the rule.  
 
Project Automation (PA) was critical to the future of our software.  You see, there was a customer perception that our enterprise software required too much manual intervention and PA was meant to relieve some of that burden.  It was a contractual demand from our largest customers.  We either succeeded at PA or we went out-of-business.  The details of what PA did are irrelevant to this story.  Just understand that it was critical.  
 
I wasn't the first person asked to help out with PA.  Like I said, there are lots of people far better than me.  For months management tried to get some control.  They tried the obvious things...shorter sprints, a move from scrum to kanban, enforced pair programming, more unit testing, less unit testing.  Nothing worked.  Those obvious things, while often helpful, will not turn around a failing project.  I know that, management sometimes does not.  
 
I quickly began spending my days in the PA team room, at first only listening.  Soon I began asking questions, then later offering suggestions.  When you get to that last stage the reticence is noticeable from your co-workers.  Within two weeks it was clear to me that most of the team viewed this as a death march project.  The team knew key components needed to change if it was to be a success but everyone was hesitant to make those changes knowing that there were only 4 months left.  Left alone, inertia guaranteed this project would fail.  
 
Within three weeks I had a working list of features that, as implemented, would NOT meet customer requirements regardless of time or money.  The approach was wrong.  I worked out a better approach that would meet the requirements, albeit scaled back, and could be coded-up and unit tested within the remaining 3 months.  Granted, this would require some serious hunkering down, but we would show success.  
 
It took another week until I could convince the entire team that the only way they could show success was to ignore the sunk costs and change direction.  Within that week I had a somewhat-working proof-of-concept that I demo'd and the team was won over.  It took another week to convince management to ignore the sunk cost and change direction.  Management was not happy about time and money being flushed away, but it was the only way to show success.  More out of desperation for their failing business than faith in my ideas, they acquiesced.  The PA team would change direction and do it my way.  
 
We delivered Project Automation about a month late and with scaled-back requirements.  In the enterprise software development space, that's an unqualified success!  And we worked almost no overtime (I strongly believe in work/life balance...and it was NHL playoff time too).  By the time the release was GA'd we already had a service pack to address the missing requirements and even added a few new features that we realized were lacking after usability testing.  Customers were thrilled.  We remain(ed) in business.  

Why did this project succeed?  It wasn't because I'm some kind of architect-superstar.  We know that's not true.  It was a success solely because I could convince others to ignore the sunk costs, especially so late in the development cycle, and switch focus to something else that would work instead.   

The Sunk Cost Trap

I feel I'm constantly explaining the sunk cost trap to project and product managers.  Recognizing a sunk cost trap when you see it in your software projects can save your company.  

A sunk cost in accounting parlance is a cost that has already been incurred and can't be recovered.  A sunk cost is not necessarily bad if it is a cost that can yield a positive return in the future.  The sunk cost trap arises when people cannot see that an investment will not ever show a return, yet they continue to waste resources pursuing it.  

There are lots of pithy aphorisms that deal with the sunk cost trap that you might've heard:  

  • "When you first find yourself in a hole, stop digging."
  • "Your first loss is your best loss."  
  • "Don't throw good money after bad."  
  • "Let bygones be bygones."  

Each of these aphorisms is conveying the message, more or less, not to fall for the sunk cost trap.  Yet people fall for this fallacy all the time.  We have the mistaken belief...it is really very irrational...that we've already committed a lot of resources to a given endeavor and we are past the point of no return.  We must continue with the chosen path and see it through.  The sunk cost trap, like all fallacies, can cost you and your company a lot of money when not identified.  

It is important as a software architect to not be a victim of fallacies.  This may even be the most important part of a software architect's job description.  

In your everyday life you probably fall for the sunk cost trap more than you realize.  Have you ever finished watching a two-hour movie that you were not enjoying after the first half hour?  You already had thirty minutes invested in it right?  It might get better so you keep watching.  Or you've finished a terrible dinner at a posh restaurant because you aren't the type to complain and walk out without paying.  If you're going to pay anyway, why not finish the meal?  Or you hold onto a bad stock in your portfolio because it is down too much to sell now, even though you'd never buy more of it today.  

Try to recognize fallacious thinking.  If you suddenly realize that a bad decision was made and it is preventing you from attaining your goals, and a better alternative exists, then you need to ensure that you change.  The truly good architects will do this...the mediocre architects will focus on the sunk costs and will try, come hell or high water, to make a subpar solution work at any cost.  

Prevention and Cure

Here are some tricks that I use to prevent sunk cost traps in my projects and overcome common objections when they do arise:   

  • "Fail early instead of fail late".  I preach this.  Never embark on a 6 month (or longer) project.  Instead, pare down the scope until you can have a proof-of-concept (POC) deliverable in a few sprints.  In your sprint retrospectives you should be evaluating whether or not the POC is working and if there are any alternatives that may be better.  POCs are really just "science projects" but many people are not willing to deviate after a few sprints due to the sunk cost trap.  I just finished a very lengthy blog series on various POCs I did using NoSQL solutions.  As soon as we found an issue that caused a given product to be sub-optimal for our needs we dropped that POC immediately.  We did not continue to throw good money at it just because we already wrote a few hundred lines of code.  I dropped a couple NoSQL POCs after just a few hours because of fundamental flaws I saw in the products given our requirements.  
  • Understand lost opportunity costs.  By continuing to invest in something that is sure to fail you lose TWICE.  You are flushing money down the toilet on the existing solution and you can't properly invest in an alternative that is more promising (that is the "lost opportunity cost").  Management fails to understand opportunity costs constantly.  For instance, senior, strategy-minded architects are forced to perform support tasks because there is a shortage of support personnel.  The opportunity cost is the senior architect cannot think about how to relieve the technical debt that causes inflated support costs in the first place.  
  • "This was a learning cost, albeit an expensive one".  At a minimum, make your mistakes educational so they won't happen again.  

For years after Project Automation I still had managers griping about the $14 million "learning cost" we spent before we changed direction.  This is entirely the wrong thing to focus on.  It isn't that we wasted $14 million, it is that we are still in business and are profitable.  I can assure you that we never had another fiasco like PA again.  Of course we made sure we had better project governance, but most importantly we recognized when the sunk cost trap was causing us to not do the right things.  

NoSQL Evaluation Summary Post

Over the last few months I've been summarizing my experiences with NoSQL products.  I have been evaluating these products as alternatives to SQL Server to save some licensing costs for one of my clients.  I tried to be as objective as possible as I built my proofs of concept on these products.  In every case I'm happy to say that the POCs were a success, but I believe my client will be staying with SQL Server for now.  Some of these POCs and posts I have been working on for over a year.  

Here are all of the posts in the series:

Will MS Licensing Drive Customers to NoSQL?:  A summary of why I'm doing these POCs and what I hope to accomplish.  

Are some new SQL Server features a response to the NoSQL movement?:  Covers some interesting new features of recent SQL Server releases that appear to be directly challenging some of the features of various NoSQL solutions. 

What exactly is wrong with SQL and RDBMSs?:  What makes these products so enticing to developers and management?

Tutorial D, D, Rel and Their Relation to NoSQL:  What is driving the NoSQL movement...is it that SQL is just too darn hard to use and not really expressive enough for relational problems?  I explore some shortcomings of SQL in this post.  

Data Models and Data Organization Methods:  compares how data is persisted and modeled in these non-relational models.  Don't try to shoehorn a relational design into a NoSQL solution.  

Eventual Consistency or ACIDs and BASEs:  You have to give up ACID properties (some of them anyway) when you adopt a NoSQL solution.  In their place you get BASE properties.  I cover those in this post.  

Handling Conflicts with Eventual Consistency and Distributed Systems:  One of the key tenets of BASE is "eventual consistency".  This freaks out ACID people.  In this post I'll cover why this isn't so scary if you understand how it works under the covers, along with some limitations.  

The CAP Theorem:  Consistency, Availability, and Partition Tolerance...you can have 2 of these 3 with any data management platform...RDBMS included.  We'll go over which of these you may be able to sacrifice and then which NoSQL products will support your requirements.  

Sharding:  Sharding is one of the key methods NoSQL products use to handle CAP and BASE issues.  Sharding is more than horizontal partitioning across nodes.  I cover sharding in this post.  

SAP HANA Evaluation:  This is the first POC I did and it was wildly successful.  The SQL syntax is much like TSQL and it really is blazing fast.  I'll show you how to setup your own HANA evaluation instance in the cloud.  

Graph Datastores:  covers what a graph datastore is and when you might opt to use one.  

Hekaton:  briefly goes over this new feature in SQL 2014 and how it is good for some NoSQL applications.  

Windows Azure Table Service:  this is a schema-less key-value store that Microsoft offers in the cloud.  

MapReduce for the RDBMS Guy:  I go over the minimum that every good relational data architect needs to know about MapReduce.  

Parallel Data Warehouse as a NoSQL Alternative:  this is another specialized offering from Microsoft that is very NoSQL-like.  It essentially gives you parallelized query execution on multiple nodes.  

MongoDB and CouchDB:  this briefly covers my evaluation of schema-less document stores.  

HBase: is a step-up from Hadoop, IMO, for cases where you expect a lot of random reads.  

Querying NoSQL with Hive and Pig:  I found HiveQL to be very easy to learn if you understand SQL.  This makes MapReduce SO MUCH EASIER to use when you need to get up to speed quickly.  

More NoSQL Solutions: is a brief write-up on some other products we evaluated.  We found quickly that they did not meet our requirements so we did not pursue them in earnest, but it's good to know they are available.  

Apache Flume:  is how you "stream" data into your Hadoop instance.  This is also the easiest way to load test data for POCs.  

MySQL and PostgreSQL as NoSQL alternatives:  if you are just looking to save on licensing fees you may want to consider an open source alternative to your RDBMS vendor.  Or, consider forking the source code and making these alternatives fit your exact requirements...if you are THAT bold.  

On the Future of NoSQL:  I make some predictions about the bright future for NoSQL by briefly looking at the history of data management that led us to the ubiquity of relational data management and SQL.  

MySQL and PostgreSQL as NoSQL alternatives

This is the next post in my NoSQL evaluation series.  I was tasked with finding alternative data persistence mechanisms for a customer because of the perceived cost of SQL Server.  The customer eventually decided to stick with SQL Server, for now, but initially they were very intrigued with porting part of the application to a NoSQL solution.  The biggest problem I had with this is that there was no in-house NoSQL expertise.  My fear was that our developers would make a lot of rookie mistakes that would cost us money and time.  
 
I agreed with the initial premise that SQL Server was too expensive but steered the conversation toward cheaper RDBMS alternatives.  At least our people would have less of a learning curve.  Naturally MySQL and PostgreSQL were at the top of the list.  Furthermore, both products are open source, so we could extend either one into a hybrid, customized "PartialSQL" solution.  
 
It wasn't terribly difficult for me, with limited experience with either product, to put together a proof-of-concept for a small portion of our application.  Both products are relational (or at least "set-based").  Both can use Navicat as a development/DBA GUI.  Navicat is my preferred GUI tool when I'm a bit stumped on esoteric SQL extensions in either of these products.  
 
Both support triggers.  
 
Both support variants of stored procedures.  In fact, I'd argue that MySQL supports stored procedures closer to the SQL:2003 syntax than even SQL Server does.  If you need to migrate stored procedures then MySQL will be easier.  If you are familiar with Oracle then Postgres will be a little easier.  Its PL/pgSQL is very close to PL/SQL and follows most of the same conventions.  For instance, in PL/SQL you can't return a result set directly from a stored procedure; instead you build a cursor over the result set and then pull from that.  Postgres calls its stored procedures "stored functions" but they are functionally equivalent.  
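Here's a minimal sketch of that cursor pattern in PL/pgSQL.  The table and column names are made up for illustration, and keep in mind PL/pgSQL can also return sets directly (RETURNS TABLE or SETOF), so the cursor approach is just the one that will feel familiar to PL/SQL folks:

    -- Hypothetical example: a Postgres stored function that hands back a named cursor,
    -- much like the PL/SQL convention described above.
    CREATE FUNCTION get_recent_orders(p_days integer)
    RETURNS refcursor AS $$
    DECLARE
        c refcursor := 'recent_orders_cur';   -- name the cursor so the caller can FETCH from it
    BEGIN
        OPEN c FOR
            SELECT order_id, order_date, total_due
            FROM   orders
            WHERE  order_date >= current_date - p_days;
        RETURN c;
    END;
    $$ LANGUAGE plpgsql;

    -- The caller fetches from the cursor inside a transaction so it stays open:
    BEGIN;
    SELECT get_recent_orders(30);
    FETCH ALL FROM recent_orders_cur;
    COMMIT;

In SQL Server the equivalent procedure would just end with a SELECT and return the result set directly, which is exactly the migration difference I'm calling out.  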
 
I'm fascinated by history, and the history of PostgreSQL (or just Postgres) is kinda fascinating.  When System R, the IBM project that was the first RDBMS to use SQL, was being developed there were a bunch of research papers that went along with it.  These needed to be searched and cataloged.  Instead of using System R to do this, a bunch of researchers created Ingres (INteractive Graphics REtrieval System).  This was open-sourced and gained quite a bit of traction.  Postgres is "post-Ingres", or the next evolution of Ingres.  Great story to tell at IT cocktail parties.  
 
Either PostgreSQL or MySQL makes a great SQL Server alternative if you are price-conscious.  I find both to be very quick to set up and learn.  

Apache Flume

During my evaluation of NoSQL solutions the biggest hurdle I had, by far, was loading data into Hadoop.  The easiest way I found to do this was with Apache Flume.  But it took me a long time to figure out that my approach to data loading was wrong.  As an RDBMS data architect I was biased toward the techniques I would use to load a SQL Server.  I tried scripting first (which is how most of the NoSQL solutions have you load their sample data, but which tends not to work well with your "real" data), but the learning curve was too high for a proof-of-concept.  I then tried ETL tools like Informatica and had better success, but it was still too cumbersome.  

I began thinking like a NoSQL guy and decided to use HiveQL.  I had better success getting the data into Hadoop...but now I had to get the data out of my RDBMS in a NoSQL-friendly format that I could quickly run HiveQL against.  
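As a rough illustration of what that looks like, here is a minimal HiveQL sketch.  The file path, layout, and column names are hypothetical:

    -- Hypothetical HiveQL: expose a tab-delimited extract from the RDBMS as a Hive table...
    CREATE EXTERNAL TABLE sales_extract (
        order_id    INT,
        order_date  STRING,
        total_due   DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/staging/sales_extract/';

    -- ...then query it with familiar SQL syntax; Hive compiles this into MapReduce jobs.
    SELECT order_date, SUM(total_due)
    FROM   sales_extract
    GROUP  BY order_date;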

As my journey continued I thought about how we would get data into Hadoop if we ever deployed it as a real RDBMS complement.  We would probably write something that "listened" for "interesting" data on the production system and then put it into Hadoop.  That listener is Apache Flume.  Why not just point Flume at our Service Broker events and scrape the data that way?  I had that up and running in a few hours.  

Flume works by having an agent running in a JVM listen for events (such as Service Broker messages relayed to a JMS system).  A "source" receives the events and queues them into a "channel", and a "sink" drains the channel and writes the events out to Hadoop/HDFS/HBase (or whatever you use for persistence).  So, why use a channel in the middle?  Flexibility and asynchronicity.  The channels have disaster recovery mechanisms built in.  As your throughput needs change you can configure your agents to fan-in (more agents talk to fewer channels) or fan-out (fewer agents talk to more channels).  The former is great if you have, for instance, multiple listeners that only need to talk to one Hadoop instance.  The latter is good if you have one source system that needs to talk to multiple Hadoop instances, or if you need to route messages to a standby channel while you do maintenance on your Hadoop instance.  
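To make the moving parts concrete, here is a minimal, hypothetical agent configuration.  I'm using a spooling-directory source as a stand-in for a real listener, a memory channel, and an HDFS sink; all of the names and paths are made up:

    # Hypothetical Flume agent: one source, one channel, one sink
    agent1.sources  = src1
    agent1.channels = ch1
    agent1.sinks    = snk1

    # Source: watch a directory for new files (standing in for a real event listener)
    agent1.sources.src1.type     = spooldir
    agent1.sources.src1.spoolDir = /var/flume/incoming
    agent1.sources.src1.channels = ch1

    # Channel: buffer events in memory (a file channel adds durability)
    agent1.channels.ch1.type     = memory
    agent1.channels.ch1.capacity = 10000

    # Sink: drain the channel and write the events to HDFS
    agent1.sinks.snk1.type       = hdfs
    agent1.sinks.snk1.hdfs.path  = hdfs://namenode:8020/flume/events
    agent1.sinks.snk1.channel    = ch1

You then start the agent with the flume-ng command, pointing it at this file and the agent name (agent1 here).  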

This means that Flume is very scalable and can handle constant data streams effectively without worry of data loss.  This is great for something like streaming stock quotes.  Flume is by far the easiest way to load data quickly into Hadoop.  

More NoSQL Solutions

In this post in my NoSQL Series I wanted to briefly cover a few remaining NoSQL options.  I tested these briefly but did not try to tackle a full proof-of-concept with any of these due to time.  

Memcached is a giant associative in-memory array that can be scaled out and sharded across multiple nodes.  I've used this in the past as a caching mechanism for websites I've created, with great success.  We didn't feel a purely in-memory store was a good option for our requirements.  

Cassandra was originally an internal Facebook project.  It too is a distributed key/value store, with a data model similar to BigTable.  Balanced sharding is native.  Querying is via Hadoop.  It is optimized for writes over reads, which is a big difference from other distributed key stores.  

Redis is architecturally similar to Memcached.  I use it extensively for smaller web projects.  It runs entirely in RAM but is persisted to disk.  My only complaint is that you get really excellent performance one minute and horrendous performance the next; there is no gradual degradation.  This happens when your data set can no longer fit in available RAM.  When I notice it happening I restart Redis, which unfortunately blows away my cache, and everything hums again.  Sharding is your responsibility if you choose to scale out.
