DaveWentzel.com            All Things Data


Storm Notes

Need to get up-to-speed quickly on the basics of Apache Storm?  This quick post will give you everything you need to know to speak somewhat competently about Storm.  You'll learn the general architecture, terms, proper use cases, and basic administration and performance management.  

Couchbase Notes

If you have 10 minutes this blog post will give you a quick overview of what Couchbase is, how it works, and some gotchas.  Couchbase is a document-oriented key-value store that is deliberately simple.  But it's also rock solid.  If you are evaluating document stores and you need to come up to speed quickly on Couchbase then this post is for you.  


Kafka Notes

I'm learning Apache Kafka, a wicked-fast asynchronous messaging system that is generally used as a source spout for Apache Storm.  In Hadoop, the loading/streaming data patterns generally use Kafka by producing a message to a Kafka "topic" and then consuming it later.  This post is just my notes on how to administer and troubleshoot Apache Kafka.  

Polyglot Database Development with 0xDBE

This is an unsolicited review of 0xDBE, a universal SQL IDE by JetBrains, and a product I really love.  Universal SQL IDEs are meant to make your life easier when developing code for multiple DBMSs.  Instead of learning the nuances of multiple DBMS IDEs you can spend your time learning the nuances of each vendor's SQL dialect.  The problem is that universal SQL IDEs tend to be either pricey or buggy.  0xDBE is polished and has almost no learning curve if you are familiar with IntelliJ.   Try it while it's still free.  


HP Vertica

I've been using HP Vertica since before HP bought the company.  I first wrote about Vertica in 2010 when I covered column-oriented databases.  I'm again using Vertica in my daily responsibilities and I decided to try to get my Vertica certification.  I'm not a Vertica expert, but I find that taking certification exams helps you understand your shortcomings with a product.  For instance, I passed my SNIA (SAN) certification on the first try with a perfect score.  However, this was just dumb luck.  I guessed at every question on NAS and tape storage, and the right answers were fairly obvious.  Since I'm not a full-time SAN guy (I'm a data professional) I don't have much need for tape and NAS, so I really didn't care to learn more about those topics.  But it was interesting learning what SNIA feels is important in the storage world.  

In the process of getting my Vertica certification I thought it would be wise to put together a blog series on Vertica for anyone else who wants to learn this technology rapidly in a hands-on fashion.  In these posts I'll cover what Vertica is and how to install it, we'll migrate AdventureWorks to it, and we'll do lots of fun labs along the way.  The posts are geared toward data folks who are familiar with MS SQL Server; I specifically approach learning Vertica by calling out its differences from SQL Server. 

You'll find this is a lot of fun.  Vertica is pretty cool.  

Briefly, what is HP Vertica?

It is a column-oriented, compressed, relational DBMS.  Many people consider it a NoSQL solution, but it does use a dialect of SQL for its manipulations.  Like many distributed NoSQL solutions it is clustered across grids and nodes, with built-in data redundancy (called k-safety), which leads to the typical marketing gimmick of "requiring no DBAs".  I can assure you that it performs QUITE NICELY for read-intensive workloads.  There is also an entire analytics platform that comes with Vertica...roughly equivalent to SSAS.  

What makes Vertica unique is that it persists data grouped by the columns of a tuple (table) instead of by rows, as traditional RDBMS offerings do.  If you think of a table as a spreadsheet, then retrieving a single row is an ISAM-type operation.  This works very well for OLTP applications.  But in some reporting applications you may care more about an entire column than about individual rows.  If your query is something like "give me the total sales for this year", then querying a columnstore will result in far less IO and will run radically faster.  

Even the amount of disk used to store the data will be less.  Columnstores compress much better because like data types are grouped together.  Also, there are generally far more rows than columns, and each row needs "offset" information to store the different datatypes together; you'll need fewer offset markers if you organize your storage by column.  TANSTAAFL (there ain't no such thing as a free lunch), as economists say...the cost is that an update to an existing row requires one write per column in the table.  So a columnstore is probably not best for your OLTP tables.  
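To make the trade-off concrete, here's a quick sketch in Vertica-flavored SQL.  The table and column names are made up for illustration; treat this as a sketch, not a tuned design:

```sql
-- A hypothetical sales table.  In Vertica every table is stored
-- column-oriented, so no special index syntax is needed.
CREATE TABLE sales (
    sale_id    INT,
    sale_date  DATE,
    region     VARCHAR(20),
    amount     NUMERIC(12,2)
);

-- "Total sales for this year": the engine only has to read the
-- sale_date and amount columns from disk, skipping sale_id and
-- region entirely.
SELECT SUM(amount)
FROM   sales
WHERE  sale_date >= '2014-01-01';

-- The TANSTAAFL part: even this single-row update touches every
-- column's storage, which is why columnstores fit OLTP poorly.
UPDATE sales SET region = 'East' WHERE sale_id = 42;
```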

That's a quick overview and is not all-inclusive.  I'll cover lots more in the next few posts.  

What's it cost?

The Community Edition is free for up to 1TB of data.  And if you consider that compressed columnstores generally take up a fraction of the space of a traditional row-oriented DBMS...that's actually a lot of data.  

HP Vertica Documentation

Vertica documentation in HTML and PDF format

Certification Process

Certification Exam Prep Guide

Getting Started

You need to register with HP to be able to download the Community Edition.  You can do this at my.vertica.com.  After you register you'll immediately be able to download the CE (or the drivers, Management Console, VMs, AWS documentation, etc).  

In the next post I'll cover how to install Vertica on Ubuntu.  You can of course download a VM instead but I believe the best way to learn a product is to start by actually installing the bits.  

You have just read HP Vertica on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

Is it Pokemon or BigData?


This is pretty funny.  You are given a name, say "Crebase", and you must determine whether that is a Pokemon character or a BigData technology.  

I scored 60%, which means either I don't know my BigData esoterica or I don't know Pokemon.


NoSQL Evaluation Summary Post

Over the last few months I've been summarizing my experiences with NoSQL products.  I have been evaluating these products as alternatives to SQL Server to save some licensing costs for one of my clients.  I tried to be as objective as possible as I've built my proofs of concept on these products.  In every case I'm happy to say that the POCs have been a success but I believe my client will be staying with SQL Server for now.  Some of these POCs and posts I have been working on for over a year.  

Here are all of the posts in the series:

Will MS Licensing Drive Customers to NoSQL?:  A summary of why I'm doing these POCs and what I hope to accomplish.  

Are some new SQL Server features a response to the NoSQL movement?:  Covers some interesting new features of recent SQL Server releases that appear to be directly challenging some of the features of various NoSQL solutions. 

What exactly is wrong with SQL and RDBMSs?:  What makes these products so enticing to developers and management?

Tutorial D, D, Rel and Their Relation to NoSQL:  What is driving the NoSQL movement...is it that SQL is just too darn hard to use and not really expressive enough for relational problems?  I explore some shortcomings of SQL in this post.  

Data Models and Data Organization Methods:  compares how data is persisted and modeled in these non-relational models.  Don't try to shoehorn a relational design into a NoSQL solution.  

Eventual Consistency or ACIDs and BASEs:  You have to give up ACID properties (some of them anyway) when you adopt a NoSQL solution.  In their place you get BASE properties.  I cover those in this post.  

Handling Conflicts with Eventual Consistency and Distributed Systems:  One of the key tenets of BASE is "eventual consistency".  This freaks out ACID people.  In this post I'll cover why this isn't so scary if you understand how it works under the covers, and some limitations.  

The CAP Theorem:  Consistency, Availability, and Partition Tolerance...you can have 2 of these 3 with any data management platform...RDBMS included.  We'll go over which of these you may be able to sacrifice and then which NoSQL products will support your requirements.  

Sharding:  Sharding is one of the key methods used by NoSQL products to handle CAP and BASE issues.  Sharding is more than horizontal partitioning across nodes.  I cover sharding in this post.  

SAP HANA Evaluation:  This is the first POC I did and it was wildly successful.  The SQL syntax is much like TSQL and it really is blazing fast.  I'll show you how to setup your own HANA evaluation instance in the cloud.  

Graph Datastores:  covers what a graph datastore is and when you might opt to use one.  

Hekaton:  briefly goes over this new feature in SQL 2014 and how it is good for some NoSQL applications.  

Windows Azure Table Service:  this is a schema-less key-value store that Microsoft offers in the cloud.  

MapReduce for the RDBMS Guy:  I go over the minimum that every good relational data architect needs to know about MapReduce.  

Parallel Data Warehouse as a NoSQL Alternative:  this is another specialized offering from Microsoft that is very NoSQL-like.  It essentially gives you parallelized query execution on multiple nodes.  

MongoDB and CouchDB:  this briefly covers my evaluation of schema-less document stores.  

HBase:  a step up from Hadoop, IMO, for cases where you expect a lot of random reads.  

Querying NoSQL with Hive and Pig:  I found HiveQL to be very easy to learn if you understand SQL.  This makes MapReduce SO MUCH EASIER to use when you need to get up to speed quickly.  

More NoSQL Solutions: is a brief write-up on some other products we evaluated.  We found quickly that they did not meet our requirements so we did not pursue them in earnest, but it's good to know they are available.  

Apache Flume:  is how you "stream" data into your Hadoop instance.  This is also the easiest way to load test data for POCs.  

MySQL and PostgreSQL as NoSQL alternatives:  if you are just looking to save on licensing fees you may want to consider an open source alternative to your RDBMS vendor.  Or, consider forking the source code and making these alternatives fit your exact requirements...if you are THAT bold.  

On the Future of NoSQL:  I make some predictions about the bright future for NoSQL by briefly looking at the history of data management that led us to the ubiquity of relational data management and SQL.  

On the Future of NoSQL

In this NoSQL post I want to share my predictions for the future of NoSQL.  But to understand my predictions you must understand the history of data management.  This will be REAL brief.  It kinda all starts with Codasyl.  Codasyl (Conference on Data Systems Languages) was a consortium started around 1959 to develop a standard data programming language that could run anywhere and was easy to learn.  The first attempt gave us COBOL (COmmon Business Oriented Language).  Within 10 years COBOL evolved to include DML and DDL languages that, though not relational (more like a network model), worked well.  At some point there was a realization that a specific, higher-level, declarative "data" language was needed.  This was the early 1970s, and IBM, looking for a data language for its relational System R, implemented and championed SQL.  That was a VERY concise history lesson.  Codasyl-style databases are mostly dead today due to the advent of the relational model and SQL, but they do have some niches.  Healthcare (one of my areas of expertise) is one.  In my mind this means that either a) businesses are too cheap to rewrite these apps into something more modern/relational, or b) they just work and relational/RDBMSs maybe aren't the best solution.  

But why did relational managers take off while network data managers died out?  The biggest reason is that most business problems are better modeled relationally.  Next, relational vendors all adopted SQL, while network/hierarchical models had disparate, fragmented languages.  Suddenly skills were cross-vendor-portable in the relational world, and even when vendor SQLs weren't perfectly compatible (even today PL/SQL is definitely different from TSQL), at least the learning curve wasn't very steep (for instance, JOIN syntax differs a little among vendors but is semantically the same).  That cannot be said for hierarchical and network managers.  Each had its own language.  Some looked Algol-ish, others looked like Fortran, etc.  No standards = less adoption.  

The lesson: low-level user interfaces for data manipulation are not a good idea if you want market adoption.  

SQL, for all its faults, is easy to learn.  Not so for network and hierarchy managers.  MUMPS (a hierarchical system that is older than me and still used in healthcare today) has a horrendous language (called M) to learn.  

If you've never written a "query" without a declarative language like SQL on an old-time database manager...well, you're missing out on some real fun.  Imagine writing code to open a "file" on a tape and implementing a "nested loops" construct to join it to another data file.  You might have heard of this before...it's called ISAM (or, later, VSAM, which was a little better), and it is up to you to understand the physical retrieval of the data versus just declaring what you want (as in SQL).  Worrying about physical retrieval of data is fraught with danger.  Lots of bugs if you are a novice and don't understand hashing and indexing.  
These physical data-retrieval issues were eventually identified and rectified.  It didn't take long before IBM realized that a "query optimizer" embedded in the product could determine for the business user how best to tackle a query.  And then SQL, not hand-written ISAM code, got popular.  
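For contrast, the hand-rolled nested-loops program described above collapses into a single declarative statement once an optimizer is in the picture.  The table names here are hypothetical, just to illustrate the point:

```sql
-- The optimizer decides whether to run this as a nested-loops join,
-- a hash join, or an index lookup; the author of the query never
-- touches file handles, tape offsets, or loop constructs.
SELECT c.customer_name, o.order_total
FROM   customers c
JOIN   orders    o ON o.customer_id = c.customer_id;
```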
To repeat...the lesson is that low-level user interfaces for data manipulation are not a good idea if you want market penetration.  Higher-level languages brought relational databases to the masses.  You didn't need to be a programmer to use them.  The same thing will happen with NoSQL.  That's my prediction.  When the process of writing MapReduce jobs is no more complicated than writing a HiveQL statement, then NoSQL products will really take off.  
We seem to be repeating history a bit and re-learning the lessons of the past.  Writing MapReduce seems to me to be equivalent to writing ISAM code.  It's tedious, can perform poorly if your knowledge of how data persistence works isn't good, and is prone to error.  This isn't the first time our industry has retrogressed in data management.  15 years ago XML was being touted as the app that would finally kill the relational database.  Obviously "tag bloat" is probably the biggest reason that never occurred, but another reason is simply the effort and learning curve needed to query XML.  It's getting better, but it's still not easy enough for dumb guys like me.  And look at how many methods exist to query XML...XQuery, XQL, XPath, XSLT.  That's not user-friendly...and user-friendly is how you win business.  
So what NoSQL needs, IMHO, is a standards-based approach to querying that the average business person can use.  Something like HiveQL.  And a lot of the vendors understand this too and are developing really good tools that are close enough to SQL that the average business person will enjoy learning them.  
I did.  
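HiveQL is a good example of what such a language looks like: a query that would otherwise be a hand-written MapReduce job reads almost exactly like ANSI SQL.  The table and columns below are made up for illustration:

```sql
-- HiveQL: compiled under the covers into MapReduce jobs, but the
-- business user only sees familiar SQL constructs.
SELECT   region,
         COUNT(*)    AS order_count,
         SUM(amount) AS total_sales
FROM     orders
GROUP BY region
ORDER BY total_sales DESC;
```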
The problem is there is no standard, so each vendor's query language can be very different.  The common denominator is that they all try to model themselves on SQL.  
I've already written about HiveQL as an example of a declarative language.  Another example is CQL, Cassandra's query language.  I did not get a chance to work with CQL closely, but the DML syntax, at least, is almost exactly like ANSI SQL.  There are of course extensions to the language to handle items unique to Cassandra and NoSQL that you won't see in a relational language, such as BATCH (makes multiple DML statements apply atomically), CREATE KEYSPACE (roughly a database/schema definition), and CONSISTENCY (tunable consistency-level control).  There's the rub...those "statements" that are specific to non-relational products need to be standardized.  A KEYSPACE in one product is a DOCUMENT in another.  
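Since I didn't get to work with CQL hands-on, the sketch below is drawn from the CQL documentation rather than my own testing; the keyspace and table names are hypothetical:

```sql
-- DDL with a Cassandra-specific extension: replication settings
-- live right on the keyspace definition.
CREATE KEYSPACE store
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- DML reads like ANSI SQL...
INSERT INTO store.users (user_id, name) VALUES (1, 'dave');

-- ...with non-relational extensions: BATCH makes multiple DML
-- statements apply atomically.
BEGIN BATCH
  INSERT INTO store.users (user_id, name) VALUES (2, 'alice');
  UPDATE store.users SET name = 'bob' WHERE user_id = 1;
APPLY BATCH;

-- And in the cqlsh shell you can tune the consistency level:
CONSISTENCY QUORUM;
```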

As of late 2011 Couchbase even has a "SQL" dialect originally called UnQL (pronounced "uncle").  They have since renamed it to N1QL (pronounced "nickel").  It is still in beta but shows promise.  Again, we dismissed Couchbase due to requirements, otherwise I think it would've been interesting to learn nickel (and hopefully I wouldn't have to cry "uncle" trying to get it to work).  N1QL also provides a series of DDL commands, just like ANSI SQL.  Couchbase has an excellent tutorial with sample documents and how to query them effectively.  
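To give a flavor of N1QL (again, hedged: the language was still in beta when I looked at it, and we dismissed Couchbase before I could use it in anger, so this is a sketch against a hypothetical bucket of JSON documents):

```sql
-- N1QL: SELECT against a bucket of JSON documents rather than a
-- table of rows.  "contacts" and its fields are made-up names.
SELECT c.name, c.email
FROM   contacts c
WHERE  c.age > 30;
```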

There's a trend here.  More complete, higher-level languages will win more converts to NoSQL.  In fact, if we could somehow get a proposed NoSQL language standard, like the one Codasyl championed, we might see mass adoption.  Writing MapReduce or curl or Java just won't cut it.  
Lastly, unlike most of my RDBMS colleagues, I do believe there is a need for NoSQL products on both a logical and a physical level. On the relational-theory side, there are some things that just can't be modeled well relationally.  Graph databases fill one such void.  Think about the web of connections in your LinkedIn profile.  It's difficult to model these relationships relationally.  This is probably because most modelers (myself included) feel more at home with physical modeling than logical modeling.  The physical model for something like LinkedIn, if done in a purely relational database manager, would be horrible.  In a graph db things are much easier.  Modeling hierarchies and transitive-closure-type relationships is conceptually difficult for most people in a relational model.  Why not just stick with a product that is actually hierarchical?  
Regarding implementation/physical rationales for NoSQL, as we've seen in these posts, there are cases where the physical persistence implementations of the major RDBMS vendors are not optimal.  There are excellent use cases for columnstores, for instance.  Or totally in-memory RDBMSs (SAP HANA).  Or RDBMSs that natively support sharding while hiding the implementation details (Oracle's grid offerings).  If there weren't, we wouldn't see these features being bolted onto the RDBMS vendors' products with every new release.    
I think the future of NoSQL is bright and I'm looking forward to working with these products and better understanding the technologies and theories behind them.  

MySQL and PostgreSQL as NoSQL alternatives

This is the next post in my NoSQL evaluation series.  I was tasked with finding alternative data persistence mechanisms for a customer because of the perceived cost associated with SQL Server.  The customer eventually decided to stick with SQL Server, for now, but initially they were very intrigued with porting part of the application to a NoSQL solution.  The biggest problem I had with this is that there was no in-house NoSQL expertise.  My fear was that our developers would make a lot of rookie mistakes that would cost us money and time.  
I agreed with the initial premise that SQL Server was too expensive but steered the conversation toward cheaper RDBMS alternatives.  At least our people would have less of a learning curve.  Naturally MySQL and PostgreSQL were at the top of the list.  Furthermore, both products are open source and therefore we could extend either to make it a hybrid, customized PartialSQL solution.  
It wasn't terribly difficult for me, with limited experience with either product, to put together a proof-of-concept for a small portion of our application.  Both products are relational (or at least "set-based").  Both can use Navicat as a development/DBA GUI.  Navicat is my preferred GUI tool when I'm a bit stumped on esoteric SQL extensions in either of these products.  
Both support triggers.  
Both support variants of stored procedures.  In fact, I'd argue that MySQL supports stored procedures closer to the SQL:2003 syntax than even SQL Server does.  If you need to migrate stored procedures, MySQL will be easier.  If you are familiar with Oracle, then Postgres will be a little easier: its PL/pgSQL is very close to PL/SQL and follows most of the same conventions.  For instance, in PL/SQL you can't return a result set directly from a stored procedure; instead you build a cursor over a result set and then pull from that.  Postgres calls its stored procedures "stored functions", but they are functionally equivalent.  
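Here's a rough sketch of what a Postgres "stored function" looks like; the table and function names are hypothetical, and a real migration would need more error handling:

```sql
-- A PL/pgSQL stored function that returns a result set directly
-- (RETURNS TABLE), rather than handing back a cursor as PL/SQL does.
CREATE FUNCTION top_customers(min_total NUMERIC)
RETURNS TABLE (customer_id INT, order_total NUMERIC)
AS $$
BEGIN
    RETURN QUERY
    SELECT   o.customer_id, SUM(o.amount)
    FROM     orders o
    GROUP BY o.customer_id
    HAVING   SUM(o.amount) >= min_total;
END;
$$ LANGUAGE plpgsql;

-- Called like a table:
SELECT * FROM top_customers(1000);
```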
I'm fascinated by history, and the history of PostgreSQL (or just Postgres) is kinda fascinating.  When System R, the IBM project that was the first RDBMS to use SQL, was being developed, there were a bunch of research papers that went along with it.  These needed to be searched and cataloged.  Instead of using System R to do this, a bunch of researchers created Ingres (INteractive Graphics REtrieval System).  This was open-sourced and gained quite a bit of traction.  Postgres is "post-Ingres", the next evolution of Ingres.  Great story to tell at IT cocktail parties.  
Either PostgreSQL or MySQL makes a great SQL Server alternative if you are price-conscious.  I find both very quick to set up and learn.  

Apache Flume

During my evaluation of NoSQL solutions the biggest hurdle I had, by far, was loading data into Hadoop.  The easiest way I found to do this was with Apache Flume.  But it took me a long time to figure out that my approach to data loading was wrong.  As an RDBMS data architect I was biased toward the techniques I would use to load a SQL Server.  I first tried scripting (which is how most NoSQL solutions have you load their sample data, but which tends not to work well with your "real" data), but the learning curve was too high for a proof-of-concept.  I then tried ETL tools like Informatica and had better success, but it was still too cumbersome.  

I began thinking like a NoSQL guy and decided to use HiveQL.  I had better success getting the data into Hadoop...but now I had to get the data out of my RDBMS in a NoSQL-optimized format that I could quickly use HiveQL against.  

As my journey continued I thought about how we would get our data into Hadoop if we ever deployed it as a real RDBMS complement.  We would probably write something that "listened" for "interesting" data on the production system and then put it into Hadoop.  That listener is Apache Flume.  Why not just point Flume at our Service Broker events and scrape the data that way?  I had that up and running in a few hours.  

Flume works by having an agent running in a JVM listen for events (such as Service Broker messages relayed to a JMS system).  The events are queued up into a "channel".  The channel then produces outgoing events to Hadoop/HDFS/HBase (or whatever you use for persistence).  So, why use a channel in the middle?  Flexibility and asynchronicity.  Channels have disaster-recovery mechanisms built in.  As your throughput needs change you can configure your agents for fan-in (more agents talking to fewer channels) or fan-out (fewer agents talking to more channels).  The former is great if you have, for instance, multiple listeners that only need to talk to one Hadoop instance.  The latter is good if you have one source system that you want to talk to multiple Hadoop instances, or if you need to route messages to a standby channel while you do maintenance on your Hadoop instance.  
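The source → channel → sink wiring all lives in one properties file.  Below is an abbreviated sketch (agent and path names are made up, and a real JMS source needs several more connection properties than shown):

```
# Minimal Flume agent: a JMS source feeding an HDFS sink
# through a file-backed channel.
agent1.sources  = jms-src
agent1.channels = ch1
agent1.sinks    = hdfs-sink

# Source: listens for JMS messages and writes them to the channel.
agent1.sources.jms-src.type = jms
agent1.sources.jms-src.channels = ch1

# Channel: file-backed, so queued events survive an agent restart.
agent1.channels.ch1.type = file

# Sink: drains the channel into HDFS.
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.channel = ch1
agent1.sinks.hdfs-sink.hdfs.path = /flume/events
```

Fan-in and fan-out are just variations on this file: multiple sources naming the same channel, or one source naming multiple channels.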

This means that Flume is very scalable and can handle constant data streams effectively without worry of data loss.  This is great for something like streaming stock quotes.  Flume is by far the easiest way to load data quickly into Hadoop.  

