DaveWentzel.com All Things Data
I briefly wrote about these a few weeks ago in this blog post. I believe Michael Stonebraker is credited with "inventing" this technology, although it's really been around since the 1970s. I was tasked with evaluating column-oriented databases after management read about them in a trade journal and thought they might be applicable to our fast-growing analytics system. They may well be right: fast data access is the cornerstone.
What are these things?
It might be more helpful to ask what a traditional RDBMS is, and to really think about it from a non-traditional, more historical standpoint. The traditional RDBMS was created to retrieve data "a row at a time" in a more efficient manner than, say, dBASE or rbase or an ISAM system could. Thirty years ago these were the apps that were needed and desired: get me this customer record now, then save the update, insert a new customer, delete a customer, etc. All done transactionally.
Now think about a data warehouse. Does it really need this? Does it need to be transactional? Does it really read and retrieve a row at a time? Nope. Instead, the workload is far more query-intensive, so we need to be concerned with the chief bottleneck of the data warehouse: the disk. Now, yes, SQL is a wonderful query language and modern RDBMSs are optimized to the hilt to make query access as performant as possible. But what happens when you still hit the performance wall?
A column-oriented db changes the paradigm. If you know Oracle or SQL Server, you know to choose datatypes and widths when designing your tables so your row fits on a page. You still do that with a column store, but now I'm not really as concerned with getting the "row" on the page as I am with getting as many values from a single column (or a few columns) as possible on a single data page.
A typical fact table in a DW may have, say, 80 cols. At any given time I may be interested in only a few of those cols...maybe the keys and a few measures. I'm probably not interested, say, in the "audit" cols (DateCreated, CreatedBy, ModifiedBy, etc), nor the data provenance cols, etc. In a typical RDBMS, when I read the disk I'm actually reading all of those cols, since storage is organized by row (typically...more later). Why not instead have the physical storage organized such that the important cols are grouped together closely?
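A quick back-of-the-envelope sketch makes the I/O difference concrete. The row width, column count, and row count below are made-up numbers for illustration, not from any real system:

```python
# Hypothetical fact table: 80 columns, ~10 bytes per column, 100M rows.
cols, bytes_per_col, rows = 80, 10, 100_000_000
cols_needed = 5  # the keys plus a few measures the query actually touches

row_store_bytes = rows * cols * bytes_per_col          # row store: must read whole rows
col_store_bytes = rows * cols_needed * bytes_per_col   # column store: read only the 5 cols

print(row_store_bytes // col_store_bytes)  # 16 -- 16x less I/O, before any compression
```

The ratio is just 80/5; the point is that a column store's I/O scales with the columns you ask for, not the width of the table.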
Another way to think of it
Unless someone changes the rules of mathematics and set-based theory, a data set is always expressed in 2 dimensions. Something like this (a 4-row, 3-column table; the cell labels stand in for the original example's values):

FirstName  LastName  Col3
r1c1       r1c2      r1c3
r2c1       r2c2      r2c3
r3c1       r3c2      r3c3
r4c1       r4c2      r4c3
In a traditional RDBMS that data is persisted to disk similar in concept to a simple csv string:

r1c1,r1c2,r1c3;r2c1,r2c2,r2c3;r3c1,r3c2,r3c3;r4c1,r4c2,r4c3
In a column store the data is stored more like this:

r1c1,r2c1,r3c1,r4c1;r1c2,r2c2,r3c2,r4c2;r1c3,r2c3,r3c3,r4c3
This might be a bad example, since when querying for LastName I would likely always want FirstName too, but what if I really didn't care for some reason? Note that scanning the data for a single column would be much faster using the column store. If your browser is displaying this page in a fixed-width font, you should also note that the column-store structure uses less storage. Why? There is one fewer semicolon, because I am organizing the data by column rather than by row, and there are only 3 columns but 4 rows.
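To make the two layouts concrete, here is a minimal Python sketch (using the same placeholder cell values) that serializes a 4-row, 3-column data set both ways and counts the delimiters:

```python
# 4 rows x 3 columns of placeholder values.
rows = [
    ("r1c1", "r1c2", "r1c3"),
    ("r2c1", "r2c2", "r2c3"),
    ("r3c1", "r3c2", "r3c3"),
    ("r4c1", "r4c2", "r4c3"),
]

# Row store: commas within a row, semicolons between rows.
row_store = ";".join(",".join(r) for r in rows)

# Column store: pivot first (zip), then commas within a column,
# semicolons between columns.
col_store = ";".join(",".join(c) for c in zip(*rows))

print(row_store.count(";"))  # 3 (one between each of the 4 rows)
print(col_store.count(";"))  # 2 (one between each of the 3 columns) -- one fewer
```

Scanning one column now means reading a single contiguous run of col_store, instead of hopping through row_store and skipping the columns you don't want.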
What are some other differences?
- Transactions, although supported, are not the goal, so performance here tends to suffer. Since a row's columns are scattered around the disk, an UPDATE, although possible, will be very expensive. So will DELETEs, for the same reason. INSERTs sound like they would be costly too, but tend not to be. How? Well, if you need to INSERT quickly and transactionally, performance will suffer, but loading data in bulk is fine, and even trickling rows in is fine if we can optimize the load so the columns arrive in the correct order. A good ETL developer does this already: get the data staged and ready in the form that will be fastest to import, including using tricks with partitions and switching.
- Most vendors will not have the concept of using transaction/REDO logs for recovery purposes, which would add to write overhead. Instead, recovery is guaranteed by replicating the data to other nodes in the grid. HA is inherent; there's no need for cold spares or warm standbys that waste money.
- As mentioned above, less storage space is required. Since disk is the slowest component of any machine the performance implications should be apparent.
- Since any single col has a single data type, all of the ints can be lumped together and separated from the VARCHAR2s, so I can start to utilize compression smartly. Vertica takes it a step further and has optimized their query processor to be able to act on the compressed data without decompressing it first.
- Multiple copies of the data are stored in different sorts to optimize queries. This is similar to having multiple clustered indexes in SQL Server (or Indexed Views) without having to worry about UPDATE performance or bookmark lookups.
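The compression bullet above is easy to sketch: once a column's values are stored contiguously (and especially when the column is sorted or low-cardinality), a simple scheme like run-length encoding becomes very effective, and some queries can be answered without decompressing individual rows. A minimal Python sketch of the idea (not any vendor's actual implementation):

```python
from itertools import groupby

def rle_encode(column):
    """Run-length encode a column: [(value, run_length), ...]."""
    return [(value, len(list(run))) for value, run in groupby(column)]

def rle_count(encoded, predicate):
    """Answer a COUNT(*)-style query with a filter directly on the
    compressed form, by summing run lengths for matching values."""
    return sum(n for value, n in encoded if predicate(value))

# A sorted, low-cardinality column (e.g. a state code in a fact table).
state = ["CA"] * 5 + ["NY"] * 3 + ["PA"] * 4

encoded = rle_encode(state)
print(encoded)                                  # [('CA', 5), ('NY', 3), ('PA', 4)]
print(rle_count(encoded, lambda v: v == "NY"))  # 3
```

In a row store, these 12 values would be interleaved with every other column of every row, and run-length encoding them would buy you almost nothing.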
Can't I do any of this in my traditional RDBMS?
Yup. Traditional RDBMS vendors are watching the column-store companies like Vertica to see what their research is uncovering. Features that will help you include basic partitioning, materialized/indexed views, and plain-vanilla indexes. Even OLAP cubes, although not *really* part of the RDBMS, still help.
How else? Buy more RAM, use smaller data types, consider SSDs. Anything you can do to get disk seek times down (and this technology is focused on eliminating the seek by making the scan a better choice) will help.
What's old is new again. You can use SAN storage with a column store, but it's not necessary and possibly not even desirable. With grids, shared-nothing architectures, and redundant data storage, it might be smarter to have the data persisted to local storage. Your SAN is probably a higher RAID level than you need, and since performance is paramount in a column store, RAID 0 implemented in DAS might be the best choice.
I can't just replace my Data Warehouse with a column-store, management would kill me
Right. DW's have a political connotation at many organizations. Saying that performance sucks and the solution is a column-store may not win you many friends at your place. But you could feed the column-store from the DW or ODS instead, and then begin using the column-store for those applications with the data demands that require it, such as real-time analytical reporting.
I'm going to try it, what are the gotchas?
We've all seen OO databases touted for speed, only to get burned when we learn they don't support ANSI SQL. Make sure your column store does, unless you want to retrain everyone. And make sure your vendor supports your hardware/OS.
The one vendor I have some experience with in the column-store space is Vertica. It uses a shared-nothing, grid-based architecture of commodity Linux servers. It has what they call a "hybrid data store": data can be inserted quickly into a write-optimized store (WOS) to allow real-time data analytics and fast data insertion, alongside the standard read-optimized store (ROS). Both are column stores. Vertica provides a nice toolset to help you make physical design decisions, which is helpful since existing data-management rules don't apply to column stores. These guys are the leader in column-store technology.
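Conceptually, a hybrid store works something like the toy model below. This is my own sketch of the general WOS/ROS idea, not Vertica's actual design: inserts land in an append-friendly write-optimized buffer, and a periodic "moveout" merges them into sorted, column-oriented storage that queries scan.

```python
class HybridColumnStore:
    """Toy model of a write-optimized/read-optimized hybrid (illustrative only)."""

    def __init__(self, columns, wos_limit=1000, sort_key=0):
        self.columns = columns             # column names
        self.sort_key = sort_key           # index of the column the ROS is sorted on
        self.wos_limit = wos_limit
        self.wos = []                      # write-optimized: row tuples, unsorted
        self.ros = [[] for _ in columns]   # read-optimized: one sorted list per column

    def insert(self, row):
        """Fast path: append to the WOS; flush when it gets large."""
        self.wos.append(row)
        if len(self.wos) >= self.wos_limit:
            self.moveout()

    def moveout(self):
        """Merge WOS rows into the column-oriented ROS, keeping sort order."""
        existing = list(zip(*self.ros)) if self.ros[0] else []
        merged = sorted(existing + self.wos, key=lambda r: r[self.sort_key])
        self.ros = [list(col) for col in zip(*merged)]
        self.wos = []

    def scan(self, column):
        """Queries read one column: ROS values plus any not-yet-moved WOS rows."""
        i = self.columns.index(column)
        return self.ros[i] + [r[i] for r in self.wos]
```

A query touches only the columns it needs via scan(), while inserts stay cheap until the periodic moveout does the expensive sorting in bulk.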
It's amazing how what was old is now new again. Although people claim this is a new technology, it really isn't. Sybase IQ has been around for a while. And way back when, in the dark ages before commercial RDBMSs, people wrote their own "RDBMSs"; they of course structured their storage to meet their needs, and the result was often column-oriented storage. I wonder how long it will be before we see network (CODASYL) and hierarchical data models come back in vogue?
But what do you recommend to management? More research is needed. I don't think we've tapped the limits of our hardware yet. I also think that time spent re-architecting for a column-store could just as well be spent looking at putting our data on an I/O diet. With a column-store we would still need DBAs well-versed in the technology...maybe we need people better versed in row-oriented storage performance? Maybe we need to understand the SAN better instead?
I'm old school and am averse to change, so I wouldn't stake my reputation or employment on these "newer" technologies. I'm just scared the query optimizer isn't mature enough, that we couldn't get the support we need, and that we don't have the in-house talent. But this is something to watch closely: try to understand how it affects your world and how you can translate its benefits into something you can use today.