DaveWentzel.com            All Things Data

Data deduplication for the data architect

Data deduplication is definitely one of the hot IT buzzwords right now.  Data deduplication is a data compression method that removes redundant data from persisted storage.  The typical example is a large PDF attached to an email and sent to 20 people, resulting in 20 copies of the file on the Exchange Server.  Data deduplication techniques save one copy of the file and replace the others with pointers to it.  There are many ways to implement this: as a post-processing task, at the SAN level, when backups occur, etc.
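The "one copy plus pointers" idea can be sketched in a few lines.  This is a minimal illustration of block-level dedup, not any vendor's implementation; the chunk size, the `dedup_store` function, and the in-memory chunk store are all assumptions made for the example.

```python
import hashlib

# A minimal sketch of block-level deduplication: unique chunks are stored
# once, keyed by their content hash, and each "file" becomes a list of
# chunk pointers.  Chunk size and the store are illustrative assumptions.
CHUNK_SIZE = 4096

chunk_store = {}   # hash -> chunk bytes (the single stored copy)

def dedup_store(data: bytes) -> list[str]:
    """Split data into fixed-size chunks; store each unique chunk once."""
    pointers = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)   # keep only one copy
        pointers.append(digest)
    return pointers

# The email scenario: the same PDF "attached" to 20 mailboxes.
pdf = b"%PDF-1.4 ..." * 1000
mailboxes = [dedup_store(pdf) for _ in range(20)]

# 20 logical copies, but only the unique chunks are physically stored.
logical = 20 * len(pdf)
physical = sum(len(c) for c in chunk_store.values())
print(f"logical {logical} bytes, physical {physical} bytes")
```

Each mailbox ends up holding the same small list of pointers, so the 20th copy costs almost nothing in storage.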

So, how does this affect the data architect?  First, understand the technology so you can speak about it intelligently when senior management comes inquiring.  Next, determine whether data deduplication is applicable to your databases.  It probably isn't.

At the individual database level, databases are intentionally designed with the appropriate level of normalization for the given application.  We tend to tolerate strategically duplicated data for read performance reasons.  If your data does have unnecessary duplication in it, data deduplication technology won't help (much); you need to refactor the design.
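Refactoring the design means normalizing the duplication away, not compressing it after the fact.  A toy sketch, with hypothetical table and column names chosen purely for illustration:

```python
# A hypothetical denormalized orders table: customer name and city are
# repeated on every row.  Names here are illustrative only.
orders = [
    {"order_id": 1, "customer": "Acme", "city": "Philadelphia", "total": 100},
    {"order_id": 2, "customer": "Acme", "city": "Philadelphia", "total": 250},
    {"order_id": 3, "customer": "Initech", "city": "Austin", "total": 75},
]

# Refactoring the design: pull the repeated attributes into a customer
# "table" and keep only a foreign key on each order.
customers = {}
normalized_orders = []
for row in orders:
    key = row["customer"]
    customers.setdefault(key, {"customer": key, "city": row["city"]})
    normalized_orders.append({"order_id": row["order_id"],
                              "customer_id": key,
                              "total": row["total"]})

print(customers)
print(normalized_orders)
```

The duplicated attributes now exist exactly once, by design, which is the architect's job; a storage-layer dedup appliance can't make that decision for you.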

At the data storage level, data deduplication technologies likely won't help either.  Many DBMS vendors' data files are far more bloated than the data they contain.  Again, this is for performance reasons: disk space is cheap, and vendors optimize their on-disk storage to answer queries in the most performant manner.  So even if your database has tons of duplicated data, it won't be immediately evident by looking at the on-disk storage patterns.  Having said that, some column-store vendors do compress their data so disk reads are faster, but that is a trade-off they have consciously made, and they have optimized their compression/decompression algorithms accordingly.  The typical DBMS vendor has not optimized its code for data dedup technologies.
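The alignment problem is easy to demonstrate.  Below is a toy page layout (NOT any vendor's actual on-disk format) where the same 24-byte payload appears 10,000 times, yet fixed-block dedup finds zero duplicate blocks, because the payloads are interleaved with per-row and per-page metadata:

```python
import hashlib

# A toy illustration of why block-level dedup finds little in database
# data files: identical row payloads are interleaved with per-row ids,
# page headers, and free space, so no two fixed-size blocks on disk end
# up byte-identical.  The page format here is invented for the example.
PAGE_SIZE = 8192
BLOCK = 4096
ROWS_PER_PAGE = 200
payload = b"duplicated customer data"          # same 24-byte payload everywhere

def make_page(page_id: int) -> bytes:
    header = page_id.to_bytes(8, "little") + b"\x00" * 8       # page header
    rows = b"".join(
        (page_id * ROWS_PER_PAGE + i).to_bytes(4, "little") + payload
        for i in range(ROWS_PER_PAGE)          # globally unique row id + payload
    )
    page = header + rows
    return page + b"\x00" * (PAGE_SIZE - len(page))            # free space

datafile = b"".join(make_page(p) for p in range(50))

# Fixed-block dedup over the file: count unique 4 KB blocks.
total = len(datafile) // BLOCK
unique = len({hashlib.sha256(datafile[i:i + BLOCK]).hexdigest()
              for i in range(0, len(datafile), BLOCK)})
print(f"{total} blocks, {unique} unique")      # every block is unique
```

Despite 10,000 copies of the identical payload, the dedup appliance sees nothing to deduplicate, which is exactly the point above.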

At the data backup level we could implement data dedup, especially if we are using some kind of copy-on-write mechanism on our SAN.  But if your backups are encrypted at backup time, it's likely data dedup won't help here either, since the encryption will obfuscate any underlying data patterns that could otherwise be deduplicated.
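This effect is easy to see numerically.  The sketch below uses a random one-time-pad XOR as a stand-in for real encryption (any cipher with a fresh key or IV per backup behaves the same way for this purpose); `unique_chunks` is an invented helper that approximates dedup potential:

```python
import hashlib
import os

def unique_chunks(data: bytes, size: int = 4096) -> int:
    """Count distinct fixed-size chunks -- a proxy for dedup potential."""
    return len({hashlib.sha256(data[i:i + size]).hexdigest()
                for i in range(0, len(data), size)})

backup = b"same database page contents " * 8192   # highly redundant plaintext

# Two nightly backups of identical data: the plaintext dedups very well.
print(unique_chunks(backup + backup))   # only a handful of unique chunks

# "Encrypt" each backup with its own random keystream.  A one-time-pad
# XOR is used here purely as a toy model of encryption.
def encrypt(data: bytes) -> bytes:
    key = os.urandom(len(data))
    return bytes(a ^ b for a, b in zip(data, key))

enc1, enc2 = encrypt(backup), encrypt(backup)
print(unique_chunks(enc1 + enc2))       # essentially every chunk is unique
```

Once each backup is encrypted under its own key, two byte-identical backups share no blocks at all, so the dedup ratio collapses to roughly 1:1.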


Data deduplication shouldn't come into play in the database world for the data architect.  Our systems should be designed for speed, with some redundant data built in after careful consideration.  We shouldn't implement a technology that will undo those decisions.  As for data deduplication of backups and other copies of data, I don't think it matters to the data architect as long as those backups/copies are available when we need them.  I do believe there will be more talk of data deduplication as senior management sees the cost savings on their expensive SANs.  I see this already with virtualized SANs and thin-provisioned storage, both of which tend to affect the performance of our databases negatively even though they save money.
