DaveWentzel.com            All Things Data

What is a Data Retention Plan (Part 1)

Every data architect must have been given a handbook that I never received, one that says something like, "you must have a data retention plan."  If you ask the typical data architect what a good data retention plan is, I'd be surprised if you got an answer more advanced than "I work with the business units to determine what the business needs, then I architect my solution accordingly."  Most simply say, "we purge our transactional data every x years."  I think these answers are lacking quite a bit, but they are the canned answers I hear from data architect candidates during interviews.  Here are some questions I don't think many data architects consider when it comes to data retention.  We'll start with the easy stuff, and this list is by no means complete or conclusive:

Technology Questions around Data Retention

  • Disaster recovery is the first thing.  Many people abbreviate Disaster Recovery as DR, but I think DR should really be Data Retention.  Basically, "disaster recovery" is a subset of "data retention", in my opinion.  For instance, in case of disaster I may not require zero data loss for all of my data.  I may require zero data loss for transactional data but tolerate total data loss of my DW if my ETL processes can regenerate the data for me (and do it quickly enough).  If not, I may tolerate data loss in my DW only for data x years or older.  Now we are talking about different recovery models in the DBMS depending on the data.  You had better work with your DBAs on that, and you had better understand your DBMS to be sure it can do whatever you propose.
  • Tiered storage also falls under data retention in my mind.  I may move data that is x years or older to Tier 3 storage so that I never have to purge it, at the sacrifice of IOPS and recoverability.  As a data architect I would never allow these decisions to be made solely by my DBA or SAN engineer.
  • If your backup systems are upgraded, will you still be able to get to that old archive data from 10 years ago?  I'll bet you are using a different backup technology/media/software vendor than you used 10 years ago.  Again, some data architects leave these decisions to the backup engineer.  I think that is a big mistake.  You need to control this.  Input from the backup engineer is fine, but it is the data architect's job to understand the processes and sign off that they meet requirements.
  • You should think more about the data's future migration than about its permanence.  This is the rule I use.  In that regard it is absolutely necessary, in my opinion, to have data structures that note the provenance of the data (data lineage), which business rules engines and versions processed that data, and the data element's "meaning".  You'd be surprised how differently systems define a "Job Code".  A field name is not enough to denote meaning.  If you can map it to an enterprise canonical data model, that should likely be enough.  Even better is metadata stored in the database that denotes all of this.
  • Consider the use of a NoSQL solution to maintain more data at lower cost.  You know you've wanted to try a NoSQL solution, and data retention might be a way to weasel one into your organization.
  • Really understand what "purge" means.  If the business says "purge my data after x years", understand what that means.  Do you just remove it from the transactional database, or from the DW too?  How about from the backups?  For regulatory (or even CYA) reasons you may have to purge that data from EVERYWHERE.  The reason I have this under the "technology" section is that the data architect must know everywhere a datum will travel and whether it is even purge-able there.  Think enterprise content indexers.  As a data architect, are you sure you can purge the data?  Most people say, "I'll just have a DBA write a routine when the time comes and I need to purge."  Well, I don't know many DBAs who can write a purge script for a SharePoint index server.  I'm seeing more and more regulatory cases where the organization would have been OK if it had purged as the regulation required, but didn't.  It can come back to haunt you.
  • Review and test your archiving procedures regularly.  That's probably common sense.  
  • Understand Data Security.  I suggest reading some good books on the topic, specifically The Myths of Security.  This gives you the contrarian viewpoint to what many data security specialists will tell you.  You don't need to understand SHA hashing and 3DES encryption as a data architect (for data retention purposes), but you need to understand how to keep your data secure and be able to converse with your security people using their lingo.
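
To make a few of the points above concrete, here is a minimal Python sketch of how the recovery, tiering, and purge decisions could be written down per class of data instead of living in people's heads.  Everything in it is an assumption for illustration: the class names, the field names, the example data classes, and the locations are all hypothetical, not something from a real system.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Location:
    """One place a copy of the data ends up (hypothetical names)."""
    name: str
    purgeable: bool  # can we actually delete from here today?


@dataclass(frozen=True)
class RetentionPolicy:
    """Retention decisions for one class of data (illustrative fields only)."""
    data_class: str
    max_data_loss_minutes: Optional[int]  # None = total loss tolerated (regenerable)
    storage_tier: int                     # 1 = fastest, 3 = cheapest
    purge_after_years: Optional[int]      # None = never purge
    locations: Tuple[Location, ...] = ()

    def purge_due(self, created: date, today: date) -> bool:
        """True when data created on `created` has reached its purge age."""
        if self.purge_after_years is None:
            return False
        return created <= years_before(today, self.purge_after_years)

    def unpurgeable_locations(self) -> List[str]:
        """Places the data travels to that we cannot purge from yet."""
        return [loc.name for loc in self.locations if not loc.purgeable]


def years_before(today: date, years: int) -> date:
    """Same calendar day `years` years earlier (Feb 29 falls back to Feb 28)."""
    try:
        return today.replace(year=today.year - years)
    except ValueError:  # Feb 29 in a target year that is not a leap year
        return today.replace(year=today.year - years, day=28)


# Hypothetical policies: transactional data needs zero loss and a 7-year purge,
# while the DW can be regenerated by ETL and sits on Tier 3 storage forever.
POLICIES = [
    RetentionPolicy(
        data_class="orders_oltp",
        max_data_loss_minutes=0,
        storage_tier=1,
        purge_after_years=7,
        locations=(
            Location("transactional_db", purgeable=True),
            Location("data_warehouse", purgeable=True),
            Location("tape_backups", purgeable=False),
            Location("content_index", purgeable=False),
        ),
    ),
    RetentionPolicy(
        data_class="sales_dw",
        max_data_loss_minutes=None,
        storage_tier=3,
        purge_after_years=None,
        locations=(Location("data_warehouse", purgeable=True),),
    ),
]

if __name__ == "__main__":
    for policy in POLICIES:
        print(policy.data_class, "cannot purge from:", policy.unpurgeable_locations())
```

The useful part is not the code, it is the unpurgeable_locations() question.  If that list is not empty for data the business has asked you to purge, you have found the problem before the regulator does.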

Technology considerations for data retention are easy to grasp for most data architects.  In the next post I'll cover the business and legal questions around data retention.