Now, ask your CMO if she cares whether reported revenue is off by $2 million. You'll hear a very similar diatribe: "Our data quality is horrible. How can we be expected to understand our business with numbers that aren't accurate?" But if you start probing and show that the $2 million error is only a 0.5% variance-to-actual, and the trend is accurate, the CMO will calm down: "Well, what we really need to know isn't the actual number, we need to know what to do next." Exactly! Data quality is less important than you thought it was. For an accountant, a data quality bug, even if it's 2 cents, is critical. For a CMO, a bug that is EIGHT ORDERS OF MAGNITUDE WORSE ($2M vs. $0.02, a factor of 100 million) really doesn't matter.
Point is: solving data quality for data quality's sake, or starting an MDM project because your IT/data team told you to, is not a recipe for a high-value, high-return project. Find out if quality really matters.
There are 7 recognized core tenets of data quality. Let's look at each one and determine how important each really is (this is just my opinion). The answers may surprise you.
| Tenet | What is it? | What's really going on? | How do we fix it? |
|---|---|---|---|
| Completeness | Is all requisite information available? | Why is data incomplete? Sometimes data is initially incomplete and subsequent events fill in the missing data. An SME can likely tell you why data appears incomplete. | Bug or misunderstanding. |
| Consistency | Does the analytics data match the System of Record's data? | If this is the problem, you have an ETL bug. | Fix it or prioritize it on the backlog. Consistency issues are always legitimate problems, but they are usually easy to fix. |
| Duplication | Multiple instances of the same data. | You either have a bug or a misunderstanding of the data. Usually duplicate data is data that is enriched during subsequent events. See Completeness above. | Bug or misunderstanding. |
| Conformity | The data adheres to corporate standards. | Determine why data doesn't conform; there may be a good reason. Example: varying DATE formats are confusing, but sometimes they cannot be changed in the system of record. | Bug or misunderstanding. |
| Accuracy | The data reflects reality. | Accuracy issues are usually due to bugs in existing processes in the Systems of Record. | Fix the bug. |
| Timeliness | Data is available when needed. | This is more about system design than DQ. I always design data systems so they can be real-time if ever needed, without rework. | System design problem, not DQ. |
| History | Data has a lifecycle that needs to be captured. | If history is missing, we need to add code to ensure we capture it. This is what a data lake or data warehouse does. | New feature to be prioritized. |
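Before funding anything, it's worth measuring how big each issue actually is. Here's a minimal sketch of profiling a few of these tenets with pandas; the DataFrame and its column names (patient_id, visit_date, amount) are hypothetical stand-ins for your own data:

```python
import pandas as pd

# Hypothetical extract; in practice this would come from your source system.
df = pd.DataFrame({
    "patient_id": [101, 101, 102, 103, None],
    "visit_date": pd.to_datetime(
        ["2021-03-01", "2021-03-01", "2021-03-02", "1899-12-31", "2021-03-04"]),
    "amount": [250.0, 250.0, 125.0, None, 80.0],
})

# Completeness: how many requisite values are missing, per column?
print(df[["patient_id", "visit_date", "amount"]].isna().sum())

# Duplication: rows sharing a natural key may be a bug -- or enrichment
# from subsequent events, as the table above notes. Count before assuming.
dupes = df[df.duplicated(subset=["patient_id", "visit_date"], keep=False)]
print(f"{len(dupes)} rows share a (patient_id, visit_date) key")

# Conformity: flag dates outside a plausible range.
print(df[~df["visit_date"].between("2000-01-01", pd.Timestamp.today())])
```

Ten minutes of profiling like this usually tells you whether you're looking at a bug, a misunderstanding, or nothing at all.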
Most of the above DQ issues can be explained as simple bugs or missing features and do not require a huge IT initiative. Some of the above issues are due to misunderstandings of your data or shortcomings of existing systems. Understanding and fixing these issues takes time, but there is no need to create an umbrella “Data Quality” initiative to solve them. Take the time to research and understand the issues, then prioritize and fund each separately. Better yet, add these to a sprint backlog as technical debt and have your team fix the issues as time permits.
The quality of your data probably isn't as bad as you think it is. A few things to consider:

- Your data may have a genuine bug (debits != credits), but it's possible your marketing team won't care. If the quality is critical, stop and fix it now. You don't need a corporate DQ initiative; you just need to fix the bug.
- Whenever I start a new analytics project I begin with some basic data profiling. And every time, my query outputs do not match the system of record's output. I know the issue is with my query, not with the DQ; I just haven't learned the nuances of the data model yet. I don't yet understand the data. I'm probably missing a filter condition on the JOIN clause or don't have the WHERE predicate quite right. Systems of Record, and their data models, are finicky. I'm supposedly an expert at this stuff, and if I have difficulty querying unfamiliar systems, imagine how your users struggle. (A toy example of this kind of JOIN mistake follows below.)
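To make that concrete, here's a hypothetical pandas illustration of the missing-filter mistake; the tables and column names are all made up. The system of record keeps every historical address row, and joining without filtering to the current row silently double-counts revenue:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [100, 50, 75]})

# The system of record keeps every historical address row per customer.
addresses = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "is_current": [False, True, True],
    "city": ["Old Town", "New Town", "Springfield"],
})

# Naive join: customer 1's orders match two address rows each,
# so their revenue is double-counted. My query is wrong, not the data.
naive = orders.merge(addresses, on="customer_id")
print(naive["amount"].sum())   # 375, not the true 225

# Correct join: apply the "missing predicate" first.
fixed = orders.merge(addresses[addresses["is_current"]], on="customer_id")
print(fixed["amount"].sum())   # 225
```

The data was never wrong; my query was.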
The dirty little secret of data is that most organizations spend the vast majority of their time cleaning and integrating data, not actually analyzing it. We spend too much time cleaning data so it is perfect and not enough time trying to understand what the data is saying. Don’t let “perfect” be the enemy of “good enough”.
Have you ever heard this one?
“Data scientists spend 80% of their time cleaning dirty data. The other 20% they spend complaining about cleaning dirty data.”
Actually, most experienced data scientists learn to love dirty data. There is signal in the noise.
I was once brought in to a company to build real-time call center analytics. I spent a few days ingesting the 3rd party call center management software’s data in real-time. I built a few basic reports to give to the SMEs to ensure I understood the data and problem domain.
The call center manager, let’s call him Joe, looked at the reports for a few minutes and said, “See, I knew it, our people aren’t on the phones for 6 hours a day. Your report proves they are only on the phone for 3.5 hours.”
We started to have a dialog and Joe told me he gets a report every day at midnight from the vendor that lists his CS reps’ time on the phones. The reports generally show they are logged onto the system when they should be.
Joe: “But when I look out the window I see all of my reps smoking cigarettes or chatting in the courtyard. Your reports look much more accurate.”
I assured Joe my queries were probably wrong. I called the vendor and explained that when Joe calls the batch API he sees one set of metrics, but when I call the real-time API and aggregate all of the hours, I get a different metric.
Me: “I assume my queries are wrong. I’m really sorry to waste your time with this but could you help educate me?”
Data vendor: “Ohhhh, I see. Um, could you give me a few days and we’ll have the data problem fixed?”
Me, in a confused tone: “Um, sure.”
And in a few days the nightly batch reports were showing the same metrics as my queries, yet I hadn't changed any code. After some research it was revealed that the nightly batch reports were never right; the CS reps knew this and were "gaming the system," knowing they would get paid for being on the phones while they were goofing off.
Point is: Use an existing data project to find the root cause of existing bugs and fix them. You don’t need a special DQ initiative to do this. Data science projects are especially good at uncovering anomalies in your data.
“Data Quality isn’t a destination, it’s a journey”
IT-driven DQ projects are usually less about the actual quality of the data and more about paying down technical debt. I once heard this from a CTO:
“Our data is a mess. It’s not in the right format for analytics. It’s not well documented. We need to fix DQ.”
Unless a data owner/sponsor says those words, I respectfully disagree. Instead, those words, spoken by a CTO, sound to me like existing technical debt that is causing hardship for the IT organization. We can fix these issues without a huge project, but the CTO needs to devote resources to it.
A data quality bug is nothing more than an opportunity to refactor code.
So when does data quality matter? Quite simply, when the business owner says it matters. But don't accept the first answer you hear, which will likely be, "Of course data quality matters to us! Stop asking stupid questions!" Instead, probe a little deeper. Data Quality needs to be driven from the business, not from IT. If the business won't fund a data quality initiative and devote resources to the project, that's indicative of the overall priority of the initiative. Actions speak louder than words. If a business manager says that data quality matters but won't fund a cleanup initiative, maybe DQ doesn't really matter.
I could list lots of examples where DQ is critical. Here are two: healthcare systems and anything in the nuclear industry. These are obvious, yet even there the level of data quality needed is still dictated by the business owner. I have a lot of healthcare customers, and every one of them complains about their DQ. That makes me think DQ is a matter of degree.
Quick story: in some cases, dirty data is much better than clean, quality data. Many, MANY years ago a city contracted with me to merge their many disparate healthcare IT systems into one Golden Patient Record (we often call this a "master data management" initiative). They had dozens of "patient record" systems, each with a varying level of patient data quality. Some systems were hospitals, others were skilled nursing facilities, still others were drug treatment facilities. As a good data scientist I used statistics to classify disparate patient records based on the probability that they were matches (a sketch of the idea follows below).
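For flavor, here's a minimal sketch of that matching idea, assuming a crude string-similarity score; the field names, example records, and threshold are all hypothetical. The real project used proper probabilistic record linkage (Fellegi-Sunter-style field weights), not a toy ratio:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude 0..1 similarity between two field values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Average field similarity across name, address, and SSN."""
    fields = ["name", "address", "ssn"]
    return sum(similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)

# Two records that differ by a typo, a punctuation mark, and one SSN digit.
a = {"name": "Jon Smith",  "address": "12 Oak St",  "ssn": "123-45-6789"}
b = {"name": "John Smith", "address": "12 Oak St.", "ssn": "123-45-6780"}

score = match_score(a, b)
# Above a tuned threshold, treat the records as the same golden patient.
print(f"match score: {score:.2f}",
      "-> same person" if score > 0.9 else "-> send to manual review")
```

The interesting part was never the scoring function; it was agreeing with the stakeholders on where the threshold should sit.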
After a few months of coding and reviewing with the stakeholders everyone agreed the matching rules were safe and we planned a phased rollout of the “patient master data” to a few of the source systems. We started with the drug treatment facilities, assuming those were the least critical.
Then disaster struck. Some drug rehab patients had learned to game the system over the years. They knew each facility had a different patient management system and even though each of the facilities’ systems communicated and shared data, they couldn’t share patient data if it wasn’t pristine. A slight misspelling of a name, an address that was a bit different, an SSN that was off by a digit, an inaccurate birthday…all of those things were enough to make it appear as though YOU were 2 different people to 2 different drug treatment facilities.
Now, let's assume you were a rehab patient and you knew this. You were prescribed 5mg of a drug and got it filled at Facility X; then, since you appeared to be 2 different people, you got the same 5mg at Facility Y. And let's say you did this 5 or 6 times a day, tweaking your patient record just a bit each time you got a new prescription. If you didn't overdose, you'd probably build up quite a tolerance!
Once we matched the data and deployed the code, you could no longer go to multiple facilities and appear to be different people. This meant you could now get only the 5mg you were prescribed, instead of much, MUCH more.
Suddenly, patients were getting the correct doses, but it wasn’t enough given their new tolerance.
Even though we solved the DQ issues we were told to quickly roll back the changes. Data quality would have to wait until this issue was solved.
Here are a few brief ideas:

- Data projects should have unit tests. It's proven that code quality goes up with unit tests, yet very few data practitioners write them. For new data projects, insist on unit tests (a minimal sketch follows this list).
- All new analytics projects should have data validation and monitoring built in. If data skews or appears to have quality issues, we need to know quickly so we can evaluate what actions to take (see the drift-check sketch after this list).
- Data scientists can help you with this. We are skilled at data profiling using statistics, and we can build ML algorithms that detect data skew, data drift, and anomalies.
- For any new data initiative, determine and document the DQ expectations. DQ is rarely perfect, but that shouldn't mean a potentially profitable analytics project gets shelved. Sometimes, if we accept initially lower data quality, we can create compelling analytics projects, with the understanding that we will improve quality over time.
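On the unit-testing idea, here's a minimal pytest-style sketch; clean_revenue() is a hypothetical transformation standing in for whatever your pipeline actually does:

```python
import pandas as pd

def clean_revenue(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing amounts and standardize to integer cents."""
    out = df.dropna(subset=["amount"]).copy()
    out["amount_cents"] = (out["amount"] * 100).round().astype(int)
    return out

def test_drops_missing_amounts():
    df = pd.DataFrame({"amount": [10.00, None, 19.99]})
    assert len(clean_revenue(df)) == 2

def test_rounds_to_cents():
    df = pd.DataFrame({"amount": [19.999]})
    assert clean_revenue(df)["amount_cents"].iloc[0] == 2000
```

Tests like these are cheap to write and catch exactly the class of bugs that otherwise surface as "data quality" complaints months later.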
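And on validation and drift monitoring, here's a minimal sketch using a two-sample Kolmogorov-Smirnov test from scipy; the thresholds, expected row counts, and synthetic data are all hypothetical:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=100, scale=15, size=5_000)      # last month's amounts
todays_batch = rng.normal(loc=110, scale=15, size=1_000)   # the mean has drifted

# Row-count sanity check: alert if today's volume swings more than 50%.
expected_daily_rows = 1_200  # from historical averages (hypothetical)
if not 0.5 * expected_daily_rows <= len(todays_batch) <= 1.5 * expected_daily_rows:
    print("ALERT: row count is outside the expected daily range")

# Distribution drift: a tiny p-value means today's data no longer looks
# like the reference -- a prompt to investigate, not an automatic failure.
stat, p_value = ks_2samp(reference, todays_batch)
if p_value < 0.01:
    print(f"ALERT: possible drift (KS statistic {stat:.3f}, p={p_value:.2e})")
```

Alerts like these are prompts to investigate, not automatic failures; a human still decides whether the skew is a bug or a genuine business change.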
The Microsoft Technology Center is a service that helps Microsoft customers on their Digital Transformation journey. We know that successful data projects are less about the technology and more about the process and people. (This is especially true for Data Quality.) We've been doing data for years. We are thought leaders, conference speakers, and former consultants and executives. We've learned the patterns that will help you execute successful DQ projects, and with the cloud we can execute in weeks instead of years.
Does this sound compelling? SUCCESS for the MTC is solving challenging problems for respected companies and their talented staff. Does that sound like folks you can trust on your next project? The Digital Transformation is here, and we know how to help. Would you like to engage?
Are you convinced your data or cloud project will be a success?
Most companies aren’t. I have lots of experience with these projects. I speak at conferences, host hackathon events, and am a prolific open source contributor. I love helping companies with Data problems. If that sounds like someone you can trust, contact me.
Thanks for reading. If you found this interesting please subscribe to my blog.
Dave Wentzel