Now, ask your CMO if she cares whether reported revenue is off by $2 million. You'll hear a very similar diatribe: "Our data quality is horrible. How can we be expected to understand our business with numbers that aren't accurate?" But if you start probing and show that the $2 million error is only a 0.5% variance-to-actual, and the trend is accurate, the CMO will calm down: "Well, what we really need to know isn't the actual number, we need to know what to do next." Exactly! Data quality is less important than you thought it was. For an accountant, a data quality bug, even if it's 2 cents, is critical. For a CMO, a bug that is EIGHT ORDERS OF MAGNITUDE WORSE ($2M vs. $0.02, a factor of 100 million) really doesn't matter.
Point is: solving data quality for data quality's sake, or starting an MDM project because your IT/data team told you to, is not a recipe for a high-value, high-return project. Find out if quality really matters.
There are 7 recognized core tenets of data quality. Let's look at each one and determine how important each really is (this is just my opinion). The answers may surprise you.
| Tenet | What is it? | What's really going on? | How do we fix it? |
|---|---|---|---|
| Completeness | Is all requisite information available? | Why is data incomplete? Sometimes data is initially incomplete and subsequent events fill in the missing data. An SME can likely tell you why data appears incomplete. | Bug or misunderstanding. |
| Consistency | Does the analytics data match the System of Record's data? | If this is the problem, you have an ETL bug. | Fix it or prioritize it on the backlog. Consistency issues are always legitimate problems, but they are usually easy to fix. |
| Duplication | Multiple instances of the same data. | You either have a bug or a misunderstanding of the data. Usually duplicate data is data that is enriched during subsequent events. See Completeness above. | Bug or misunderstanding. |
| Conformity | The data adheres to corporate standards. | Determine why data doesn't conform; there may be a good reason. Example: varying DATE formats are confusing, but sometimes they cannot be changed in the system of record. | Bug or misunderstanding. |
| Accuracy | The data reflects reality. | Accuracy issues are usually due to bugs in existing processes in the Systems of Record. | Fix the bug. |
| Timeliness | Data is available when needed. | This is more about system design than DQ. I always design data systems so they can be real-time if ever needed, without rework. | System design problem, not DQ. |
| History | Data has a lifecycle that needs to be captured. | If history is missing, we need to add code to ensure we capture it. This is what a data lake or data warehouse does. | New feature to be prioritized. |
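Before funding anything, it's worth measuring how big each issue actually is. Here's a minimal sketch of profiling a few of these tenets with pandas; the DataFrame and its column names (patient_id, visit_date, amount) are hypothetical stand-ins for your own data:

```python
import pandas as pd

# Hypothetical extract; in practice this would come from your source system.
df = pd.DataFrame({
    "patient_id": [101, 101, 102, 103, None],
    "visit_date": pd.to_datetime(
        ["2021-03-01", "2021-03-01", "2021-03-02", "1899-12-31", "2021-03-04"]),
    "amount": [250.0, 250.0, 125.0, None, 80.0],
})

# Completeness: how many requisite values are missing, per column?
print(df[["patient_id", "visit_date", "amount"]].isna().sum())

# Duplication: rows sharing a natural key may be a bug -- or enrichment
# from subsequent events, as the table above notes. Count before assuming.
dupes = df[df.duplicated(subset=["patient_id", "visit_date"], keep=False)]
print(f"{len(dupes)} rows share a (patient_id, visit_date) key")

# Conformity: flag dates outside a plausible range.
print(df[~df["visit_date"].between("2000-01-01", pd.Timestamp.today())])
```

Ten minutes of profiling like this usually tells you whether you're looking at a bug, a misunderstanding, or nothing at all.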
Most of the above DQ issues can be explained as simple bugs or missing features and do not require a huge IT initiative. Some of the above issues are due to misunderstandings of your data or shortcomings of existing systems. Understanding and fixing these issues takes time, but there is no need to create an umbrella “Data Quality” initiative to solve them. Take the time to research and understand the issues, then prioritize and fund each separately. Better yet, add these to a sprint backlog as technical debt and have your team fix the issues as time permits.
The quality of your data probably isn't as bad as you think it is. A few things to consider:

- Your data may have a genuine bug (debits != credits), but it's possible your marketing team won't care. If the quality is critical, stop and fix it now. You don't need a corporate DQ initiative; you just need to fix the bug.
- Whenever I start a new analytics project I begin with some basic data profiling. And every time, my query outputs do not match the system of record's output. I know the issue is with my query, not with the DQ; I just haven't learned the nuances of the data model yet. I don't yet understand the data. I'm probably missing a filter condition on the JOIN clause or don't have the WHERE predicate quite right. Systems of Record, and their data models, are finicky. I'm supposedly an expert at this stuff, and if I have difficulty querying unfamiliar systems, imagine how your users struggle. (A toy example of this kind of JOIN mistake follows below.)
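To make that concrete, here's a hypothetical pandas illustration of the missing-filter mistake; the tables and column names are all made up. The system of record keeps every historical address row, and joining without filtering to the current row silently double-counts revenue:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [100, 50, 75]})

# The system of record keeps every historical address row per customer.
addresses = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "is_current": [False, True, True],
    "city": ["Old Town", "New Town", "Springfield"],
})

# Naive join: customer 1's orders match two address rows each,
# so their revenue is double-counted. My query is wrong, not the data.
naive = orders.merge(addresses, on="customer_id")
print(naive["amount"].sum())   # 375, not the true 225

# Correct join: apply the "missing predicate" first.
fixed = orders.merge(addresses[addresses["is_current"]], on="customer_id")
print(fixed["amount"].sum())   # 225
```

The data was never wrong; my query was.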
The dirty little secret of data is that most organizations spend the vast majority of their time cleaning and integrating data, not actually analyzing it. We spend too much time cleaning data so it is perfect and not enough time trying to understand what the data is saying. Don’t let “perfect” be the enemy of “good enough”.
Have you ever heard this one?
“Data scientists spend 80% of their time cleaning dirty data. The other 20% they spend complaining about cleaning dirty data.”
Actually, most experienced data scientists learn to love dirty data. There is signal in the noise.
I was once brought in to a company to build real-time call center analytics. I spent a few days ingesting the 3rd party call center management software’s data in real-time. I built a few basic reports to give to the SMEs to ensure I understood the data and problem domain.
The call center manager, let’s call him Joe, looked at the reports for a few minutes and said, “See, I knew it, our people aren’t on the phones for 6 hours a day. Your report proves they are only on the phone for 3.5 hours.”
We started to have a dialog and Joe told me he gets a report every day at midnight from the vendor that lists his CS reps’ time on the phones. The reports generally show they are logged onto the system when they should be.
Joe: “But when I look out the window I see all of my reps smoking cigarettes or chatting in the courtyard. Your reports look much more accurate.”
I assured Joe my queries were probably wrong. I called the vendor and explained that when Joe calls the batch API he sees one set of metrics, but when I call the real-time API and aggregate all of the hours, I get a different metric.
Me: “I assume my queries are wrong. I’m really sorry to waste your time with this but could you help educate me?”
Data vendor: “Ohhhh, I see. Um, could you give me a few days and we’ll have the data problem fixed?”
Me, in a confused tone: “Um, sure.”
And in a few days the nightly batch reports were showing the same metrics as my queries, yet I hadn't changed any code. After some research it was revealed that the nightly batch reports were never right; the CS reps knew this and were "gaming the system," knowing they would get paid for being on the phones while they were goofing off.
Point is: Use an existing data project to find the root cause of existing bugs and fix them. You don’t need a special DQ initiative to do this. Data science projects are especially good at uncovering anomalies in your data.
“Data Quality isn’t a destination, it’s a journey”
IT-driven DQ projects are usually less about the actual quality of the data and more about paying down technical debt. I once heard this from a CTO:
“Our data is a mess. It’s not in the right format for analytics. It’s not well documented. We need to fix DQ.”
Unless a data owner/sponsor says those words, I respectfully disagree. Instead, those words, spoken by a CTO, sound to me like existing technical debt that is causing hardship for the IT organization. We can fix these issues without a huge project, but the CTO needs to devote resources to it.
A data quality bug is nothing more than an opportunity to refactor code.
So when does data quality matter? Quite simply, when the business owner says it matters. But don't accept the first answer you hear, which will likely be, "Of course data quality matters to us! Stop asking stupid questions!" Instead, probe a little deeper. Data Quality needs to be driven from the business, not from IT. If the business won't fund a data quality initiative and devote resources to the project, that's indicative of the overall priority of the initiative. Actions speak louder than words. If a business manager says that data quality matters but won't fund a cleanup initiative, maybe DQ doesn't really matter.
I could list lots of examples where DQ is critical. Here are two: healthcare systems and anything in the nuclear industry. These are obvious, yet even there the level of data quality needed is still dictated by the business owner. I have a lot of healthcare customers, and every one of them complains about their DQ. That makes me think DQ is a matter of degree.
Quick story: in some cases, dirty data is much better than clean, quality data. Many, MANY years ago a city contracted with me to merge their many disparate healthcare IT systems into one Golden Patient Record (we often call this a "master data management" initiative). They had dozens of "patient record" systems, each with a varying level of patient data quality. Some systems were hospitals, others were skilled nursing facilities, still others were drug treatment facilities. As a good data scientist I used statistics to classify disparate patient records based on the probability that they were matches (a sketch of the idea follows below).
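For flavor, here's a minimal sketch of that matching idea, assuming a crude string-similarity score; the field names, example records, and threshold are all hypothetical. The real project used proper probabilistic record linkage (Fellegi-Sunter-style field weights), not a toy ratio:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude 0..1 similarity between two field values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Average field similarity across name, address, and SSN."""
    fields = ["name", "address", "ssn"]
    return sum(similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)

# Two records that differ by a typo, a punctuation mark, and one SSN digit.
a = {"name": "Jon Smith",  "address": "12 Oak St",  "ssn": "123-45-6789"}
b = {"name": "John Smith", "address": "12 Oak St.", "ssn": "123-45-6780"}

score = match_score(a, b)
# Above a tuned threshold, treat the records as the same golden patient.
print(f"match score: {score:.2f}",
      "-> same person" if score > 0.9 else "-> send to manual review")
```

The interesting part was never the scoring function; it was agreeing with the stakeholders on where the threshold should sit.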
After a few months of coding and reviewing with the stakeholders everyone agreed the matching rules were safe and we planned a phased rollout of the “patient master data” to a few of the source systems. We started with the drug treatment facilities, assuming those were the least critical.
Then disaster struck. Some drug rehab patients had learned to game the system over the years. They knew each facility had a different patient management system and even though each of the facilities’ systems communicated and shared data, they couldn’t share patient data if it wasn’t pristine. A slight misspelling of a name, an address that was a bit different, an SSN that was off by a digit, an inaccurate birthday…all of those things were enough to make it appear as though YOU were 2 different people to 2 different drug treatment facilities.
Now, let's assume you were a rehab patient and you knew this. You were prescribed 5mg of a drug and got it filled at Facility X; then, since you appeared to be 2 different people, you got the same 5mg at Facility Y. And let's say you did this 5 or 6 times a day, tweaking your patient record just a bit each time you got a new prescription. If you didn't overdose, you'd probably build up quite a tolerance!
Once we matched the data and deployed the code, you could no longer go to multiple facilities and appear to be different people. This meant you could now get only the 5mg you were prescribed, instead of much, MUCH more.
Suddenly, patients were getting the correct doses, but it wasn’t enough given their new tolerance.
Even though we solved the DQ issues we were told to quickly roll back the changes. Data quality would have to wait until this issue was solved.
Here are a few brief ideas:

- Data projects should have unit tests. It's proven that code quality goes up with unit tests, yet very few data practitioners write them. For new data projects, insist on unit tests (a minimal sketch follows this list).
- All new analytics projects should have data validation and monitoring built in. If data skews or appears to have quality issues, we need to know quickly so we can evaluate what actions to take (see the drift-check sketch after this list).
- Data scientists can help you with this. We are skilled at data profiling using statistics, and we can build ML algorithms that detect data skew, data drift, and anomalies.
- For any new data initiative, determine and document the DQ expectations. DQ is rarely perfect, but that shouldn't mean a potentially profitable analytics project gets shelved. Sometimes, if we accept initially lower data quality, we can create compelling analytics projects, with the understanding that we will improve quality over time.
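On the unit-testing idea, here's a minimal pytest-style sketch; clean_revenue() is a hypothetical transformation standing in for whatever your pipeline actually does:

```python
import pandas as pd

def clean_revenue(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing amounts and standardize to integer cents."""
    out = df.dropna(subset=["amount"]).copy()
    out["amount_cents"] = (out["amount"] * 100).round().astype(int)
    return out

def test_drops_missing_amounts():
    df = pd.DataFrame({"amount": [10.00, None, 19.99]})
    assert len(clean_revenue(df)) == 2

def test_rounds_to_cents():
    df = pd.DataFrame({"amount": [19.999]})
    assert clean_revenue(df)["amount_cents"].iloc[0] == 2000
```

Tests like these are cheap to write and catch exactly the class of bugs that otherwise surface as "data quality" complaints months later.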
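And on validation and drift monitoring, here's a minimal sketch using a two-sample Kolmogorov-Smirnov test from scipy; the thresholds, expected row counts, and synthetic data are all hypothetical:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=100, scale=15, size=5_000)      # last month's amounts
todays_batch = rng.normal(loc=110, scale=15, size=1_000)   # the mean has drifted

# Row-count sanity check: alert if today's volume swings more than 50%.
expected_daily_rows = 1_200  # from historical averages (hypothetical)
if not 0.5 * expected_daily_rows <= len(todays_batch) <= 1.5 * expected_daily_rows:
    print("ALERT: row count is outside the expected daily range")

# Distribution drift: a tiny p-value means today's data no longer looks
# like the reference -- a prompt to investigate, not an automatic failure.
stat, p_value = ks_2samp(reference, todays_batch)
if p_value < 0.01:
    print(f"ALERT: possible drift (KS statistic {stat:.3f}, p={p_value:.2e})")
```

Alerts like these are prompts to investigate, not automatic failures; a human still decides whether the skew is a bug or a genuine business change.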
The Microsoft Technology Center is a service that helps Microsoft customers on their Digital Transformation journey. We know that successful data projects are less about the technology and more about the process and people. (This is especially true for Data Quality.) We've been doing data for years. We are thought leaders, conference speakers, and former consultants and executives. We've learned the patterns that will help you execute successful DQ projects, and with the cloud we can execute in weeks instead of years.
Does this sound compelling? SUCCESS for the MTC is solving challenging problems for respected companies and their talented staff. Does that sound like folks you can trust on your next project? The Digital Transformation is here, and we know how to help. Would you like to engage?
Are you convinced your data or cloud project will be a success?
Most companies aren’t. I have lots of experience with these projects. I speak at conferences, host hackathon events, and am a prolific open source contributor. I love helping companies with Data problems. If that sounds like someone you can trust, contact me.
Thanks for reading. If you found this interesting please subscribe to my blog.
Dave Wentzel