Data governance teams have not kept pace. They are not accustomed to granting access to data for experimentation. They want to know exactly what the data will be used for, and their default posture is to limit access. Certainly it makes sense to keep business users from directly querying operational systems that likely contain PHI and PII, but why keep a business user away from valuable data that could be monetized?
Too many business people equate “data governance” with being told NO. If that’s your company, you need to start changing that culture NOW. Companies that CANNOT leverage their data assets to make data-driven decisions will soon find themselves marginalized as their competition embraces the promise of Digital Transformation. Data governance should enable innovation, not be its roadblock.
Here are some common Data Governance Anti-Patterns we see at the MTC, and our recommendations for solutions.
Is your data governance team composed solely (or even mostly) of IT personnel? If so, I’ll wager your data governance is not aligned with your business goals. A company’s data is not OWNED by IT; it is owned by the business. IT is merely the steward. Business units should have representation on all governance committees. Data governance is multi-disciplinary and should be an enabler of business value.
Data governance is not a function of IT. It should be driven by the business in support of company objectives.
Too many times I hear “we want to use Azure Purview to help us with tagging, classification, access, and lineage of our data as part of our data governance project”. Azure Purview is great for that. The problem is that you’ve described a TECHNOLOGY, not a goal. When I probe deeper about the company’s strategic goals for its data assets, I don’t get clear answers. And I never hear: “Well, we talked to our business users and what they would really like to see is…”.
The data governance goals should drive the governance tool choices, the tool should not drive the governance goals.
Data governance is a journey, not a destination
Data lakes are the best and most often used tools for analytics. But I see technologists copying EVERY piece of operational data to the lake. Why? I can see no good reason to put credit card numbers or SSNs in a data lake. They serve no analytical purpose, and every copy you make of sensitive data increases your risk of a breach.
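If some of those identifiers really must be carried into the lake as join keys, a common compromise is to tokenize them at ingestion so the raw values never land. Here's a minimal sketch in Python; the field names and key handling are hypothetical, not a prescription:

```python
import hashlib
import hmac

# Hypothetical fields we refuse to land in the lake in raw form.
SENSITIVE_FIELDS = ("ssn", "credit_card_number")

# In practice the key comes from a secrets store (e.g., Key Vault),
# never from source code.
TOKEN_KEY = b"replace-with-a-managed-secret"

def tokenize(value: str) -> str:
    """One-way, keyed token: stable enough to join on, but the
    original SSN or card number never reaches the lake."""
    return hmac.new(TOKEN_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

def scrub_record(record: dict) -> dict:
    """Return a copy of the record that is safe to land."""
    return {k: tokenize(v) if k in SENSITIVE_FIELDS and v else v
            for k, v in record.items()}

row = {"customer_id": 42, "ssn": "123-45-6789", "state": "PA"}
print(scrub_record(row))  # ssn is now a deterministic token
```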
I’ve heard data leaders say they need some sensitive data in the lake so the lake can be “the single source of truth” and support operational reporting that requires this data. If that’s the case then you have 2 choices:
Data lakes (or whatever you use for your analytics sandbox) have “zones”. Everyone calls the zones by different names – “landing”, “raw”, “curated”, “bronze”, “silver”, “gold”. There are two key reasons why we have zones. Based on the zone name alone we should implicitly know: how refined and trustworthy the data is, and how tightly access to it should be governed.
Different areas of the data lake should have different governance approaches.
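One way to make “different governance per zone” concrete is to declare it as configuration that provisioning scripts and access reviews can read. A minimal sketch with made-up zone names and policy attributes:

```python
# Hypothetical per-zone governance policy; zone names and attributes are
# illustrative, not a standard.
ZONE_POLICIES = {
    "raw":     {"writers": {"etl-scheduler"}, "readers": {"data-scientists"}},
    "curated": {"writers": {"etl-scheduler"}, "readers": {"data-scientists",
                                                          "business-analysts"}},
    "gold":    {"writers": {"etl-scheduler"}, "readers": {"everyone"}},
    "sandbox": {"writers": {"analysts", "data-scientists"},
                "readers": {"owner-only"}},
}

def can_write(zone: str, group: str) -> bool:
    """Evaluate a write request against the declared policy for its zone."""
    return group in ZONE_POLICIES[zone]["writers"]

assert can_write("sandbox", "analysts")
assert not can_write("gold", "analysts")  # only the scheduler writes gold
```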
Self-service analytics REQUIRES a more “open” approach to data access. There is no other way.
Sometimes when I help customers with a difficult analytics problem I’ll have someone interrupt me and want to discuss the data governance implications of the outputs of our work. Said differently, they want to control the data we are trying to produce BEFORE WE’VE EVEN PRODUCED IT.
This is the wrong approach.
The better approach is to allow the analytics team to find “the nuggets of gold” first, and solve the business problems. Then, as part of a review process, let’s have a thoughtful conversation about how we should govern our new insights.
I call this “late governance”. We are deferring all governance decisions until we are sure we have generated a valuable business insight. I’ll say it again, data governance should not stifle innovation.
“Early” data governance is diametrically opposed to self-service analytics, which is a business goal for every customer I talk to.
Similarly, most analytics projects I work on will require the team to ingest new datasets. Proponents of “early governance” will insist that this new data be controlled, before we’ve even experimented to see if it provides lift. This stifles the experimentation process.
Some analytics personas will want access to the most raw, dirty data so they can spot trends and anomalies. Think data scientists. They will want access to the “raw” or “bronze” zone of the data lake. But this data would overwhelm and be misinterpreted by other personas (like business users). Let those folks see that the data exists (via the catalog), but make them request access and explain their reasoning, because you want to understand why they want raw data.

Other personas need “self-service” analytics but don’t really have SQL skills. These users should have access to summarized and “certified” data that they can use in Power BI or Excel. They would have access to the “gold” and “platinum” zones, which map closely to the semantic tier and the data warehouse facts and dims.

“Business analyst” personas that understand SQL might need access to more granular, yet “clean”, data. This maps closely to the “silver” or “curated” zone in most data lakes.

This structure allows far more flexibility. We can give everyone access to what we think they need based on their persona, while everyone retains visibility into ALL data and can request access when needed.
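Here's a sketch of that persona-to-zone mapping. The group and zone names are hypothetical; the point is that access defaults come from the persona while visibility stays universal via the catalog:

```python
# Hypothetical default zone access per persona. Everyone can *see* every
# dataset in the catalog; these defaults only control direct query access.
PERSONA_ZONES = {
    "data-scientist":   {"raw", "bronze", "silver", "gold"},
    "business-analyst": {"silver", "curated", "gold"},
    "business-user":    {"gold", "platinum"},  # summarized, certified data only
}

def default_access(persona: str, zone: str) -> str:
    """Grant by default, or route through an access request with a reason."""
    if zone in PERSONA_ZONES.get(persona, set()):
        return "granted"
    return "request-access"  # visible in the catalog; user must justify access

print(default_access("business-user", "raw"))  # -> request-access
```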
Never force your data scientists to use data warehouse data (facts and dims). It almost never works for their needs. They’ll get frustrated and you will stifle their innovation. Data lakes were originally built to store the data these personas need to do their jobs. Since then everyone else has seen the value of using the lake for analytics, where the data is easier to manipulate.
One of the goals of a data lake is to be “an analytics sandbox”. This means users will need to make copies of data for experimentation…they’ll need to semantically enrich that data (usually via SQL) and save the results somewhere they can reference and continue to build upon as they search for business value. To do that they need write access to the sandbox area of the lake. The sandbox is very much like a temp table in SQL Server.
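Roughly what that sandbox workflow looks like in PySpark, assuming a Spark environment already wired to the lake; the paths and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sandbox-enrichment").getOrCreate()

# Read from the curated zone (read-only for humans).
orders = spark.read.parquet(
    "abfss://lake@myaccount.dfs.core.windows.net/curated/orders")
orders.createOrReplaceTempView("orders")

# Semantically enrich via SQL -- the multi-step work that a single
# warehouse query rarely allows.
enriched = spark.sql("""
    SELECT customer_id,
           SUM(order_total) AS lifetime_value,
           COUNT(*)         AS order_count
    FROM orders
    GROUP BY customer_id
""")

# Persist to the analyst's own sandbox folder so the experiment can be
# referenced and built upon later.
enriched.write.mode("overwrite").parquet(
    "abfss://lake@myaccount.dfs.core.windows.net/sandbox/annie/customer_value")
```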
Human users, whether business analysts or data engineers, should never have write access to any area of the lake except the sandbox. The remainder of the lake should only be writable by the scheduler/job user. Not even the Ops Team should be allowed to write to the lake. This is a core lake governance concept.
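On ADLS Gen2 this write-access model maps directly to POSIX-style ACLs. A minimal sketch using the azure-storage-file-datalake SDK; the account name, filesystem, zone folders, and object IDs are all placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",  # placeholder
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("lake")

SCHEDULER_OID = "<scheduler-service-principal-object-id>"  # placeholder
ANALYSTS_OID  = "<analysts-aad-group-object-id>"           # placeholder

# Everything outside the sandbox: the scheduler writes, humans only read.
for zone in ("raw", "curated", "gold"):
    fs.get_directory_client(zone).set_access_control(
        acl=f"user::rwx,user:{SCHEDULER_OID}:rwx,"
            f"group:{ANALYSTS_OID}:r-x,group::---,mask::rwx,other::---"
    )

# The sandbox: humans get write access here, and only here.
fs.get_directory_client("sandbox").set_access_control(
    acl=f"user::rwx,group:{ANALYSTS_OID}:rwx,group::---,mask::rwx,other::---"
)

# A production setup would also add "default:" ACL entries (and use
# set_access_control_recursive) so newly created files inherit these rules.
```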
In the past the DBA never gave write access to the warehouse to common users. This made analytics very difficult since most analysts are not able to do all of the data enrichment they need in a single SQL statement. This is why it was always easier for a user to export the data to Excel and do the analytics there.
While we want users to be able to have a writable sandbox, we do NOT want them to have a shareable sandbox. Why?
Imagine your analyst, Annie, found a valuable business insight and has it persisted in her sandbox. She tells the Operations team she would like to move it to the data warehouse so it can eventually be added to a Power BI dashboard. The governance team finds out and wants to hold various meetings to understand the data better. The data warehouse team says they’ll need 3 months to integrate the data into the fact table. The dashboarding team wants to meet with Annie to understand…ugh…isn’t this exhausting? Annie found something valuable and she’s being punished. What does she do? She shares the data in her sandbox with her team…and likely later she shares it with everyone. Now the data is “published” in various Power BI dashboards and has zero governance. That’s not good.
Make people follow the governance process by not allowing ad hoc sharing. If users complain about stifled innovation then that means your governance process needs an overhaul. FIX IT!
You’ve probably heard the old aphorism: Without governance the data lake soon turns into a data swamp. Not true. If you follow all of the above advice there is NO WAY you’ll ever have a swamp. But make sure your processes aren’t so onerous that you are a drag on innovation.
How can you spot an organization that struggles with data governance? They have obviously inefficient business processes.
Business users will find all kinds of valid reasons to do things that violate governance rules. It’s OK to make exceptions to your governance plan.
Here’s an example I’ve seen many times: the governance team mandates that only Spark, Synapse, and other “approved” tools can be used to query the data lake. But eventually a department will purchase a query tool they want to use with THEIR data. Too many organizations will disallow this. Why? This really doesn’t make any sense. The compute engine (i.e., the “tool”) is not important, the data is.
With a data lake, the model should be “Bring Your Own Compute” (BYOC). Use the tool you want to use.
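BYOC in practice: two different engines querying the same lake data with no tool-specific gatekeeping. A sketch assuming the parquet file is reachable from both tools; DuckDB and pandas stand in for whatever a department happens to prefer:

```python
import duckdb
import pandas as pd

PATH = "lake/curated/sales.parquet"  # hypothetical path to the same lake data

# Department A prefers SQL via DuckDB...
top = duckdb.sql(
    f"SELECT region, SUM(amount) AS total FROM '{PATH}' "
    "GROUP BY region ORDER BY total DESC"
).df()

# ...Department B prefers pandas. Same data, same answer, different compute.
df = pd.read_parquet(PATH)
top_pd = df.groupby("region", as_index=False)["amount"].sum()
```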
This is just one example. Governance teams need to be aware of when exceptions to policies need to be made.
In so many companies I work with the users equate “data governance team” with NO. Don’t let this be your company.
Always do the ACL’ing in the data lake. This allows users to BYOC (see above). The tool no longer matters. The user simply logs in and the tool passes the credential to the lake to determine access.
Never apply the permissions at the compute tier, do all permissioning strictly at the data persistence tier…the lake.
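Here's what that credential flow looks like from any tool's perspective, sketched with the Azure SDK for Python. The account and path are placeholders, and DefaultAzureCredential stands in for whatever sign-in the tool surfaces; authorization happens entirely in the lake:

```python
from azure.core.exceptions import HttpResponseError
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# The tool contributes nothing to authorization: it forwards the signed-in
# user's credential and the lake's ACLs allow or deny the request.
service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",  # placeholder
    credential=DefaultAzureCredential(),  # the user's identity, not the tool's
)
file_client = service.get_file_system_client("lake").get_file_client(
    "curated/sales/2024.parquet")  # placeholder path

try:
    data = file_client.download_file().readall()  # succeeds only if ACLs allow
except HttpResponseError as exc:
    print(f"The lake denied access: {exc.message}")
```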
Forcing everything through a single, centrally governed data platform never works. Why? For the same reasons departmental data marts sprang up 20 years ago in spite of the corporate data warehouse: the business can’t wait for IT, so it builds its own solutions. We used to call these “skunkworks” projects or “Shadow IT”. These projects would solve the immediate business problem, but at the long-term expense of data governance, creating more and more data silos.
Every customer I talk to has multiple data lakes (“ponds?”). Most are structured around business units. My advice is not to stifle this innovation simply to conform to a corporate governance standard. Instead, assist these business units with their governance efforts by providing the tools and templates you are using for the corporate lake. Help them, don’t hinder them.
Data governance is not a “project”. Data is constantly changing, and so is the data management field. Data governance should be viewed as an on-going corporate “program”.
Is your data governance team embracing modern analytics notions like self-service analytics, late governance, BYOC, and zone-based access?
Data governance is a bigger problem today than ever. Companies are using data for ever more novel use cases and regulators are taking notice. But your governance strategy still has to foster innovation. Your company’s view of its data estate and its risk tolerance have likely evolved in the last few years. Self-service analytics is impossible to achieve with the outmoded governance anti-patterns I’ve outlined above. This is a delicate balancing act that the MTC understands very well. If you want to achieve Digital Transformation then you must realize these are issues of culture.
The Microsoft Technology Center is a service that helps customers on their Digital Transformation journey. We know that successful data governance efforts are less about the technology and more about modern processes…and people. Data governance is changing to support the dual mandates of heightened regulatory burdens and self-service initiatives. At the MTC, we’ve been doing this for years. We are thought leaders, conference speakers, and former consultants and executives. We’ve learned the patterns that will help you transform your governance programs. And with the Azure cloud and governance technologies like Azure Purview, we can execute in hours-to-days instead of months.
Does this sound compelling? SUCCESS for the MTC is solving challenging problems for respected companies and their talented staff. Does that sound like folks you can trust with your data? The Digital Transformation is here, and we know how to help. Would you like to engage?
Are you convinced your data or cloud project will be a success?
Most companies aren’t. I have lots of experience with these projects. I speak at conferences, host hackathon events, and am a prolific open source contributor. I love helping companies with Data problems. If that sounds like someone you can trust, contact me.
Thanks for reading. If you found this interesting please subscribe to my blog.
Dave Wentzel