DaveWentzel.com All Things Data
Azure SQL Data Warehouse: Problems, Performance, Things You Need to Understand
Submitted by Dave Wentzel on Mon, 2017-02-20 23:03
This post isn't going to teach you the basics of Azure SQL Data Warehouse (ASQLDW). This post is going to honestly cover the strengths and weaknesses of the product and will show you the *possible* pitfalls you may encounter. Let me say upfront, ASQLDW is a *terrific* product, if you understand *exactly* what the use-cases are for the tool. This is not an ASQLDW 101 post, if I write about a concept you don't understand I assume you can find that info using a basic google search. To be clear, I have tons of experience in the MPP space...I've used Vertica for years (and I understand the costs/benefits of MPP. Most of the same "gotchas" with ASQLDW/APS are also true of Redshift, Snowflake, and Vertica. Each tool has a slightly different take on how to overcome perceived MPP shortcomings when compared to a traditional RDBMS. Do your research!
This post will tell you things you should know if you are SERIOUSLY considering ASQLDW. This post doesn't have nice graphics and sample queries. This is an architecture discussion. If you feel I'm wrong about anything, let me know in the comments. I have the sample queries to back up my arguments, but I'd rather keep this conceptual vs building up strawman arguments. YMMV. What I've seen as shortcomings in my experience, you may not.
Why am I writing this?
There are a lot of misconceptions about ASQLDW in the community. ASQLDW gets its roots from the SQL Server APS/PDW product, which very few people use and understand. This was an expensive, esoteric offering. Since it's been "cloudified" it seems everyone is writing about it and presenting on it, yet very few of these folks have clearly used it.
ASQLDW is an MPP database, meaning it is SQL Server as the compute engine and Hadoop/HDFS as the storage engine. What does that mean? Google what an MPP database is, but in a nutshell, MPP databases are :
- aka "shared nothing". It is pure scale-out vs an SMP which is usually shared-disk.
- focused on gaining HUGE performance improvements on read workloads against star schemas (or similar).
- The emphasis is on "scans" of data, not seeks for singleton rows.
- data is stored (generally) as a columnstore
- very sensitive to fragmentatation (so be careful with your ETL designs). Many MPPs will not even allow updates to existing rows. When UPDATEs are allowed, they are written "off-row" until the entire row/block can be rewritten.
- data is stored, usually, with some amount of duplication on multiple HDFS/Hadoop nodes.
- since data is stored on possibly MANY hadoop nodes that means that processes that are sensitive to latency and require scanning all data on all nodes will not work well. For example...let's say you are INSERTing into a fact table that has a FK back to a dim table. Um, well, no you don't. There is NO DRI in most MPP databases (or the concept of DRI may be there, but is implemented differently than what you may be accustomed to)! Wait...there is no DRI? NO DRI. This is simply because the FK check would require a SEEK to confirm the key on potentially EVERY HDFS node. That won't scale...remember, I said MPPs aren't great at SEEK workloads. This also means PKs aren't allowed, nor table-level CHECK constraints. Take a minute to think about it and it should make sense if you understand MapReduce.
- You are going to need to re-think your ETL processes. There is no DRI. If your ETL relies on DRI, well, you better start planning. There are only 32 concurrent queries in ASQLDW...you don't want to devote too many to ETL. ETL is very different in the MPP world. More below.
- You don't just point your reporting tools at ASQLDW and let 'er rip. MPPs are not, generally, backing stores for operational reports. They are "analytics platforms". Think about tabular models in SSAS...that's how you should probably think about an MPP.
- meant for BIG DATA workloads. Analytics on BILLIONS of rows. Do you think a small 1TB EDW will perform well on a multi-node MapReduce system? Maybe, maybe not.
If the above points blew your mind I suggest you study ASQLDW more to make sure you UNDERSTAND all of the issues with an MPP. If you don't understand those archtiectural choices, you aren't ready for MPP or ASQLDW!!!
I get a lot of calls from companies that just bought Vertica. They are looking for a consultant to help them understand how to implement it. I always ask, "Why did you choose Vertica?" The answer always surprises me and is usually something like, "we are scared that our current database manager won't be 'cloud scale' and we are planning for the future." Bad rationale. You know you need Vertica when you exhausted all of the possibilities of your current database manager and can't scale anymore and you've brought in tuning experts for that DBMS and can't get it to work. Then you need to understand MPP and Vertica and all of the tradeoffs you are going to make when you take the jump. You don't implement Vertica lightly. The same mentality should be used with APS and ASQLDW.
I see a lot of presentations and stuff on the interwebs that basically say, "if you need Azure SQL dbs >1TB then use ASQLDW." Whoa, that's probably not gonna work. But, if you are going to consider ASQLDW you need to plan and do a POC to ensure it meets your needs. You will have a much better chance of success with ASQLDW if you have a greenfield EDW implementation, meaning you have control over the tools, code, and schema and can choose tooling that supports ASQLDW. If you have a brownfield EDW that you want to migrate to ASQLDW...you *will* hit issues. Most can be overcome. Some can't. I can assure you that you WILL have more gray hair afterwards, but you may get radically better performance too. Caveat emptor.
Here are issues that you need to understand and TEST
|Max 32 concurrent queries||You are only allowed 32 concurrent queries/processes *at the highest DWU tier*. So, if you are streaming data in, or have multiple concurrent ETL processes, or do parallel overnight cube builds...well, you need to plan. This is not a flaw in ASQLDW...this limitation exists with most MPP systems. For instance, Redshift only allows data loading through one connection. What people do is spin up a separate Redshift cluster solely to do their data loading. You may wonder why MPP systems suffer from this...simple...it's difficult to manage concurrency and isolation when your data is coming in as row form and the engine needs to build the columnstores. Even Vertica, which allows unlimited streaming into the db, will eventually go to a slower load mode once it can't transform the data to the columnstore fast enough due to capacity. Frankly, if you really understand MPP systems then you realize that this is NOT a limitation, you must simply design differently. If you find this to be problematic to your thinking, then I posit you are NOT ready for ASQLDW or MPP.|
|Your ETL mindset may need adjustment||If you are accustomed to using SSIS and rely on key enforcement from the database, well, rethink it. You can still push to ASQLDW using SSIS but the best performance will be Polybase loads which can only be done by writing to a blob and then issuing a CETAS over blobs. That's a few extra steps, but you'll get better performance. If you don't understand polybase loads, better google it now. Your ETL pipelines should be: Check keys in existing data --> load new data to staging/landing area, preferably via polybase --> check keys again -->move/copy/push/partition switch data into prod table --> update stats.|
|Choose Your ETL Tools Wisely||Fast data loads require polybase. Just because your ETL tool "supports ASQLDW" doesn't mean it is doing polybase. Also, your ETL tool may claim to do "parallel loads" but is really just making multiple connections to your ASQLDW. And, remember, 32 concurrent queries is all you get.|
|Some tools won't be able to even connect||ASQLDW doesn't support all aspects of SQL Server. Temp tables (#Temp) are not supported. Many tools make heavy use of these, hence, they won't work. But even basic jdbc connection options are not supported. For instance, if your java app uses client-side cursors it will fail to connect. ASQLDW, smartly, does not support client-side (or even server-side) cursors. The error is: Cursor support is not an implemented feature for SQL Server Parallel DataWarehousing TDS endpoint. The assumption is your tool should issue nice, well-formed "star schema" - like queries and retrieve the full result set. Client-side paging solutions, which generally use client-side cursors, are not the right use case for ASQLDW...as of today. Another example...ASQLDW does not support all transaction isolation levels and concurrency semantics. And did you know that the ONLY option for transaction isolation is READ UNCOMMITTED? If you think that is a deal-breaker, well, I suggest you check other MPPs. This is a FEATURE...and a very good one!!!|
|Watch your dynamic SQL||Some tools emit dynamic sql that is very ORM-like...meaning JOINs are done via a bunch of "SELECT FROM table1 WHERE ID IN (select ID FROM table2 WHERE ID IN (SELECT ID FROM table3)))). ASQLDW has a hard time compiling and optimizing that code and pushing the predicates down to the hdfs nodes. I've seen cases where the code won't even compile after you string about 7 of these "joins" together. I suggest you profile your tool and ensure it emits nice, standard looking SQL. Look at the query plans carefully. (BTW, an ASQDLW query plan looks very different from a standard SQL query plan).|
|Really understand your data distribution options and plan accordingly. Understand DMS (Data Movement Service) and how it works with your queries.||HASH and ROUND_ROBIN are your only options. Let's be honest...ROUND_ROBIN is almost a guarantee for performance problems at scale. So, really understand HASH and design your tables and data access accordingly. In APS there is an option for REPLICATE as well, and hopefully this is coming to ASQLDW soon. With REPLICATE you are saying that a given table's data is stored locally on each (virtual) hadoop/hdfs/compute node. This cuts down on DMS operations that need to copy data from one node to another to do filtering. In the APS/Vertica/Netezza world REPLICATE is a quick-and-dirty method to relieve data movements among nodes.|
|You won't be nearly as productive, initially, with your T-SQL Ninja skills||Remember, only a subset of TSQL is supported. Do you use a lot of temp tables? ASQLDW has them, but they are syntactically very different. Do you need cursors? Find another solution. Even simple things like using SELECT to set the value of a scalar variable is not supported (you have to use SET...us old-timers usually use SELECT and I REALLY STRUGGLE remembering this). Do you build dynamic sql by cursoring over sys.objects (or whatever) for instance to build dynamic DBA scripts? It'll still work...but it will take ORDERS of magnitude longer. Example...I build statements to update my stats...in SQL Server the process of *building* the string to EXEC takes a few milliseconds. In ASQLDW this will take 2-8 minutes!!! This makes sense if you remember it is MPP, but this really slows down your productivity.|
|Your Performance Monitoring Tools Probably Won't Work||ASQLDW has its own DMVs. Even the query plans look different. Even how you approach query tuning needs to change. In SQL Server how many times have we heard that we need to eliminate SCANs and build "covering indexes"? Guess what? You don't do that in ASQLDW.|
|Don't immediately dismiss ASQLDW if you run into a blocker||ASQLDW has alternative use cases than just an EDW/analytics backing store. For example, let's say you receive a lot of data into Azure storage blobs. While not load it to ASQLDW using polybase and then connect to that data using Azure SQL or SQL in a VM? This is a great use case. Using ASQLDW as an extension of my ETL/ELT workloads has saved me a lot of time during my load windows.|
Definitely do NOT use ASQLDW for these use cases
- OLTP. Not meant for that. MPP is not OLTP. Would you implement OLTP on hadoop/hdfs? Probably not.
- SEEK workloads. SEEKs won't scale on an MPP. Scans will.
- You just need an Azure/cloud-based SQL db that supports > 1 TB. Consider other approaches. Sharding, elastic queries, vertical partitioning, and even IaaS may be better choices. Or open a support ticket...odds are our Friends in Redmond can throw a switch and give you a larger SQL PaaS db.
- if your read workloads are sensitive to latency. Remember, ASQLDW is essentially doing MapReduce constructs under-the-covers. This means that for smaller workloads and result sets you are going to experience greater runtimes due to the inherent data movement of MR. At scale, with lots of data, what you lose with MR data movement you more than gain with MR efficiencies of "pushing code to the data".
If you are going to hire a ASQLDW Expert
...or YOU claim to be an expert...you better be able to answer the following questions. But first, I doubt anyone outside of Redmond is truly an "expert" in ASQLDW. I AM NOT. You need to separate guys who've used it, from guys who've read about it.
Ask these questions, the answers should be succinct and authoritative. Anything else means they read about it, but have never done it:
- in one sentence, what is difference between SQL Server and Azure SQL Data Warehouse?
- does Azure SQL Data Warehouse support DRI?
- Name an approach you use to do constraint enforcement
- what is polybase?
- briefly explain what a good ETL pipeline to ASQLDW should look like
- name a DMV that is unique to ASQLDW
- what is DMS and why is it critical?
Any candidate with experience in a competing MPP should get extra points because they'll understand some of the nuances with ASQLDW's unique MPP implementation. In fact, I'd almost rather hire someone with mediocre SQL Server experience but hands-on Redshift experience, vs a rock-star SQL Guy with no MPP experience.
Some questions for these candidates of competing MPPs:
- which competing MPPs have you used?
- How is ASQLDW different?
- Name a few limitations of that MPP
- What made you choose Redshift/snowflake/Vertica/whatever over your standard RDBMS?
- What were the good, bad, and ugly of your MPP?
Simply listen to the answers. Did the candidate sound authoritative? You are building an enterprise solution. Make sure your candidate is enterprise-class.
This post was not meant to scare you off from Azure SQL Data Warehouse. Rather, to educate you on architecture decisions you need to be aware of. Undertaking an MPP database should not be done lightly. It should not be done based on some marketing slides or blog posts by people who experimented with the product using AdventureWorksDW. There is too much misinformation on the internet regarding ASQLDW. Too many people "claim" to understand it yet they've only spun up POCs and have never worked with the tool at-scale. If you are going to try ASQLDW you need to be educated.
Having said all that, for the right use cases, Azure SQL Data Warehouse is a great product. I love it.