Submitted by Dave Wentzel on Thu, 2013-09-26 02:09
"Could you take a look at ticket 56789?" I nodded in the affirmative. I was already behind on my existing tasks, so I was a little annoyed.
I opened the ticket in the system and read the following: "Customer says that procedure FooBar consistently executes in 5 seconds. Customer wants to know if there is any way we could make that a little bit faster. Customer feels this is not a priority, but since they use the procedure multiple times a day they would save some time."
Our system has almost 4000 stored procedures and I didn't immediately recognize the FooBar procedure as something I'd worked on. I went into our source control system and found FooBar. The last check-in was seven years ago and, surprisingly, it was by me. Even more surprisingly, the last check-in note stated: "Per Schlemiel, if you ever have to improve performance, remove line 262."
What? I was baffled. Why would I write a check-in comment like that? After pondering the issue for a few seconds, I remembered! I shook my head, chuckled, removed line 262, checked in my code, and went to find Schlemiel.
Flashback seven years ago...
I was still a new employee and was trying to make a name for myself. Our benevolent dictator at the time was Schlemiel. He has since been promoted to a bigger team with more responsibility. He was a well-respected, godlike architect and I admired him very much. Schlemiel always saw something in me, and I was his go-to guy when he had data issues.
Schlemiel asked me if I could look at the FooBar procedure because it was running a little slow. I asked him for the ticket number.
"12345. Just do this as a favor for me. A couple years back Schlomo wrote this FooBar procedure as a skunkworks project. We never thought customers would really use it so we never bothered to ensure it performed over larger data sets. Now it takes about 14 minutes to execute for most customers. Customers are starting to complain and Schlomo can't figure out what to fix. Take a look, make some recommendations, and we'll figure out how to prioritize."
I pulled up the ticketing system and opened 12345. Sure enough: "Customer states the FooBar procedure is totally unusable. It takes about 14 minutes to execute. User states she clicks submit and then goes for coffee and a bagel. Sometimes it's done by the time she gets back. Please fix it."
I pulled up the code for the FooBar procedure and began looking at it. Within 20 minutes I had figured it out. One of the queries joined 12 different tables and one of the JOIN/ON conditions was missing. Effectively there was a cartesian product causing billions and billions of rows to qualify for the final result set. Thankfully the WHERE clause was filtering out those rows before they were sent to the client. Through dumb luck the query was working, but it was not efficient.
I added the JOIN/ON condition and began writing some quick unit tests. My new version proved to be functionally equivalent but took orders of magnitude fewer reads and executed in 500 milliseconds vs about 9 minutes on my test server. I immediately scurried off to tell Schlemiel.
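A minimal sketch of the class of bug described above (the table and column names are invented for illustration; the real FooBar query joined 12 tables):

```sql
-- Broken: the join predicate for Orders is missing from the ON clause,
-- producing an intermediate cartesian product. The WHERE clause filters
-- the bogus rows late, so the results are correct by dumb luck.
SELECT o.OrderId, ol.LineTotal
FROM dbo.Orders o
JOIN dbo.OrderLines ol
  ON ol.CustomerId = o.CustomerId
WHERE ol.OrderId = o.OrderId;

-- Fixed: the predicate belongs in the ON clause, so it acts as a join
-- condition instead of a late filter over billions of rows.
SELECT o.OrderId, ol.LineTotal
FROM dbo.Orders o
JOIN dbo.OrderLines ol
  ON ol.CustomerId = o.CustomerId
 AND ol.OrderId = o.OrderId;
```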
Schlemiel was less-than-enthusiastic. "Now you've gone and gotten yourself a bit of a problem."
Puzzled, I replied, "Why? The tests are perfect, clearly this is a slam dunk. Simple oversight by Schlomo, if you will."
"I've no doubt you solved the problem, but the fact is, you made the performance TOO good. If you release this version of your procedure then management and the customer will never believe that such a dramatic increase in performance doesn't introduce too much risk. The change will never be approved. No one will ever believe that simply adding an ON clause could fix an issue like this. Instead, management and the customer will demand to see endless proofs and reams of unit tests before they have a comfort level. Meanwhile, you are creating agita for yourself and I frankly need you to work on other, more important issues."
As every word passed his lips I felt more and more deflated. "What do you suggest?"
"I would suggest adding a WAITFOR DELAY '00:00:05' somewhere in your code. Let's see that code." He peered over my shoulder, "Right here on Line 262 looks good to me. Once you've done that go ahead and check it in and note that I code reviewed and approved your work. And next time, try not to be so efficient."
The look of dejection on my face must have been overwhelming. Perhaps he noticed a stray tear running down my face. He tried to console me, "Dave, don't worry so much. You did a great job. You took a 14-minute procedure and got it down to 5 seconds. You're a hero. But just remember, today's hero can be tomorrow's goat. Someday you'll do something stupid and you won't be the hero. At that moment just remember that you can always go back and remove that artificial WAITFOR DELAY and you'll be the hero again. Remember, you're only as good as your last success."
Flashforward to the present...
"Hey Schlemiel," I found him in the hallway in another building, "remember seven years ago when I tweaked FooBar and it performed so well that you wanted me to put a WAITFOR DELAY in the code? Well, we just got another ticket by a customer asking us to tweak FooBar again. I took out the WAITFOR DELAY and checked the code back in."
I could feel my chest was puffed out to the fullest, I was proud of my accomplishment. Schlemiel paused a second and then broke out in a terrific belly laugh. "You idiot! I can't believe you actually checked in that WAITFOR DELAY. I was kidding of course. No sane person would do something like that. Didn't you realize I was kidding?"
I walked away...again...dejected. I had two major successes, both times the customer was thrilled, and both times I was the goat.
Submitted by Dave Wentzel on Thu, 2013-09-19 23:55
My manager recently quit and it's time to interview for a replacement. Whenever a manager quits it's common to hear the requisite, "I guess we'll have to train a new one" comments, but I'm going to take the process a little more seriously this time.
Technology is dominated by two types of people: those who understand what they do not manage and those who manage what they do not understand. --Putt's Law
If you are in IT and are not familiar with Archibald Putt, I suggest you stop reading this blog post, RIGHT NOW, and go buy the book Putt's Law and the Successful Technocrat: How to Win in the Information Age. Putt's Law, for short, is a combination of Dilbert and The Mythical Man-Month. It shows you exactly how managers of technologists think, how they got to where they are, and how they stay there. Just like Dilbert, you'll initially laugh, then you'll cry, because you'll realize just how true Putt's Law really is. But, unlike Dilbert, whose technologist-fans tend to have a revulsion for management, Putt tries to show the technologist how to become one of the despised. Granted, not all of us technologists have a desire to be management, but it is still useful to "know one's enemy."
Two amazing facts:
Archibald Putt is a pseudonym and his true identity has yet to be revealed. A true "Deep Throat" for us IT guys.
Putt's Law was written back in 1981. It amazes me how the Old IT Classics (Putt's Law, Mythical Man-Month, anything by Knuth) are even more relevant today than ever.
Every technical hierarchy, in time, develops a competence inversion. --Putt's Corollary
Putt's Corollary says that in a corporate technocracy, the more technically competent people will remain in charge of the technology, whereas the less competent will be promoted to management. That sounds a lot like The Peter Principle (another timeless classic written in 1969).
People rise to their level of incompetence. --Dave's Summary of the Peter Principle
I can tell you that managers have the least information about technical issues and they should be the last people making technical decisions. Period. I've often heard that managers are used as the arbiters of technical debates. Bad idea. Arbiters should always be the Benevolent Dictators (the most admired/revered technologist you have). The exception is when your manager is also your benevolent dictator, which is rare. Few humans have the capability, or time, for both.
I see more and more hit-and-run managers where I work. They feel as though they are the technical decision-makers. They attend technical meetings they were not invited to. Then they ask pointless, irrelevant questions that suck the energy out of the team. Then they want status updates hourly. Eventually after they have totally derailed the process they move along to some other, sexier problem with more management visibility.
I really admire managers who follow the MBWA (management by walking around) principle. This management philosophy is very simple...the best managers are those who leave their offices and observe. By observing they learn what the challenges are for their teams and how to help them better.
So, what am I looking for in a manager?
He knows he is the least qualified person to make a technical decision.
He is a facilitator. He knows how to help his technologists succeed.
Submitted by Dave Wentzel on Tue, 2013-09-17 11:22
"The Wooden Badger" ... this is the story of how I got that nickname. But first, let me say that I really don't "get" Monty Python. I rarely laugh at their sketches. But other people do. Especially programmers for some reason. The same thing happens with This is Spinal Tap. Programmers love it.
Here's an experiment:
Ask a random person if he enjoyed This is Spinal Tap
If the response is any of the following:
"But these go to Eleven" followed by excessive belly laughter
"It's like, how much more black could this be? And the answer is none. None ... more black."
"Remember when Harry Shearer got stuck in his pod?"
Then you've got yourself a programmer.
Frankly, I don't find Spinal Tap all that funny, but programmers do, so it's best to speak on their terms. Basically, when in Rome, speak Roman.
But I digress.
There is one scene from Monty Python and the Holy Grail that is hilarious. King Arthur and his Knights of the Round Table lay siege to a French castle. They build a Trojan Horse, except their version is a Trojan rabbit. At the end of the scene we see the English watching at a distance while the French wheel the wooden rabbit into the castle. King Arthur asks how this plan is supposed to work. "Now we wait until nightfall and we all jump out of the rabbit, taking the French by surprise." They all simultaneously realize the flaw in their plan and slap their foreheads. Galahad responds, "Um, well look, if we built this large, wooden badger..." Simultaneous groans could be heard.
So, why do programmers in particular find this so funny? My theory is that we see ourselves in it: we do the same thing over and over again expecting different results. In my experience, the different results don't happen much. For instance, we all know that not having proper declarative referential integrity is asking for trouble. Yet on every "new development" project I work on I invariably hear someone say, "we don't need foreign keys, we'll actually be able to ensure integrity at the Java tier this time." And somehow the data folks always tend to lose the argument and the initial version of the system is delivered without keys.
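The declarative referential integrity mentioned above is one line of DDL. A sketch, with hypothetical tables:

```sql
CREATE TABLE dbo.Customers (
    CustomerId INT NOT NULL PRIMARY KEY
);

CREATE TABLE dbo.Orders (
    OrderId    INT NOT NULL PRIMARY KEY,
    CustomerId INT NOT NULL,
    -- One constraint; the engine enforces it no matter which
    -- Java (or other) tier forgets to.
    CONSTRAINT FK_Orders_Customers
        FOREIGN KEY (CustomerId) REFERENCES dbo.Customers (CustomerId)
);
```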
There is something even worse than "repeated insanity" and that is the Second System Syndrome. In this case, we take an elegant, simple system, and we refactor it and add needless complexity and call it Version 2. This is much worse because you HAD a working system, now you don't. Imagine what King Arthur's wooden badger would've looked like. More comfortable seating, gilded ornamentation...basically a bunch of useless stuff.
So, how did I get the "Wooden Badger" nickname?
It all started with two sprints left in the development cycle. I was asked to do a "whiteboard" design review for an ordered queueing system. Generally a whiteboard review session occurs in the first few sprints, because if the design is faulty it's best to know that before too much code is written. There is only one reason to hold a design review this late in the cycle: someone is married to their design approach, knows it is not right, and knows that I will refute it, which would cause lots of rework. My arguments, architecturally sound as they always are, can be rebutted with a simple sentence: "Dave, you know, you may be right, but we have to get this release out the door, so we can't go back to the drawing board now." And this argument only works if it is used at the last minute.
So I went to whiteboard session and learned about the requirements first:
There are multiple queues, some ordered, some not, some ordered by different, varying keys.
We must support tens of thousands of messages per minute.
This doesn't sound like anything I haven't seen before. Let's look at the design:
The queues will be stored in the database as a single master queue table (not a Service Broker queue).
There will be 3 outrigger tables to handle supporting details. No foreign keys needed. And since we support multiple, varying keys, let's store them as XML and key-value pairs.
Ordering logic, when needed, will be performed during enqueueing. And that will require shredding the XML for existing data to determine the keys.
Since ordering logic is so complex, ordered queue processing must be single-threaded, so the design uses SQL Server's applock feature...which is a big mutex.
The queues will not "drain", they will keep 14 days of successful messages in the queue tables, and 30 days for any failure messages.
Support people will be able to query the queue tables real-time to look at throughput metrics or to look at individual message routing status.
If you didn't understand the queueing mumbo jumbo above, never fear. Rest assured that EVERY bullet point is a well-known, established anti-pattern. So I began my retort:
Why are we not using JMS? It is an anti-pattern to build queues in a database.
"It's too hard to support ordering across multiple JMSs without going single-threaded."
But an applock is a way to make sure a process in SQL Server is single-threaded.
"Well, we like this pattern better."
Why not use Service Broker?
"What is System Broker?"
Why not have separate queues for ordered and unordered processing?
"Well, it's too late to change the design."
Why not drain the queues and on dequeueing simply write the dequeued data to a History table?
"We can do that in Version 2."
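The "applock" in the exchange above refers to SQL Server's sp_getapplock/sp_releaseapplock. A minimal sketch of how it serializes a code path (the resource name here is invented):

```sql
BEGIN TRAN;

DECLARE @rc INT;
EXEC @rc = sp_getapplock
        @Resource    = 'FooBarOrderedQueue',  -- hypothetical lock name
        @LockMode    = 'Exclusive',
        @LockOwner   = 'Transaction',
        @LockTimeout = 5000;                  -- ms; a negative @rc means the lock was not granted

IF @rc >= 0
BEGIN
    -- Only one session at a time reaches this point for this resource:
    -- exactly the single-threading (and the bottleneck) described above.
    -- ...enqueue/ordering work here...
    PRINT 'lock held';
END

COMMIT;  -- transaction-owned applocks are released at commit/rollback
```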
Clearly the designers were happy and nothing was going to change. There was consensus that some people would look at my concerns and consider tweaking the design for Version 2.
Six months later our first customer experienced queueing problems. I again expressed my design concerns and the consensus was that we would begin working on them.
But of course break/fix work is not sexy like feature functionality, so the work was de-prioritized. Another 6 months elapsed until the next customer experienced problems. This time we hit the "knee of the curve" and the design simply could not scale any further. I again suggested we prioritize my simple design changes. I made the case that my few changes above would yield radical improvements without a full-scale redesign. We could tweak the implementation and design to be more of a standard queueing system.
(Many developers relate to Star Trek too)
"It sounds like you want to build another one of your wooden badgers Dave."
Me: "I'm sorry, but I'm just trying to adapt a bad design a bit and make it follow best practices a little more. I think the final design won't be perfect, but we'll get the scalability without the cost of a full redesign."
"No, you signed off on the last design, now you just want to keep tweaking it hoping for different results every time. When will you learn that we can't keep designing Wooden Badgers? We have to do things using industry best practices. So, your new nickname is Badger and we are going to redesign this the right way this time."
"Sounds good to me."
UPDATE: We've had a few additional large customer outages due to this design. Each time I've been asked to participate in applying band-aid fixes. And after every incident I always ask my colleagues, "So, Badgers, when are you going to have that redesign done?"
Submitted by Dave Wentzel on Tue, 2013-09-10 09:18
...as if you really needed another one.
I was poking around in SQL Server 2012 and I noticed a new feature of the XML SHOWPLAN. You can now determine why a given query plan will be serial instead of parallel.
I, of course, knew this already, but this will be another arrow in my quiver of proof for why, in many cases, TSQL scalar UDFs are evil.
I'm actually more interested in finding out what the allowable values are for NonParallelPlanReason. I haven't seen this documented yet, and I'm not really sure I care why a plan would/would not go parallel, but this is interesting.
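A sketch of how to pull the attribute out of the plan cache (this assumes the attribute sits on the QueryPlan element of the XML showplan, which is how it appeared in my poking around; I'm not claiming the full list of possible values):

```sql
-- Find cached plans that went serial, and the stated reason (SQL Server 2012+).
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (20)
    qp.query_plan.value(
        '(//QueryPlan/@NonParallelPlanReason)[1]', 'varchar(200)') AS reason,
    qs.total_worker_time,
    st.text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
WHERE qp.query_plan.exist('//QueryPlan[@NonParallelPlanReason]') = 1
ORDER BY qs.total_worker_time DESC;
```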
BTW, I've written NUMEROUS times about the performance problems with TSQL scalar UDFs:
Parallelism ... yes, I did know about UDFs forcing serial plans
Submitted by Dave Wentzel on Thu, 2013-09-05 16:53
And The Developer went to the Oracle and asked, "If programming in Delphi, should I use bind variables? What if I'm programming in Cleveland? Or Visual Basic?"
Tom Kyte, aka the Oracle, pondered the question for 3 days while sitting over a crack on a rock, inhaling noxious volcanic fumes. He then, suddenly, responded. "You will bind never performance will suffer."
Immediately the Oracle fainted from the fumes.
The Developer tried to rouse him to consciousness. He did not understand the Oracle's prophecy. Should the missing comma come BEFORE or AFTER the word "NEVER"?
The Developer could not wait for the Oracle to awaken, the Kanban card was already in the "blocked" state for too long and The ScrumMaster was angry. What should he do?
At that moment Bill Gates appeared in a burning bush.
The Developer: "Father, are bind variables good?"
"Son, we haven't the concept. The closest thing is prepared execution. As with all things my son, imbibe in moderation, for thy performance cup shall never runneth over."
The burning bush disappeared. Still The Developer was confused.
The Developer came across The Great Google and structured his query thusly, "When sending a repeated SQL statement to my RDBMS, should I use a cacheable, parameterized statement?"
He clicked "I'm Feeling Lucky"...because, well, he was.
Stack Overflow responded immediately with The Commandments:
"Thou shalt not commit non-parameterized querying." Regardless of platform…prepared execution/bind variables in ad hoc SQL statements are thy friend. On modern RDBMSs every single call should be parameterized. Always. No exception. Ever.
"Remember to bind, to keep Oracle holy." It is a PITA (pain in thine ass) to return result sets from Oracle stored procedures/packages which is why bind variables are the cornerstone of good performance on that platform.
"Honor thy SQL Server Stored Procedure." They are generally much better than adhoc parameterized statements for a litany of reasons.
No need to pollute thine query cache with thy garbage singleton statements.
Stored procs are easier to change (SQL can be recompiled on the fly with any kind of fix…I don't need to monkey with externalized SQL embedded in compiled Java).
I can use lots of different kinds of "hinting" that I can't use in ad hoc sql…changing isolation levels, locking paradigms, plan guides, etc.
It's just good design. Best to keep data logic at the data tier.
Decrease the "network effect"…a stored proc call will always be more compact than an equivalent parameterized ad hoc statement. Think a couple bytes don’t matter? Try developing using "the cloud" with SQL Server on EC2 or Azure without using stored procs. The network effect literally kills performance.
Your friendly neighborhood DBA can't "profile" when you need help. "All I see is that you ran sp_execute 15, 1417. Sorry, I can't help you. I can't find a stored procedure named '15,1417' in my database."
If you use stored procs you can always "bear false witness against thy DBA." (Seriously. Just say, "I don't know why that stored proc isn't performing, isn't performance the DBA's job?" It works flawlessly).
"Thou Shalt not be Chatty, but should remember to be Chunky, and keep it holy." Regardless of platform, if you are going to send the same SQL statement a gazillion times a minute (say, because you are too lazy to "batch up" your INSERT statements) then prepared execution isn't going to help much. The performance benefits of a prepared INSERT statement may save a millisecond, but the real killer is the network chattiness, which can be 16, or sometimes even 50, milliseconds per call. Best to be chunky.
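On SQL Server, the parameterized ("bound") flavor of the commandments above looks like sp_executesql versus string concatenation (table and column names here are invented for illustration):

```sql
DECLARE @name NVARCHAR(50) = N'Wentzel';

-- Bad: a new singleton plan per distinct literal polluting the query
-- cache, plus an injection risk.
DECLARE @bad NVARCHAR(200) =
    N'SELECT * FROM dbo.Customers WHERE LastName = ''' + @name + N'''';
EXEC (@bad);

-- Good: one cached, reusable plan regardless of the value.
EXEC sp_executesql
    N'SELECT * FROM dbo.Customers WHERE LastName = @p1',
    N'@p1 NVARCHAR(50)',
    @p1 = @name;
```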
<TODO insert picture of chunky ol' Dave Wentzel here. Until then, here's another "Chunky">