DaveWentzel.com - All Things Data

November 2010

Denali New Features - TempDB enhancements

In my last post on Denali New Features I mentioned that tempdb can now create temp tables using the collation of the calling database rather than the collation of the instance.  This is a huge deal.  In a non-contained (or pre-Denali) database any temp tables you create inherit the collation of the instance (actually of the model db, which is the template for tempdb whenever the instance is restarted).  In most cases this is not a big deal.  In fact, when SQL 2000 was released and we were first allowed to have databases with different collations, people thought this was a panacea.  In practice it wasn't, because every application/database I've ever worked with utilizes temp tables, and in these situations you will likely get collation mismatch errors (I'll demo below).  Granted, there are workarounds, such as declaring your temp table columns with an explicit collation, but they aren't easy.  
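
The usual workaround looks something like this (a sketch using the tables from the demo below):

CREATE TABLE #bar (colB NVARCHAR(200) COLLATE DATABASE_DEFAULT);  --resolve to the calling db's collation
SELECT * FROM #bar WHERE colB IN (SELECT colA FROM foo);  --no Msg 468 this time
DROP TABLE #bar;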

In Denali a non-contained db continues to behave as it always has regarding collation mismatches.  An example:

PRINT 'This codeblock generates the collation conflict'
USE [master];
select collation_name AS 'Master Collation' from sys.databases where name = 'master';
CREATE DATABASE CollationError COLLATE SQL_Lithuanian_Cp1257_CI_AS
GO
USE CollationError;
CREATE TABLE foo (colA NVARCHAR(200));
CREATE TABLE #bar (colB NVARCHAR(200));
SELECT * FROM #bar WHERE colB IN (SELECT colA FROM foo); 
DROP TABLE #bar;
USE master;
DROP DATABASE CollationError;
 
This codeblock generates the collation conflict
Master Collation
--------------------------------------------------------------------------------------------------------------------------------
SQL_Latin1_General_CP1_CI_AS
 
(1 row(s) affected)
 
Msg 468, Level 16, State 9, Line 4
Cannot resolve the collation conflict between "SQL_Lithuanian_CP1257_CI_AS" and "SQL_Latin1_General_CP1_CI_AS" in the equal to operation.

But in a contained db temp tables are created in the collation of the calling db, not the instance (model) collation.  An example:

PRINT 'This codeblock works due to the containment!!'
USE [master];
PRINT 'we have to change some settings first...'
EXEC sp_configure 'show advanced options',1
RECONFIGURE
EXEC sp_configure 'contained database authentication' , 1
RECONFIGURE
GO
select collation_name AS 'Master Collation' from sys.databases where name = 'master';
CREATE DATABASE CollationError CONTAINMENT = PARTIAL COLLATE SQL_Lithuanian_Cp1257_CI_AS
GO
USE CollationError;
CREATE TABLE foo (colA NVARCHAR(200));
CREATE TABLE #bar (colB NVARCHAR(200));
SELECT * FROM #bar WHERE colB IN (SELECT colA FROM foo); 
DROP TABLE #bar;
USE master;
DROP DATABASE CollationError;
 
 
This codeblock works due to the containment!!
we have to change some settings first...
Configuration option 'show advanced options' changed from 1 to 1. Run the RECONFIGURE statement to install.
Configuration option 'contained database authentication' changed from 1 to 1. Run the RECONFIGURE statement to install.
Master Collation
--------------------------------------------------------------------------------------------------------------------------------
SQL_Latin1_General_CP1_CI_AS
 
(1 row(s) affected)
 
colB
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 
(0 row(s) affected)

Very cool indeed.


Denali New Features - Contained Databases

I always wished SQL Server had some sort of "post restore script" option that could be attached to any database.  It's common for production databases to be "passed around" development and QA.  Most of the people involved do not really understand SQL Server; take, for instance, the concept of logins vs users.  After restoring a database to a different instance and attempting to log in through your application, it's common to see some kind of "user does not exist" error.  The simple fix is to sync the server logins and database users.  It would be nice to be able to programmatically have things like this happen after any restore.  In every environment I've ever worked in I've created a "sync" stored procedure that handles these things and then trained everyone to run "sync" after any restore.  What goes in sync? (a minimal sketch follows the list):

  1. If my schema owner login account does not exist, create it.
  2. Run sp_change_users_login for the schema owner, and any other accounts that may exist, and print out a list of orphans at the end.
  3. Run sp_changedbowner (needed for TRUSTWORTHY settings).
  4. ALTER DATABASE SET TRUSTWORTHY ON...we need this so certain things like xp_cmdshell, Service Broker, and IMPERSONATION work correctly.  
  5. GRANT IMPERSONATE ON USER::dbo TO [OurCustomSchemaOwner];
  6. ALTER DATABASE ENABLE_BROKER/NEW_BROKER logic.  We use Service Broker and it is possible that a user will have two different databases on their machine with the same Service Broker GUID (due to "sharing" copies of prod databases).  When we restore we need to ensure SSBS is enabled or create a new GUID.  We also like to make sure nothing is "stuck" in a queue.  
  7. Recreate db-specific SQL Agent jobs.
  8. Recreate linked servers.
  9. Re-register CLR objects.
  10. You could put obfuscation logic here for things like social security numbers for regulatory compliance.  
  11. Things like UPDATE STATISTICS can also be handy. 
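
Here is a minimal sketch of the first few steps (the login name and password are hypothetical placeholders; a real version would cover all eleven items):

CREATE PROCEDURE dbo.sync
AS
BEGIN
       --1. create the schema owner login if the restore landed on a new instance
       IF NOT EXISTS (SELECT * FROM sys.server_principals WHERE name = N'OurCustomSchemaOwner')
              CREATE LOGIN OurCustomSchemaOwner WITH PASSWORD = N'ChangeMe!123';
       --2. re-map the orphaned db user to the login, then report any remaining orphans
       EXEC sp_change_users_login 'Update_One', 'OurCustomSchemaOwner', 'OurCustomSchemaOwner';
       EXEC sp_change_users_login 'Report';
       --3 and 4. reset the db owner and TRUSTWORTHY so impersonation and SSBS work
       EXEC sp_changedbowner 'sa';
       DECLARE @sql nvarchar(max) = N'ALTER DATABASE ' + QUOTENAME(DB_NAME()) + N' SET TRUSTWORTHY ON';
       EXEC (@sql);
       --5. let the schema owner impersonate dbo (assumes a matching db user exists)
       GRANT IMPERSONATE ON USER::dbo TO OurCustomSchemaOwner;
END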

In Denali we now have the concept of a "contained database", which removes some of the affinities between the "application" (your database) and the instance it lives on.  Features (as of CTP1):

  1. The ability to create DB users without server logins.  
  2. A new DMV (sys.dm_db_uncontained_entities) shows you what *might* break the containment rules.  One example: code referencing a 3 or 4 part name.  (Both are sketched below.)
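
A quick sketch of both features (the user name and password are hypothetical, and this assumes a CONTAINMENT = PARTIAL database like the one created in the tempdb post):

USE CollationError;
--a database user with its own password...no server login required
CREATE USER ContainedAppUser WITH PASSWORD = N'ChangeMe!123';
--what in this database might break containment?
SELECT class_desc, statement_type, feature_name
FROM sys.dm_db_uncontained_entities;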

Benefits:

  1. During initial development you may not know where your application is going to "live".  By making the db "contained" you can be sure that when the db moves to QA/staging/production you haven't missed any dependencies.
  2. It makes administrators' lives easier since there is less to move when a db moves.  Clearly this is a benefit for production copies that are restored to test and dev, but also for DR.  

Contained databases won't totally obviate the need for my "sync" procedure, but it will be another valuable component of my arsenal.  

I'll show you an example of how to create a contained database in the next post, Denali New Features - TempDB enhancements.


My Interview Brain Teasers Part 4

This is the next part of my series on how I like to conduct interviews.  

I have 2 other brain teasers I might also use when interviewing a candidate, depending on circumstances.  

The Parts Problem

I use this example as part of a troubleshooting exercise.  

Assume you work for a manufacturing company that has 27 machines that all produce the exact same product, a 1 oz gold coin.  However, one of the machines is producing a coin that weighs only 0.9 ounces.  You have a stack of coins from each machine.  Using a single scale and a single weighing, how do you determine which machine is producing the faulty coins?  

You'll find variants of this on the internet but I like to change it a little so it is less obvious to those who have seen the brainteaser and its answer but don't understand the reasoning behind the answer, and also to make it relevant.  I'm hoping the candidate again breaks down the problem and realizes that the number of machines really doesn't matter.  Assuming 2 machines, how do you determine the problem machine?  Take 1 coin from MachineA and 2 coins from MachineB.  If the weighing is 2.9 then the problem is MachineA; if it is 2.8 then the issue is MachineB.  I'll help the candidate get this far in the reasoning.  From there I would expect him to extrapolate 3, 4, and 5 machines and their weighings.  By then I would expect the algorithm to be obvious and easily calculable for 27 machines: take k coins from machine k, and the shortfall from the expected weight, in tenths of an ounce, is the number of the faulty machine.  
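
A quick sanity check of the arithmetic (the choice of machine 13 is arbitrary):

DECLARE @n int = 27, @faulty int = 13;  --pretend machine 13 is the light one
DECLARE @expected decimal(9,1) = @n * (@n + 1) / 2.0;          --378.0 oz if every coin were true
DECLARE @actual   decimal(9,1) = @expected - (@faulty * 0.1);  --13 light coins leave the scale 1.3 oz short
SELECT (@expected - @actual) * 10 AS FaultyMachine;            --returns 13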

The Salary Problem

This is another fun one for me.  It again tests whether the person can deal with changing requirements quickly.  I once had a candidate who had taken a college class in cryptography, a subject I find interesting but, frankly, baffling.  But I do understand the basics, such as shared-password and public-key encryption, and I think I understand when to use each.  The candidate rattled off the differences between AES, DES, and 3DES.  Wow!  But did he understand heuristically how to use these things?  Let's find out.  Here's the problem (my boss wasn't thrilled when he was present in an interview where I used this):

Our company, similar to most others, has a policy against employees divulging their salaries to other employees.  In fact it is grounds for immediate termination.  But I think we could all agree that knowing each other's salary (or an average salary) would certainly help us to negotiate raises better.  So, how would we all determine our average salary without anyone knowing anyone else's individual salary?

Since these job candidates are technical in nature I usually get an answer like, "I'd build a webpage where everyone anonymously connected and entered their salary, which was stored encrypted in a table using 3DES.  After everyone did this an average would be calculated."  That's an ugly answer for so many reasons.  How can I assume the developer didn't backdoor the encryption?  I usually get a dumb look after mentioning that.  OK then, let's change requirements: this project must be completed in the next 5 minutes.  By now the candidate realizes a coding solution is out of the question.  That's important.  Sometimes coding a solution is NOT the answer to a given problem.  

Then I usually get something like, "Everyone writes down their salary on a slip of paper and puts it in a hat.  We can average that."  OK, it's true that it meets requirements, but the fact is that I can likely deduce who makes what individual salary based on seniority and intuition.  So let's assume that not only can no one know anyone else's individual salary, but no one can know *any* individual salary.  

Again, wait for 30 seconds until the dumb looks subside.

For my crypto dude I mentioned that his crypto class gave him his answer, if he thinks about it.  It's very simple one-way encryption.  One employee writes a random number (it probably should be 5 or 6 digits, or if your company is very generous, 7 digits) on a slip of paper and passes it to the next person, who adds his salary and passes it to the next person, who adds his salary, etc, until the paper is returned to the originator.  The originator adds his own salary, subtracts his original random number, divides by the number of employees to get the average, and destroys the paper.  
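
A sketch of the protocol, just to make the arithmetic concrete (the salaries are made up):

DECLARE @salaries TABLE (id int, salary int);
INSERT @salaries VALUES (1, 90000), (2, 75000), (3, 120000);
DECLARE @blind int = ABS(CHECKSUM(NEWID())) % 1000000;  --the originator's random number
DECLARE @paper int = @blind;
SELECT @paper = @paper + salary FROM @salaries;         --each person adds his salary to the running total
SELECT (@paper - @blind) / (SELECT COUNT(*) FROM @salaries) AS AvgSalary;  --95000, no individual salary revealed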

But I really like to see candidates who can think about solutions that don't involve technology.  As engineers we tend to find answers using our specialized tools.  This is often *wrong*.  Sometimes the answer is much simpler.  For instance, if you like building things (like me) you probably like to measure everything before you cut it ("measure twice, cut once").  But that is wrong.  The real goal is to never measure in the first place!  For instance, if I need to replicate a 2x4's length to sister a new board to it, I would *not* measure it.  I would put my new board next to it and draw a pencil line directly where the first board ends.  No measuring tool needed at all.  

When you tell the candidate to solve it without technology it is interesting to see what the answers may be.  In reality I've never had a single candidate solve this problem.  I usually give a hint that I don't need an exact average; you can round it to the nearest thousand dollars.  At that point I've heard plenty of answers, my favorite being: everyone puts a penny in their coffee cup for each thousand they make.  Everyone comes to the break room and on the count of 3 empties their cup into a trash bag.  Everyone counts the pennies and divides by the number of employees to get the average.  


Denali New Features - Better Paging

Paging is the process of returning a specific window of ordered rows from a result set.  Think of a Google search that yields a million hits but displays the first 20 on a page.  When you click Next you get results 21-40, etc.  There are lots of ways to do paging in SQL Server, many of which I have written about before.  I believe the best performing solutions use some variant of ROW_NUMBER() OVER and CTEs.  In Denali there is a new TSQL extension for this.  First, spt_values, if you are not familiar with it, is a table of numbers from 1 to 2047 (assuming you use my WHERE clause).  

PRINT 'show me the next 10 rows starting after row 100'
SELECT number
FROM (select number from master..spt_values where type = 'P' and number > 0) tally
ORDER BY number
OFFSET 100 ROWS FETCH NEXT 10 ROWS ONLY;
 
PRINT 'first 5 rows'
SELECT number
FROM (select number from master..spt_values where type = 'P' and number > 0) tally
ORDER BY number
OFFSET 0 ROWS FETCH FIRST 5 ROWS ONLY;  --FIRST and NEXT function identically
 
PRINT 'use some variables'
DECLARE @firstrow int = 500, @pagesize int = 3
SELECT number
FROM (select number from master..spt_values where type = 'P' and number > 0) tally
ORDER BY number
OFFSET @firstrow ROWS FETCH NEXT @pagesize ROWS ONLY; 

show me the next 10 rows starting after row 100
number
-----------
101
102
103
104
105
106
107
108
109
110

(10 row(s) affected)

first 5 rows
number
-----------
1
2
3
4
5

(5 row(s) affected)

use some variables
number
-----------
501
502
503

(3 row(s) affected)

Note that you must have an ORDER BY clause since this is actually an extension of that functionality.  

The one thing that is missing is that it would be nice to also return the total number of rows in the full result set (or maybe get it as part of an OUTPUT clause).  This has already been requested by others.  This would help in displaying the total number of page buttons, not to mention the "total results" count that displays after your Google search.  
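
In the meantime, one workaround (my sketch, not a Denali feature) is a windowed aggregate, which returns the total row count on every row of the page at the cost of counting the full result set:

SELECT number, COUNT(*) OVER () AS TotalRows  --2047 on every row
FROM (select number from master..spt_values where type = 'P' and number > 0) tally
ORDER BY number
OFFSET 100 ROWS FETCH NEXT 10 ROWS ONLY;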

Very useful.  However, before I would ever implement something like this I would want to do some performance comparisons.  When potentially scanning millions of rows of data I would want to know that this is not a performance drain.  It's not fair to test this for performance issues while it is in CTP1, however.  In the past CTEs were touted as the best thing since sliced bread, yet the implementation (especially with recursive CTEs) showed a lot of performance problems.  


How To Deceive With Charts and Graphs - Part 2

In Part 1 I covered using statistics inappropriately for qualitative analysis.  

Take a look at the following graphic from a typical "Stocks in the News" section of The Wall Street Journal. 

[WSJ "Stocks in the News" graphic]
Take Exxon Mobil specifically.  It's at $91 and the movement is about 1% over the sample period.  Yet the graph's min and max are $50 and $125.  Given the data plots for Sept to Dec, the range never fluctuates out of the $75-$100 band.  Should we infer that further in the past the range was wider?  We don't know.  A better graph would have a range of $75-$100 so the reader could more closely see any potential trend or volatility, at least in the short term.  Yet these WSJ charts are still presented this way...to this day. 
 
You can take this too far though.  Suppose you were the author of the accompanying article about ExxonMobil and you owned a ton of shares.  You want your shares to rocket higher from readers getting excited by the graph.  How do you do that?  Take the Y axis and shrink the band further to a range of $85-$95, then shrink the X axis to just Nov-Dec.  That would really look like a rocket stock!!!
 
Here is an example using a LoadRunner graph.

[LoadRunner dual-scale graph]
Here we have a dual scale graph.  The scale on the right lists the number of vusers for the given test.  It is the green line that is almost off the top of the graph.  That's generally not a good scale...leave a little white space at the top and bottom of your scale.  To make matters worse, the left scale quantifies the resource usage shown by all of the other lines plotted on the graph...and they are cramped into the bottom two-sevenths of the graph (no value rises above 2), making this even more unreadable. 
 
Y Axis Manipulation
I'm sure you've seen graphs where the Y axis does not start at 0 but has a "break mark" on the axis to indicate the fact.  For instance, a DJIA chart of the past 10 years might show a range of 5000-15000, so the Y axis will start at 5000 rather than 0.  The "break mark" indicates to the observer that the graph is not to scale.  This is the correct way to display this type of graph. 
 
In other cases, such as a magazine that wants to show trends in the Dow, the chart will not show the break mark and will just start at 0.  This means the bulk of the graph is wasted, since the Dow's movement occurred in the narrow band of 5000-15000...the first 5000 points convey nothing. 
 
However, sometimes the break mark is omitted purposefully, to deceive.  Consider this gold graph. 

[graph of South African gold production, 1950-1983]
A cursory look may lead you to believe that South Africa's percentage share of world gold production increased up to 1983.  But you'd be wrong.  Why?  First, note the Y axis starts at 10, not 0, and the graph has no break mark.  On a percentage basis, South Africa's production hasn't really changed.  In 1950 South Africa produced about 13 million out of 28 million ounces (46%), and in 1980 they produced about 20 million out of 45 million ounces, about 44% of world production.  The chart sure doesn't tell me that!!!  What the chart is really showing, on a percentage basis, is that the communists (centrally planned economies) are producing a greater percentage of total ounces.  But this is a poor graph for telling us that. 
 
Charts Without Any Scale
Take the following chart:

[the Laffer curve]
This is the famous "Laffer curve" which is used as the basis of Reaganomics.  Do a Google search if you want to read more.  Whether or not you agree with supply-side economics, the chart has some major flaws, which are what I want to focus on. 
 
A quick background.  The thought is that there is a magical tax rate at which a government collects the maximum amount of revenue.  If the rate is too high there is a disincentive to work, hence tax revenues drop off.  If the rate is too low, people work as much as possible, but the government is not collecting as much as it could. 
 
What's the problem with the chart?  Well, what is that revenue-maximizing tax rate?  No one knows.  The chart is good for discussing theoreticals but doesn't help us pragmatically.  If you want to show theory then a chart without numbers is OK, and that's really the only time it is acceptable. 
 
I can't help but give you my thoughts here.  First of all, supply-siders will state that lower taxes will always generate more tax revenues, hence we should lower the tax rates.  As you can see from the Laffer Curve, this really is NOT always the case, unless you believe people work harder when rates are lower, hence adding to the revenue pool.  I don't believe this.  Supply-side economics is just another tax hike meant to fool you: a tax is a tax, and Reagan never did lower spending...he actually increased it, causing huge deficits.  Just my opinion, but the best policy hews closer to the Austrians (www.mises.org).   

My Interview Brain Teasers Part 3

This is Part 3 of my series on how I conduct an interview.  This brainteaser is sometimes called the Survival of the People brain teaser.  As I've said before, just regurgitating a brain teaser to an interviewee is really lame.  Instead, I modify the brain teaser to fit the situation.  I like to ask this teaser if the candidate tells me how easily he can adjust to scope creep.  This works best in a team interview where everyone is sitting around a table.  I start by saying: 

Your boss has $100 to give each member of the team.  He sticks a Post-it on everyone's head with a 1 or 0 on it.  Everyone can see everyone else's Post-it but not their own.  Whoever guesses their own number gets their bonus.  We start with the new guy, who announces his guess, and proceed clockwise to each subsequent co-worker around the table.  You can work as a team and assume everyone will execute the plan perfectly.  How can we ensure our boss disperses the most money?  You cannot say anything but "0" or "1" and you cannot give any non-verbal hints.  If your boss thinks you are not following the spirit of the contest he will award no money.  Lastly, your boss gets to hear your plan and adjust requirements at will!  

In every case I've tried this one, the candidate starts by having all co-workers simply guess.  The probability of each person winning (assuming an even, random distribution) is 50%.  Not great.  The boss hears your plan and ensures the numbers are not evenly distributed.  The candidate will usually then say that every other person should simply announce the number of the next person around the table.  This ensures at least 50% success; assuming a random distribution the expected success rate is 75%.  

It is at this point that I tell them the perfect plan will guarantee that everyone except one can get their bonus.  That's a big hint.  I like to see if the candidate now focuses attention on this fact.  The logical conclusion is that of course the "new guy" will have to guess, giving him a 50% chance, but it also implies that he can give valuable information to the next guy.  The next guy in turn should be able to both announce his answer correctly and give helpful information to the next guy, etc.  

At this point the candidate will likely say, "I would alter my voice to indicate that the next person in line is either a 0 or 1.  So if I say ZERO that means I'm a zero and the next person is too, but if I say zero that means I'm a zero but the next person is a 1."  This is a really good answer in my opinion because we are answering for ourselves and giving the next person the answer.  But now the boss changes the requirements so that the next person to answer is chosen at random, so that information cannot be conveyed.  

This makes things much more difficult.  I usually allow the candidate some uninterrupted time to think it through, then I give a little hint: "Notice that every candidate must guess either a 1 or 0, a basic bit.  Could that basic bit also represent something else that is binary?  Let's assume ZERO ALSO EQUALS EVEN and ONE ALSO EQUALS ODD."  Again, I've always gotten stumped looks.  Fine, let's do an example.  Assume 10 co-workers are sitting around the table with this distribution of 1's and 0's:

Candidate Jim 1
Bob 1
Fred 0
Dave 1
Joe 1
Steve 1
Mike 0
Marsha 1
Trisha 0
John 1
  • Candidate Jim does not know his own number, but he can see there are 6 ONES.  Since 6 is an EVEN number, he says ZERO (our code for EVEN).  He happens to be wrong about his own number, but oh well.  
  • Bob now knows Candidate Jim saw an EVEN number of ONES on the nine heads Jim counted.  Bob can see eight of those nine heads (all but his own) and counts only 5 ONES, an ODD number.  He now knows he is a ONE and says so.  
  • Fred knows Jim saw an EVEN count and that Bob just removed a ONE, leaving an ODD count (5) on the remaining heads.  Fred still sees 5 ONES, so he must be a ZERO.  
  • Dave knows Jim's count was EVEN and one ONE (Bob's) has since been removed, so the heads from Dave onward must hold an ODD count.  Dave sees an EVEN number of ONES (4), meaning he is a ONE.  
  • Joe knows the count started EVEN and 2 ONES (Bob's and Dave's) have been removed, leaving 4 from Joe onward, but he sees only 3, so he is also a ONE.  
  • Steve knows 3 ONES have been removed, leaving 3 from Steve onward, but he sees only 2, so he is also a ONE.  
  • Mike knows 4 ONES have been removed, leaving 2 from Mike onward, and he sees 2, so he is a ZERO.  
  • Marsha knows 4 ONES have been removed, leaving 2 from Marsha onward, but she sees only 1, so she must be a ONE, etc.
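
If you want to convince yourself the parity trick works, here is a quick simulation of the table above (my sketch; each deduction uses only Jim's announcement plus the Post-its that person can see):

DECLARE @people TABLE (seat int, val int);
INSERT @people VALUES (1,1),(2,1),(3,0),(4,1),(5,1),(6,1),(7,0),(8,1),(9,0),(10,1);
--Jim (seat 1) announces the parity of the ONES he sees: 0 = EVEN, 1 = ODD
DECLARE @announced int = (SELECT SUM(val) % 2 FROM @people WHERE seat > 1);
--everyone else XORs the announcement with the parity of the ONES they can see
SELECT p.seat, p.val AS actual,
       @announced ^ ((SELECT SUM(q.val) FROM @people q
                      WHERE q.seat > 1 AND q.seat <> p.seat) % 2) AS deduced
FROM @people p
WHERE p.seat > 1;  --all 9 deduced values match the actual values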

Denali New Features - Sequences

SQL Server finally has the concept of a "sequence" that Oracle has had for years.  Here is a post I wrote up years ago regarding a few ways to create your own Sequence-like structures in SQL Server.  A sequence is basically an IDENTITY that can be used across many tables.  These are useful when an ID needs to be unique across tables or when your ORM wants to pre-build its INSERT statements without making multiple db calls.  

SET NOCOUNT ON
CREATE SEQUENCE dbo.Seq
    AS INT
    MINVALUE 1
    NO MAXVALUE
    START WITH 1;
 
SELECT NEXT VALUE FOR dbo.Seq;
SELECT NEXT VALUE FOR dbo.Seq AS AllocatesAnotherNumber;
SELECT NEXT VALUE FOR dbo.Seq AS AllocatesAThirdNumber;
PRINT 'Use the Sequence to insert into a table'
CREATE TABLE #temp (ID int NOT NULL);
INSERT INTO #temp VALUES(NEXT VALUE FOR dbo.Seq);
SELECT ID AS NoteThisIs4 FROM #temp
DROP TABLE #temp;
 
CREATE TABLE dbo.FOO (ID INT NOT NULL PRIMARY KEY DEFAULT (NEXT VALUE FOR dbo.Seq), BAR varchar(200) NOT NULL);
 
PRINT 'How does a SEQUENCE work with transaction control...'
SELECT NEXT VALUE FOR dbo.Seq AS NextValPriortoInsert
INSERT INTO dbo.FOO (BAR) VALUES ('First test')
SELECT NEXT VALUE FOR dbo.Seq  AS NextValAfterInsert
BEGIN TRAN
       INSERT INTO dbo.FOO (BAR) VALUES ('Second test')
       SELECT NEXT VALUE FOR dbo.Seq NextValAfterInsertButBeforeRollback
ROLLBACK TRAN
SELECT * FROM dbo.FOO
PRINT 'note that the row INSERTION rolled back but the ID was still allocated.' 
SELECT NEXT VALUE FOR dbo.Seq AS PostRollbackNextValue
 
PRINT 'note that current_value is really the most recently allocated value in the sequence.  This may or may not be'
PRINT 'intuitive depending on your experience with other Sequence paradigms.' 
SELECT current_value FROM sys.sequences where name = 'Seq'
DROP TABLE dbo.FOO
 
PRINT 'restart the sequence'
SELECT NEXT VALUE FOR dbo.Seq AS NextValBeforeReset
ALTER SEQUENCE dbo.Seq RESTART WITH 1;
SELECT NEXT VALUE FOR dbo.Seq AS NextValAfterReset
 
PRINT 'we can get a range of IDs using a nifty MS-supplied stored proc.  Very useful for caching IDs for an ORM'
PRINT 'note we must use sql_variant here'
DECLARE @first_value sql_variant, @last_value sql_variant
EXEC sys.sp_sequence_get_range
    @sequence_name     = 'Seq',
    @range_size        = 1000,
    @range_first_value = @first_value OUTPUT,
    @range_last_value  = @last_value OUTPUT;
 
SELECT convert(int,@first_value) 'first_value' , convert(int,@last_value) '@last_value'
SELECT NEXT VALUE FOR dbo.Seq AS 'NextVal'
 
IF EXISTS (select * from sys.sequences where name = 'Seq')
BEGIN
       DROP SEQUENCE dbo.Seq
END;
 
SET NOCOUNT OFF
-----------
1

AllocatesAnotherNumber
----------------------
2

AllocatesAThirdNumber
---------------------
3

Use the Sequence to insert into a table
NoteThisIs4
-----------
4

How does a SEQUENCE work with transaction control...
NextValPriortoInsert
--------------------
5

NextValAfterInsert
------------------
7

NextValAfterInsertButBeforeRollback
-----------------------------------
9

ID          BAR
----------- ----------------
6           First test

note that the row INSERTION rolled back but the ID was still allocated.
PostRollbackNextValue
---------------------
10

note that current_value is really the most recently allocated value in the sequence.  This may or may not be
intuitive depending on your experience with other Sequence paradigms.
current_value
-----------------------------------
10

restart the sequence
NextValBeforeReset
------------------
11

NextValAfterReset
-----------------
1

we can get a range of IDs using a nifty MS-supplied stored proc.  Very useful for caching IDs for an ORM
note we must use sql_variant here
first_value @last_value
----------- -----------
2           1001

NextVal
-----------
1002


Denali New Features

The next version of SQL Server (code named Denali) was released earlier this week as CTP1.  You can download it here.  As I get time to play with it over the next few weeks I'll post some thoughts here, and I'll keep everything indexed here.  For starters, BOL is now almost totally online; here is the link to the online Books Online.  

One thing that I think really sets SQL Server apart from Oracle is its tracing facilities, namely Profiler.  Whenever I get my hands on a new version of SQL Server I like to first see what is new in Profiler.  In Denali we now have Plan Guide Successful and Plan Guide Unsuccessful events.  These are used to determine whether your carefully constructed plan guides are really being utilized correctly; if they are not, the events can help you determine the cause.  

Posts to come:


My Interview Brain Teasers Part 2

Here's another brain teaser I like to use during interviews...I call it "rejection letters."  This is very similar to the Monty Hall Problem.  I like to ask this brain teaser when a candidate is cocky or claims to never make mistakes or to never need assistance.  Or I'll ask the candidate a question like, "When you have a difficult problem and you just can't find a solution on your own, when do you seek assistance?"  The question and its answer are meaningless to me; I'm simply using this as a segue to the problem:

I think you are a strong candidate and I'd like to offer you a job.  Assume there are 3 envelopes in front of you, 2 have a rejection letter and 1 has the offer for employment.  Assume you really want to work here.  Pick an envelope.  Now assume I know which envelopes have the offer and rejection letters.  I'm now going to show you that one of the other envelopes is a rejection letter.  I'm going to offer you a chance to change your envelope choice, is that a wise idea and why?  

I give no hints; remember, the goal is to have the candidate seek assistance.  Very few people can correctly explain whether switching is wise at this point, even if they know the Monty Hall Problem.  I ask for the probabilities of the outcome and I've never yet heard an answer that wasn't 1 in 3 for both choices.  I continue...

New game: assume I have 4 envelopes, 3 of which are rejection letters.  You pick one, I show you 2 others that are rejection letters, and I offer you the chance to change your choice.  Do you, and why or why not?  

Again, I ask for probabilities.  I've always still heard 1 in 4.  

I ask again using 5 envelopes/4 rejection letters.  By now every candidate is visibly anxious, probably trying to figure out what is going on.  Still I give no hints, and still I always get the 1 in 5 answer.  

I continue to 6 envelopes/5 rejection letters.  If the candidate hasn't asked for assistance or has not spotted the trend I go right for 100 envelopes/99 rejection letters.  By now everyone has spotted the trend and can get the answer: if you switch your choice your odds jump immediately from 1 in 100 to 99 in 100, since your original pick wins only if it was right in the first place.  But every candidate has failed in my opinion...no one has asked for help early.  
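
If the 99-in-100 claim seems suspicious, here is a quick simulation sketch; switching wins exactly when the first pick was wrong:

DECLARE @trials int = 100000, @wins int = 0, @i int = 0;
WHILE @i < @trials
BEGIN
       DECLARE @offer int = ABS(CHECKSUM(NEWID())) % 100;  --envelope holding the offer
       DECLARE @pick  int = ABS(CHECKSUM(NEWID())) % 100;  --candidate's first pick
       --98 rejections are revealed; switching takes the one remaining envelope,
       --which holds the offer unless the first pick was already correct
       IF @pick <> @offer SET @wins = @wins + 1;
       SET @i = @i + 1;
END
SELECT @wins * 100.0 / @trials AS SwitchWinPct;  --approximately 99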

How to Deceive with Charts and Graphs

(Graphic courtesy of http://xkcd.com/795/)

There are three kinds of lies: lies, damned lies, and statistics.

This quote is sometimes attributed to Mark Twain or Benjamin Disraeli, a famous PM of 19th-century Great Britain.  My next few blog posts will show, using real-world examples, how misrepresenting data occurs all the time, how to spot it, and how to avoid it.  Most of the examples involve finance and investment since this is near and dear to my heart.  Some examples use charts and numbers from load tests we performed using LoadRunner. 

For today's post we will focus on...
 
Using Statistics Inappropriately or For Qualitative Analysis
Often you will read a newspaper article about someone killed in an auto accident.  If the person was not wearing a seatbelt this will likely be pointed out in a fashion that uses the fact as evidence of the value of wearing a safety belt.  Usually a quote from the local rescue squad chief states that the victim's chances of surviving would have been higher had he been using a safety belt.  Often the quote is presented in such a way that the reader is led to believe the driver who doesn't wear a belt is more careless than the driver who does wear one.
 
But does the data support this?  Here is a case where raw numbers/statistics can't tell you that information reliably.  We can't know whether the driver was normally careless or not.  Let's say you get into your car on a rainy night and discover your seatbelt doesn't work for some reason.  What would you honestly do?  Would you say that since the seatbelt doesn't work you will not drive?  Or are you more likely to say that you will drive more carefully?  If you say the latter then you are implying that you drive less carefully when you do wear your seatbelt.  In that case I would argue a person like you should NEVER wear a safety belt.  
