DaveWentzel.com            All Things Data

October 2010

Physical Data Modeling AntiPatterns - dynamic data structures

This is part two of my series on Physical Data Modeling AntiPatterns

I've written extensively on EAV Models previously.  As a refresher, in an EAV model the table has 3 cols, one for the "entity", one for the "attribute", and one for the "value".  Something like this in a supermarket model:

 

entity         attribute             value
canned good    productname           green beans
canned good    productdescription    Canned vegetable, not at all tasty
canned good    productmanufacturer   DelMonte
freshfruit     productname           bananas
freshfruit     productdescription    A Tasty Fruit
freshfruit     productmanufacturer   Chiquita

EAV models are touted as being adaptive to evolving design.  No kidding?  When you only have 3 cols you can pretty much model everything.  The problems are many...

  1. I lose data types (everything is a varchar)
  2. Data constraints are impossible without complex trigger logic
  3. The simplest queries require many, MANY lines of SQL (a sketch follows this list)
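
To illustrate point 3, here is a minimal sketch (the ProductEAV table and the attribute names are hypothetical and just mirror the example above).  Against a conventionally modeled Product table, "show me each product's name, description, and manufacturer" is a one-line SELECT; against an EAV table the same question takes a self-join (or PIVOT) per attribute:

SELECT  e.entity,
        pn.value AS ProductName,
        pd.value AS ProductDescription,
        pm.value AS ProductManufacturer
FROM    (SELECT DISTINCT entity FROM dbo.ProductEAV) e
LEFT JOIN dbo.ProductEAV pn ON pn.entity = e.entity AND pn.attribute = 'productname'
LEFT JOIN dbo.ProductEAV pd ON pd.entity = e.entity AND pd.attribute = 'productdescription'
LEFT JOIN dbo.ProductEAV pm ON pm.entity = e.entity AND pm.attribute = 'productmanufacturer';

Add a fourth attribute and the query grows by another join; add fifty and it becomes unmanageable.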

Not all EAVs are evil.  There may be cases where a system needs to evaluate a property at run-time vs compile-time, meaning I may not have all of the data I need to compile correctly.  That's certainly fine, but we are speaking in terms of a limited amount of known data that would be modeled in an EAV, not transactional data.  

Don't fall for the argument of "unlimited flexibility" with an EAV solution.  It just won't work.  Read my older blog post for complete details. 

Physical Data Modeling AntiPatterns - the misuse of BIT flags

This is part one of my series on Physical Data Modeling AntiPatterns

Joe Celko refers to this as attribute splitting.  In this situation we are trying to model "states" of an entity/object in terms of purely binary relationships (on/off, yes/no, true/false).  In many cases the states of a given entity are not really binary at all and when you use BIT flags you are really short-changing your design.  The example most often cited for misuse of BIT flags is a column to denote SEX (ie, male or female).  If you've given blood recently at a Red Cross center you know that they no longer ask you if you are male or female, rather, what is your current sex and what sex were you born?  And yes, you can identify yourself as something other than male or female (sorry, I didn't ask what other choices there were).  

It is common to model a Person as a specific case in a generic Entity table, which means a corporation would have an entry as well.  Assuming we leave the SEX col in the Entity table we need some value other than male/female for the corporate entity, and NULL is frankly frowned upon.  ISO 5218 defines the valid sex codes as 0=unknown, 1=male, 2=female, 9=not applicable.  My gut tells me this list will be expanded in the next 20 years (see the Red Cross problem above).  So, if ISO compliance is a goal, a BIT wouldn't work in your model.  
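
If ISO compliance is the direction you go, here is a minimal sketch of what that might look like (the table and column names are hypothetical; the codes are the ISO 5218 values above):

CREATE TABLE dbo.SexCode (
    SexCode     tinyint     NOT NULL PRIMARY KEY,
    Description varchar(20) NOT NULL
);

INSERT INTO dbo.SexCode (SexCode, Description)
VALUES (0, 'Not known'), (1, 'Male'), (2, 'Female'), (9, 'Not applicable');   -- multi-row VALUES requires SQL 2008+

-- The generic Entity table then references the lookup instead of carrying a BIT:
-- ALTER TABLE dbo.Entity ADD SexCode tinyint NULL
--     CONSTRAINT FK_Entity_SexCode REFERENCES dbo.SexCode (SexCode);

A new code in a future revision of the standard then becomes a one-row INSERT rather than a schema change.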

What appears to be a binary relationship usually is not.  An invoice status may be Open/Closed upon first look, but later other statuses may be needed.  It's best not to use a BIT unless we are positive that there is a true binary relationship.  

A last problem with BIT flags is when modelers take an entire series of binary relationships and shoehorn them into a single bitmask column, where a query would need bitwise operations (an AND against each mask) to decode the values of interest.   My problem with this solution is that many people don't fully understand bitwise calculations.  Those people will write functions and views to decode the values more easily.  In that case, why not just create a simple series of columns at the start?  I have yet to see any proof where a bitmasked column was absolutely required for performance reasons.  A quick sketch of the difference follows.  
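
Here is a minimal sketch (the Orders table, the flag values, and the column names are all hypothetical):

-- Hypothetical bitmask: 1 = rush, 2 = gift wrap, 4 = backordered, all packed into OrderFlags.
SELECT  OrderID,
        CASE WHEN OrderFlags & 1 = 1 THEN 1 ELSE 0 END AS IsRush,
        CASE WHEN OrderFlags & 2 = 2 THEN 1 ELSE 0 END AS IsGiftWrap,
        CASE WHEN OrderFlags & 4 = 4 THEN 1 ELSE 0 END AS IsBackordered
FROM    dbo.Orders;

-- Versus simply declaring the flags as separate columns up front, which any query can read directly:
--   IsRush bit NOT NULL, IsGiftWrap bit NOT NULL, IsBackordered bit NOT NULL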

I'm not saying to totally avoid BIT cols, just always ask yourself, "is there any way there might be a third valid value in the future?"  If the answer could at all be YES, then don't use BIT.  

Libertarians and traffic lights

When I tell people I'm a libertarian they usually respond, "oh, one of *those* people that are against traffic lights and speed limits."  That disappoints me because there is so much more to it than that.  The fact is, there are various flavors to libertarianism, as with any political group.  Most libertarians can agree on the basics though:

  1. self-ownership...each person has control over his own body.  
  2. homesteading/"first use" principle...the first person to make use out of a natural object then has ownership rights to it.  Everything is property and property rights must be respected. 
  3. the non-aggression axiom...aggression and coercion against anyone is wrong.  It violates self-ownership and homesteading.  Anarcho-capitalists, myself included, take this to the logical conclusion that a government is really just an entity with the legal authority to commit aggression.  Because of that, government is subject to corruption and violates the non-aggression axiom.  

These are sometimes called natural rights because, frankly, they just make sense to most humans.  Non-believers say that those points are great, but we need government to adjudicate when disputes over property rights occur.  But we really don't because all interactions would naturally occur via contract and contract disputes can be solved using private courts.  There is obviously much more to libertarianism than this.  

When people frame libertarians as "those people against traffic lights and speed limits" I get annoyed.  These issues are so minor in our society that I frankly don't even care.  Well, I recently saw a great set of videos from the UK that prove, using real-world experiments, why all of the basic libertarian principles I outlined above really work.  Traffic is a nightmare in UK urban areas.  So, some towns tried something radical...turning off all of the traffic lights.  Watch the before and after videos.  Listen to people's responses.  Listen to the government traffic officials who admit that this works.  

It turns out the abolition of traffic lights really does portray libertarianism in a microcosm.  You probably already have a few doubts that this works...let me address them before you watch the videos, which show those doubts to be unfounded.  

  1. Pedestrians will be killed or will never be given the right of way to cross busy intersections.  It turns out that *most* people naturally are cooperative and respect the inherent "first use" principle of the situation...namely, I was here first, I'll utilize the intersection first.  
  2. People will drive faster without lights.  Nope.  Sorry.  When there are traffic lights people focus on the light at the peril of the pedestrians, bicyclists, and fellow motorists on the road.  We've all hit the gas to get through an intersection where the light was "pink".  
  3. People are naturally cooperative, but government-installed traffic control devices create a priority system that is unnatural to people and violates the principles above.  People don't like it and don't really mind violating traffic laws.  Remove the coercive laws and natural cooperation takes over.  
  4. The disabled/blind are totally screwed.  Just watch Part 2 at 3:40.  A blind man expresses why traffic lights DO NOT WORK for the disabled.  

There are 2 items that totally favor the abolition of government traffic devices:

  1. Pollution...sitting in traffic is not great for the environment.  
  2. Look at the money saved on traffic control equipment and signage and energy.  We can make much better use of that money.  

 

Part 2...at about 1:12 you'll see one of the goofiest traffic patterns/intersections you'll ever see in your life.  Great Britain is known for its "roundabouts", or, as we call them in the States, traffic circles, but that road is crazy.  Clearly it was designed with traffic lights in mind, yet with the lights switched off you can clearly see people cooperating without any issue.  Accidents will still happen, as will disputes regarding liability.  But is that so much different from the current system?  

Physical Data Modeling AntiPatterns Series

"Patterns" are commonly used solutions to a given problem.  "Anti-patterns" are the problems with commonly used solutions.  I'm not going to cover why 5NF and Boyce Codd NF are often bad choices sometimes.  And I'm not going to cover why 1NF should be shunned in favor or 3NF for transactional systems.  With antipatterns the goal is to show why common solutions might not be ideal in many situations.  This is by no means a complete list of data modeling anti-patterns, but it's a good start.  In no particular order I'm going to cover...

  1. the misuse of BIT flags
  2. "dynamic" data structures
  3. MUCK table designs
  4. XML as a persistence mechanism
  5. Column values that have different meanings based on context
  6. Circular references

As you review these posts you'll note a recurring theme to these anti-patterns.  Again and again we see cases where a single attribute may have different meanings.  I think primarily these issues occur for two reasons:

  1. data modelers think they are smarter than they really are and think they have discovered a new pattern that no one else has used.  They then apply it en masse without properly researching the idea or thinking of where it will break down. 
  2. the data modeler does not have a firm understanding of what is an "entity", an "attribute", and a "value".  

More to come...

Windows 7 Annoyances

I've been using "newer" server OSs for awhile now, but I've always asked for XP on my work desktop/laptop.  Why?  It's familiar and it does what I need.  I don't need Aero Glass, I don't need UAC, I don't need all of this "cute" stuff on the newer desktop OSs...I need to be able to get work done.  But I've needed to support customers running Windows 7 and SQL 2008 which means I really need to understand UAC a little better, x64 (as far as registry setting differences, etc), and all of the new methods of doing simple tasks that have changed (such as adding firewall port exceptions).  So now I have Win7 and here is my list of complaints (so far):

Quick Launch is Gone

I hate the "pin to taskbar" feature.  I don't want everything scattered around.  Why remove the QuickLaunch toolbar?  When something works, don't remove it Microsoft!!! According to Help and Support on my Win 7 installation: "The Quick Launch toolbar isn't included in this version of Windows. To open programs quickly, you can pin them to the taskbar. For more information, see Pin a program to the taskbar."  That's a curt answer.  Here's how you add it back.  Right click the taskbar, New toolbar, then paste in %appdata%\Microsoft\Internet Explorer\Quick Launch.  Now you just need to position it to your liking.  I backup this folder since I keep various batch files in that folder as well as standard shortcuts.  

Simple Folder Shortcuts aren't simple anymore

I used to be able to create folder shortcuts to all of the folders I work with daily (source code folders for each branch, documentation folder, etc).  That capability seems to be gone.  The workaround is to create a shortcut that points to %windir%\explorer.exe /e,"path to your folder".  

No Telnet Client!

The telnet client is not installed by default.  OK, maybe I'm the last person in the world to use the telnet client, but why would you remove it from your product?  I use it to ensure firewall ports are open (telnet <machine> <port>).  It can't possibly be that the telnet client is insecure?  It's a client program.  

User Account Control

This is usually first on everyone's annoyance list.  Not for me; it's closer to 4th on my list.  But there is so much here to cover...suffice it to say, it's a productivity drain.  I don't mind sacrificing productivity for increased security, but it's the inconsistencies in UAC that offend me.  Case in point: I'm a vbscript guy.  I have vbs files for everything and some of them make changes to items that I wouldn't normally think would require elevated privileges, such as changing system environment variables.  However, if you change a system environment variable without being elevated, it will fail.  Fine, I don't mind running elevated.  So I try to right-click my vbs file to choose "Run as Administrator", but that option isn't available.  My emergency alternative is to open an elevated command prompt and run "cscript file.vbs blah blah", which is cumbersome.  Since then I found a cool reg file that helps a lot.  You can change it to add Run As Administrator to any file type.  

Open Command Prompt Here

A simple reg hack added this to all previous OS's...but no more.  Here is a reg file that handles that, plus opens the prompt with elevated privileges.  Functionality like this should not require a reg hack.  

Power User Settings

I'm not sure what to call this so I call it "Power User Settings".  Most people in IT prefer to have their environments set up a certain way, and those preferences are nearly universal.  It should be easy to enable/disable them quickly.  For instance, everyone I know has these settings turned on, none of which are the default:

  1. Open Command Prompt Here (see above)
  2. Show hidden files, folders, and drives
  3. Don't hide empty drives...they are empty because I wish to fill them, so don't hide them on me.  
  4. Don't hide extensions for known file types.  
  5. Don't hide protected OS files
  6. View mode should be detail, not list or thumbnail.  
  7. These settings should apply to all folders.  
  8. I should be able to quickly apply these settings to any computer I am working on, and then revert them to previous settings quickly, for instance if I am supporting another user.  Perhaps this setting should be found when I right click the Start Button.  


"CPU" terminology demystified

Although I understand the differences in these "terms" I often find myself using the wrong term in a conversation and have to double-back and correct myself.  

Below are the terms, their definitions, and how each applies to SQL Server.

Socket: the slot on the physical circuit board (motherboard) that houses a microprocessor. 

SQL Server licensing is concerned only with number of sockets.  If anyone tells you something different, well, they are wrong.  Want proof?  Quickly install SQL Express on your laptop (which likely has dual core).  As we all know, Express supports "only one processor".  But note in your SQL Server properties that multiple processors are actually being utilized.  Here is a brief screenshot of my new Intel i7 x64 laptop.  Note it is a single socket, dual core chip with 4 total logical processors. 

 

 

 

Note that taskmanager shows 4 logical processors: 

And finally notice that my SQL Express installation recognizes and is using all 4 logical processors. 
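
If you'd rather verify this with a query than with screenshots, the same information is available from a DMV on SQL 2005 and later (hyperthread_ratio is the number of logical processors exposed per physical socket, so the division below approximates the socket count):

SELECT  cpu_count,                                   -- logical processors visible to SQL Server
        hyperthread_ratio,                           -- logical processors per socket
        cpu_count / hyperthread_ratio AS sockets     -- approximate number of sockets
FROM    sys.dm_os_sys_info;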

If you want to save yourself licensing fees, opt for hardware that supports the largest number of cores per socket.  In a previous blog post I outlined exactly how this works. 

Core: an actual processing unit on the physical chip.  SQL Server really doesn't care about cores. 
Logical core: this is hyperthreading.  A physical core can present multiple logical cores.  Note that my laptop (see the screenshots above) has hyperthreading enabled. 

When hyperthreading first came out I found it did not produce any increase in performance, so I always disabled it.  YMMV, because everyone's workload is different.  For the workloads I had, the constraint was not concurrency (and thus the number of executing threads), but rather terrible code executing against improperly normalized data structures (not of my doing, of course).  So I made the conscious decision to forgo hyperthreading and its associated overhead in order to get as many base MHz attacking each thread as possible.  

With Nehalem, it is well documented that hyperthreading has come a long way, and the above logic for disabling it is likely no longer valid.  

You can have multi-core processors with or without hyperthreading.  Some chips support both.


SQL Server Connectivity Troubleshooting

Overview

Ask many DBAs to perform basic SQL connectivity troubleshooting and they run into problems they can't seem to solve.  Some examples: if SQL Server listens on a non-standard port, how do you connect to it if the SQL Browser is not working?  How can you tell quickly if a firewall is blocking the SQL port?  How can you do any kind of SQL connectivity troubleshooting without using Management Studio?  This is the process I follow to diagnose and correct almost any SQL Server connectivity problem.  The key is to always follow the "stack diagnosis" process.  Think of SQL communication as a "stack" and ensure each layer, starting at the lowest level, is working before moving up.  The SQL communication stack is very similar to the OSI model.  Don't focus on specific error messages, since SQL Server and its drivers will many times throw the *exact* same error message under multiple, different circumstances, so any given error message may have multiple solutions.  Also, different releases of the various tools in the stack change their error messages, rendering troubleshooting by error message much more difficult.  For instance, SQL Server "cannot connect" messages in SQL 2000 are very basic; in 2005 the Surface Area Configuration utility was introduced (and subsequently removed in 2008 because it was useless), which caused the messages to change.  Lastly, if you understand the communication stack you will understand why the various errors are thrown.  This process will solve almost any connectivity problem.  

The Stack

Here is the basic SQL communication stack.  This is customized for the application I work with, which runs on SQL Express.  SQL Express, by default, is installed to listen on a dynamic (effectively random) port rather than 1433.  Further, it does not install with SSMS or any of the other graphical tools.  

 

This is a more detailed flow chart with actual steps.  You start by working down the left hand side of the chart.  The steps are organized based on where you perform the step (the client or the server).  The color coded "home plate" icons are links to the expanded documentation for that section found further below.  

 

Connection Monikers

Before getting into the details it's important to understand what a connection string looks like in SQL Server.  I call these strings "monikers."  It's just the naming convention required to connect under various circumstances.  For most people the servername is the only component that is required for a connection.  So if my server is called DAVE then my connection string's "Data Source" will be DAVE.   

It's even easier if you need to connect to your SQL Server running locally on your machine.  Simply type (local) or a single dot (.).  So if I am using a stand-alone app I can connect using 

(local) 

or 

.

Note that to connect to a named instance I simply append a backslash and the instance name.  So if my instance on DAVE is called STP my connection will look like this:

DAVE\STP

If connecting locally it will be one of these:

(local)\STP

.\STP

With a non-standard port (not 1433) and SQL Browser not running, use this moniker:

SERVER\INSTANCE,port

So, to connect to the server DAVE, named instance STP, on port 1533 the string passed to sqlcmd/osql is:

DAVE\STP,1533

Note that if I'm connecting locally I don't need to specify a port since communication will be via Shared Memory, not TCP/IP.  

 

SQL Ports

Let's go through the "home plate" icons in order.  When you can't connect to your SQL Server always ping it first.  After that make sure you can telnet to the actual port SQL is listening on.  To determine the SQL port, do this:
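
If you can get to a query window locally on the server, one common (if undocumented) way on SQL 2005 and later is to read the startup messages in the current error log, which record every address and port the instance is listening on:

EXEC sys.xp_readerrorlog 0, 1, N'Server is listening on';

Alternatively, SQL Server Configuration Manager shows the port under SQL Server Network Configuration | Protocols for <instance> | TCP/IP | IP Addresses (look at the IPAll section).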

 

Using Telnet and Firewalling

Now that we know the port let's attempt to telnet to that port from our client:
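Using the example values from earlier (server DAVE, port 1533), the check is simply:

telnet DAVE 1533

If the port is open you'll get a blank screen (Ctrl+] and then quit to exit); if it's blocked, the attempt will hang and eventually time out.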

If we can't telnet to the server then the port is firewalled.  On a corporate network that firewall could be anywhere and you will likely need your network guys to help you troubleshoot further.  But first I would check that the server has its firewall open for the SQL port.  If you use Windows Firewall the process is simple:
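
If you prefer the command line over the firewall GUI, on Windows 7 / Windows Server 2008 R2 the same exception can be added from an elevated command prompt with netsh (the rule name is arbitrary and 1533 is just the example port from above):

netsh advfirewall firewall add rule name="SQL Server TCP 1533" dir=in action=allow protocol=TCP localport=1533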

 

Using OSQL

After that I always attempt a connection with osql if I have it handy.  osql no longer ships with SQL Server, and that is unfortunate since it only requires a single .rll file for basic connectivity.  It requires no installation of .net the way sqlcmd does (so we can guarantee there is no .net driver issue at play); it simply requires MDAC, which is installed and configured by default on all MS OS's since at least XP.  If you don't have osql handy then sqlcmd is your next logical choice.  In that case, if the SNAC driver is corrupt, or anything in the .net stack is corrupt, you won't really know, but that's the best you can do right now.  osql commands are very similar to sqlcmd commands.  You can download sqlcmd using the Feature Pack for SQL Server.  
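
For reference, here is the kind of smoke test I run with either tool, reusing the example moniker from above (-E means Windows authentication; use -U and -P instead for SQL authentication):

osql -S DAVE\STP,1533 -E -Q "SELECT @@SERVERNAME"
sqlcmd -S DAVE\STP,1533 -E -Q "SELECT @@SERVERNAME"

If either command returns the server name, the network, port, and driver layers below your application are all working.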

 

Driver Issues

At this point, any remaining connectivity issues should be from within your app which means you probably have a driver issue.  Troubleshooting this could be another week's worth of blog posts.  Let's just stick to the basics...

  1. Determine which driver your app is using: SNAC9, SNAC10, or plain ODBC.  
  2. On the client...Start | Run | odbcad32.
  3. Switch to the Drivers tab and note what drivers are available.  If your app's required driver isn't present then you need to install it.  If it is present, then we need to test a connection.
  4. On the System DSN tab, click Add and then select the driver your application requires.  
  5. The Name can be anything.  The Server will be your proper connection moniker; since I use a named instance on the default 1433 port, I specify the moniker without the port.  Click Next.  
  6. Choose your authentication method and ensure the last checkbox is SELECTED.  Click Next.  
  7. If the connection test completes without error then you are connected.  

 

 

 

 

Help, I still can't connect

If your application still cannot connect then it could be the application did not install correctly.  For instance, if your application uses a custom third party component for connectivity (such as an ORM tool) then it could be masking your connectivity issues.  I can't help you with that.  

Summary

Hopefully troubleshooting connectivity issues will be less painful for you now.  Remember to always follow a process when diagnosing where the issue lies.  I prefer using my stack diagnosis method where I start at the basic physical layer (am I connected to the network?) and move up to the final steps involving the drivers.  This method has always worked for me.  


Hyperinflation

This is an old BBC documentary that outlines the hyperinflation of Weimar Germany in the 1920's.  If you are short of time and just want the fun part, it starts around 3:00 in the second clip.  If you watch the whole piece try to substitute Germany's problems with the current problems we face in our modern world.  Maybe I'm crazy but I can see that hyperinflation could be just around the corner for us too.  Everyone on CNBC and in WSJ will tell you that is an impossibility, that modern central bankers understand monetary policy better than the Germans and that adequate controls are in place to avoid that outcome.  Then why is there hyperinflation in Zimbabwe?  

I especially like how the piece covers how Hitler came to power after the hyperinflation episode.  It would seem to me that something similar may happen to us if the economy does get bad enough.  Will we have a dictator that murders Jews?  Probably not, but I can see more war being the solution.  

Lastly, if you want a really short, yet very thorough explanation of post-WWI world finance that explains the flow of money and hyperinflation in a little more detail...try Garet Garrett's A Bubble That Broke the World.  But when you read it, keep in mind it was written in 1932, because he perfectly foreshadows WWII and the rise of a dictatorial Hitler as the only way for Germany to get out of its WWI war debts.  It's just spooky.  You will like this short book.  I recommend all of his works; they are fabulous, both fiction and non-fiction.  If you ever read Atlas Shrugged you will especially like The Driver, which many have said is the source and inspiration for Ayn Rand's masterpiece.  Frankly I think Garrett's work is far superior and is a much shorter book.  It also teaches you how Wall Street finance really operates behind the scenes.   

 

 

 

Rebuild Indexes Part 2

In my Rebuild Indexes blog post a few weeks back I outlined a few cases where an ALTER INDEX REBUILD might be preferable to ALTER INDEX REORG.  I won't rehash the details here.  I often get accused of writing posts without supplying repro scripts to prove my points.  First, I don't have the time; second, I don't think it's always necessary to prove something.  Sometimes common sense should dictate the point.  

I mentioned in that post that page splits may be one reason why a REBUILD is preferable to a REORG, but didn't provide scripts to prove it.  To be clear, the REORG will put the pages in logical order, but will not necessarily compact them.  This is important because we want the pages to be as compact as possible so that the same data occupies fewer pages and more of it fits in buffer memory.  Well, it turns out a blogger just last week proved the REBUILD vs REORG argument with some actual scripts.  Here is the post.  Here are 3 important take-aways from all of this:

  1. There is no DMV to maintain page split history so the blogger created a routine to occasionally log page splits and allocation information to a history table so we know definitively if a REBUILD will help us.  
  2. Don't think that just because you REBUILD instead of REORG you have solved your problems.  The page splits will still occur, meaning that over time all of that wasted page space will return.  You should consider whether there is a better way to change your schema to reduce the page splits.  
  3. FILLFACTOR is your friend here (a sketch follows this list).  
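
Here is a minimal sketch of how I would check for the wasted space and then rebuild with a fill factor (dbo.MyTable is hypothetical; avg_page_space_used_in_percent is only populated in SAMPLED or DETAILED mode):

SELECT  index_id,
        avg_fragmentation_in_percent,
        avg_page_space_used_in_percent      -- page compactness; a REORG won't necessarily improve this
FROM    sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.MyTable'), NULL, NULL, 'SAMPLED');

ALTER INDEX ALL ON dbo.MyTable REBUILD WITH (FILLFACTOR = 90);   -- leave ~10% free per page to absorb future inserts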

