DaveWentzel.com            All Things Data

March 2014

Git Gotchas for the Savvy Svn'r

In the fourth post in my git series I want to cover some things that caused me issues migrating from svn to git.  Sometimes you need to think differently with git to accomplish tasks that you would've done differently with svn.  In some cases, not doing these things appropriately can really muck up your git repo.  This can be a blessing or a curse.  Git is much more flexible than svn, so you should expect that if you do things incorrectly you might blow something up.  But since this is a VCS, the blessing is that everything can be undone if you know how.  

Git pull/fetch/clone/checkout

This caused me the most grief.  Let me cover the purpose of each command:

  • fetch updates your local copy of a remote branch but never changes any of your own branches or working copy.  You must have a "remote" already added (git remote add origin).  If you use git fetch, remember to merge.
  • pull is a fetch with a merge.  This is the equivalent of a "svn update."  You will see the conflicts that will require a resolve.  This requires a remote already configured.  
  • clone is a combination of "git remote add origin" and "git pull".  In one command it will do everything.  The remote will be called "origin".  This is the equivalent of "svn checkout".  
  • checkout is a git command that is nothing like "svn checkout".  git checkout switches your local files to a different branch in a repository you already have.  You are "checking out" a different codebase than the one you are working on.  
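The mapping can be seen end-to-end with a throwaway local "server" repo standing in for a real remote (all repo and file names below are invented for illustration):

```shell
# Create a bare repo to play the role of the remote server.
set -e
work=$(mktemp -d)
cd "$work"
git init --bare server.git

# clone = "git remote add origin" + fetch + checkout in one step
# (roughly "svn checkout").  The remote is automatically named "origin".
git clone server.git local
cd local
git config user.email you@example.com
git config user.name "You"
branch=$(git symbolic-ref --short HEAD)

echo hello > readme.txt
git add readme.txt
git commit -m "first commit"
git push origin "$branch"

# fetch updates origin/<branch> but touches neither your branches
# nor your working copy -- remember to merge afterwards.
git fetch origin
git merge "origin/$branch"

# pull = fetch + merge in one step (roughly "svn update").
git pull origin "$branch"

# checkout switches your working copy to another branch -- nothing
# like "svn checkout".
git checkout -b experiment
```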

What is "origin" and "origin/master"?

As mentioned in the last section, "origin" is the remote origin of your local repo.  You don't need to use the word "origin" but that is the convention that most developers use and it indicates that this is the "source" of the local repo.  "origin/master" is simply the master branch of the origin.

Branches are more than just "folders"

In svn a branch and a tag are really implemented as "folders".  The process of branching and tagging is very fast in svn, but doing a "svn checkout" on an existing branch/tag can be time-consuming depending on size.  git branches (and tags) do not show as filesystem subdirectories.  They are instead displayed as a kind of "third axis" that I explained in a previous post.  I love this concept, but it causes grief for those who learned VCS on svn, TFS, or Perforce.  It can be confusing, and it leads noobs to create branches by copying source files from a root folder to a new folder they just created, trying to make the folder structure look like what they would expect in svn.  That won't work properly in git.  

git subfolders do not function like svn subfolders.  Design your git repos accordingly

It's not uncommon for a large organization to have one HUGE svn repo where each root-level subfolder is a separate project.  If you've ever looked at your svn folder structures, every subfolder has its own .svn hidden folder (at least prior to svn 1.7, which moved the metadata to a single .svn folder at the root).  This means that you can "svn checkout" at any subfolder in a HUGE tree and get just that folder recursively.  This saves a bit of time.  

You need to unlearn that behavior in git.  In git you get the entire repo...there is no "git clone" or "git fetch" of a subfolder.  

So, when migrating from svn to git you should also consider restructuring your repos if you have one big enterprise repo with lots of subprojects.  This leads to the misperception that git can't handle large repos.  That isn't so.  I use very large repos every day and I think they perform better than svn (anecdotal only).  I've worked with svn repos where a "svn checkout" at the root would take hours.  The solution there was to check out only the "subprojects" you need.  In git the equivalent is to have separate repos that you clone.  

But my organization absolutely MUST have subfolders-per-project and one giant repo

That seems doubtful to me but you can still do it.  For instance, if you have shared code classes you probably do want to have that shared code in only one place.  In these cases the git feature you want to research is "git submodule".  git submodule allows a foreign repository to be embedded in a local repo.  In the case of the organization with one giant code repo you would have a master git project and then use git submodule for distinct sub-projects under that.  This performs better and is logically more consistent for many.  This allows your organization's "code portfolio" to exist in one location.  
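A minimal sketch of the giant-repo-with-submodules layout, using two throwaway local repos (names invented; real setups would use http/ssh URLs, in which case the protocol.file.allow override below isn't needed):

```shell
set -e
work=$(mktemp -d)
cd "$work"

# A stand-alone repo playing the role of the shared-code project.
git init shared-lib
cd shared-lib
git config user.email you@example.com
git config user.name "You"
echo 'shared code' > util.txt
git add util.txt
git commit -m "shared v1"
cd ..

# The "portfolio" repo embeds shared-lib as a submodule, pinned to
# a specific commit of the shared code.
git init portfolio
cd portfolio
git config user.email you@example.com
git config user.name "You"
echo readme > readme.txt
git add readme.txt
git commit -m "initial commit"

# Recent git blocks file:// submodule clones by default, hence the
# one-off override; it is only needed for this local-path demo.
git -c protocol.file.allow=always submodule add "$work/shared-lib" lib
git commit -m "embed shared-lib as a submodule"
git submodule status
```

The portfolio repo records only which commit of shared-lib it uses; you "take on" a newer version deliberately by updating the submodule and committing the new pointer.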

There are other great uses for git submodule.  The most popular is when you want to integrate some open source code into your existing software.  If you use a submodule then you can maintain your codebase's version separate from the open source module's code.  You can "take on" new versions of the module when you want to.  

If you've ever had to participate in a "Black Duck analysis" then you'll recognize why some companies INSIST on one central software sourcecode repo/portfolio.  git submodule is what you should be using.  A Black Duck analysis of your codebase looks for inappropriate use of open source software in home-grown software.  Depending on which license (GPL, AGPL, other copyleft variants, etc) is applied to the open source code, your organization may be required to open source its software as well, which it may not want to do.  Black Duck looks for legal liabilities caused by the inappropriate introduction of open source software in your code.  A company I consulted for implemented MongoDB without appropriately understanding Mongo's GNU AGPL v3 licensing.  This caused a Black Duck audit issue and the company was forced to buy commercial licenses for Mongo to avoid open sourcing its product.  

If you did some research and didn't like "git submodule" there is another alternative: a sparse checkout ("git sparse-checkout" in newer versions of git).  Essentially this creates a kind of filter spec in your repo's config that says you only want to check out certain trees.  I've read that this can be dangerous when merging sparse trees so I've never tried it.  
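A sketch of what that looks like with the newer built-in command (git 2.25+ ships "git sparse-checkout"; older versions set core.sparseCheckout and edit .git/info/sparse-checkout by hand — repo and folder names here are invented):

```shell
set -e
work=$(mktemp -d)
cd "$work"

# A repo with two subprojects, standing in for the big enterprise repo.
git init big
cd big
git config user.email you@example.com
git config user.name "You"
mkdir projA projB
echo a > projA/a.txt
echo b > projB/b.txt
git add .
git commit -m "two subprojects"
cd ..

# Clone it, then filter the working copy down to one subtree.
# Note the full history still comes across -- only the checkout is sparse.
git clone big narrow
cd narrow
git sparse-checkout set projA
ls
```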

How do you revert a change in git?

The svn command I use most regularly is "svn revert", which essentially is the "undo button" for any changes you have made.  Your file(s) will be rolled back to their pristine BASE revision (the state as of your last checkout or update).  This is especially useful if you use SSIS or Visual Studio, where it is necessary to check out a file even if you only want to VIEW its contents.  In git the equivalent is:

git checkout -- <file>

This tells git to replace your local modifications to the given file with the committed version.  (The "--" keeps git from mistaking the file name for a branch name.)  

Handling file moves/renames/copies/deletes

It seems like every VCS handles file renames and moves less-than-optimally.  At a minimum, every VCS I've used has lost the file history of a moved or renamed file unless the move/rename is done from the VCS tool and not the OS or the IDE.  Git is no better here.  In fact, since git is less concerned with "files" than with "snapshots of the current repo", you don't even denote that a given change is a rename/move (there is a "git mv" convenience command, but it records nothing special...renames are detected heuristically after the fact).  This causes some SCM people to freak out that their history is lost.  It really isn't; you just have to "remember" that the older history is associated with the older file name.  

I never rely on my VCS to handle moves/renames nicely.  Instead I always do a move/rename/delete as its own change with a commit message that clearly denotes what I'm moving/renaming/deleting and where the destination is.  This will help the next person who is trying to find the older history.  

If you are using git from the command line rather than a GUI you will sometimes see a message after your git commit indicating that git determined a file was a rewrite or rename.  Git has a heuristics engine that determines this for you, but it isn't perfect.  If you rename a file at the same time you switch all CRLFs to LFs, for instance, the rename may not be detected.  It does work well with file copies, so git won't need to keep a duplicate of the content in the repo wasting space.  However, I can't think of many times when files are copied and maintained in a VCS.  

You need a git utility that looks just like TortoiseSVN

TortoiseSVN seems to be the svn gui that everyone uses.  The closest git equivalent is Git Extensions with shell extensions turned on.  I prefer just using gitgui and the git bash shell that come with msysgit, but if you need something closer to TortoiseSVN, Git Extensions is what you want.  

In the next post I'll cover one last git gotcha for git noobs that can have some serious ramifications.



Nifty Git Features

This is a continuation from Migrating Subversion to Git Part 2.  In this post I want to cover some really cool git features that no other VCS has.  These are git features that may not be well-known by everyone. 

Git Squash

In the last post I mentioned that you can have many local commits, perhaps after every save, and then instead of pushing all of those commits to the origin/master, simply make one big aggregated commit.  Git squash is how you do this.   

But first it's important to understand the different ways to integrate one branch's changes into another branch.  You can either merge them, or you can rebase.  The net result is the same...the target branch accurately reflects your code changes.  A merge maintains all of your smallish commits, plus a merge commit...a rebase "replays" your commits one at a time onto the target branch, producing a clean, linear history with no merge commit; squashing is what then collapses those smallish commits into one, which is cleaner to some people.  A rebase is similar to taking a patch file in svn and running it on a different branch.  
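The difference is easy to see in a scratch repo (names invented): the rebase replays each feature commit on top of the mainline, leaving a linear history with no merge commit:

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init demo
cd demo
git config user.email you@example.com
git config user.name "You"
echo base > app.txt
git add app.txt
git commit -m "base"
branch=$(git symbolic-ref --short HEAD)

# Two small commits on a feature branch.
git checkout -b feature
echo f1 >> app.txt
git commit -am "feature step 1"
echo f2 >> app.txt
git commit -am "feature step 2"

# Meanwhile the mainline moves on.
git checkout "$branch"
echo other > other.txt
git add other.txt
git commit -m "mainline work"

# Replay the feature commits onto the mainline: straight-line history,
# no merge commit.  ("git merge feature" here would instead produce a
# merge commit preserving the parallel structure.)
git checkout feature
git rebase "$branch"
git log --oneline
```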

There are various ways to accomplish a git squash, but this is where I really like GitEye, which is free.  GitEye is from CollabNet, the company long associated with Subversion itself.  GitEye's interface is a bit too visually-busy for me, but it makes git squash totally simple.  Without GitEye you need to resort to bash shell commands and vi/vim, which is NOT a utility for the faint-of-heart.  There are plenty of tutorials on the web regarding how to "git squash" using bash...I never had any success.  Here is where a gui helps.  

Squashing is a bit of a cumbersome process unfortunately, even with GitEye.  The maintainers of git, I believe, realized that as well and there is now "git rebase --interactive --autosquash".  The theory is that when you are ready to merge your work to the master branch you run this single command and it does it all.  It simply requires a little planning...your local commit messages must start with "squash!" or "fixup!" (the "git commit --squash=<commit>" and "--fixup=<commit>" options will generate these messages for you).  
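A minimal sketch of the autosquash flow (scratch repo, names invented; GIT_SEQUENCE_EDITOR=true accepts the generated todo list unedited, so the "interactive" rebase runs unattended):

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init demo
cd demo
git config user.email you@example.com
git config user.name "You"

echo one > f.txt
git add f.txt
git commit -m "add feature"

# A follow-up commit whose message marks it as a fixup of the first.
echo two >> f.txt
git commit -am "fixup! add feature"

# --autosquash folds the fixup! commit into its target automatically.
GIT_SEQUENCE_EDITOR=true git rebase --interactive --autosquash --root
git log --oneline
```

After the rebase the history shows a single "add feature" commit containing both changes.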

Even that is not idiot-proof enough for me.  I would expect that --autosquash is absolutely automatic.  It isn't.  I took that opportunity to "scratch the itch" and added an --autosquashSILENT option to our local copy of the git code that runs --autosquash silently.  In open source software, "scratching the itch" is when you change the code to fix your company's needs.  (I have not submitted this as a merge or feature request to the git maintainers...I did this mainly to learn git a bit better.)

In any case squashing commits is a great feature.  


git-svn

I briefly touched on this in an earlier post.  git-svn is essentially a bi-directional bridge to subversion.  As a developer you can use git as your client to a svn server.  This is how I started to learn and use git.  It allows developers to test the waters and use all of the cool features svn lacks (local commits and branching, use of the staging area, rebasing/squashing).  This is how we got our old fuddy-duddy developers on the git bandwagon...by making it a skunkworks project.  One little goofy thing is that your svn branches will not show in git because those are considered remote branches.  

git stash

I already wrote about this in a previous post but I love it so much it warrants an entry in this post.  git stash is the equivalent of a shelf set in TFS or a patch file in svn.  

git stash save "I was working on FeatureX before QA bothered me about some important ticket."

git stash list   //shows you available stashes (stash@{0}, stash@{1}, ...)

git stash pop stash@{0}   //re-applies the stash to your code and drops it from the list


.git/config

This is the equivalent of the .svn folder, but it lives only in the root folder of your project, not every subsequent subfolder like (older versions of) svn.  One annoyance of git is the constant prompting for user/pwd when doing a push or pull.  Instead you can add these entries to this file...something like this: 

[remote "origin"]
	url = https://someuser:somepassword@yourserver/yourrepo.git
	fetch = +refs/heads/*:refs/remotes/origin/*
Please note that this means you have a cleartext password in an unencrypted file.  But it does save a lot of time.  
git-annex

git-annex is a separate install and isn't something that I use for development at all.  git-annex (ga) allows you to manage files with git tools without actually checking the files into git.  It's also kinda like DFS (distributed file system) and BitTorrent all rolled into one.  I'm sorry, but I can't explain it much better than that.  Perhaps a use case will help.  If you are like me you probably have your important files scattered across several different locations/storage media:
  • My Documents
  • Desktop
  • A ubuntu machine
  • dropbox
  • fileshares
  • CMS Servers/WebDAV/ftp/OwnCloud
  • google drive 
  • SmartPhones
  • skydrive (or OneDrive or whatever it is called now)
  • Your spouse's laptop
  • Various USB sticks
  • Offsite archived data

With ga you create one central folder structure that can "point" (like a symlink or junction point) to each of these locations.  You can choose whether or not to actually store a copy in git.  You can also choose to have a given storage location be mirrored onto another annex location with ga.  ga will take care of the synchronization for you.  If you attempt to access a file in the ga that lives in a location that is not available you will be prompted to connect to one of the possibly many places that file may reside (dropbox and a USB stick for instance).  You can then backup everything to one drive that you store off-site.  

I have a ga structure that has a folder entry for every known location where I stash data.  I let ga worry about duplicating the data onto different media.  If I ever needed to wipe a file ga would tell me every location that has that file and I can delete it in the local repo and push that change to the other media locations.  

Other use cases:  

  • sync your podcasts from your phone to PC to tablet, etc.  
  • sync your pdfs and ebooks from device to device.  I keep a separate file that I can bookmark my current location and notes.  
  • You have a huge music library and you want your Android to have Zeppelin on it and your iPod to have Lady Gaga on it.  
  • Securely share all of your passwords with your spouse 

I find new use cases for git-annex almost daily.  git-annex assistant is a GUI tool that takes some of the guesswork out of the git backend behind ga.  If your spouse is not git-savvy this will make the transition to git-annex painless.


Getting SSIS Packages to Run Faster in Visual Studio

I thought this was commonly-known but I find that quite a few SSIS developers do not know this.  SSIS packages are known to be horrendously slow when running in debug mode (F5) in Visual Studio.  It probably has a lot to do with all of the UI painting.  If you really don't need to see debugging messages (perhaps you have a good logging framework you can rely on instead) and are really just executing your packages for a smoke test or whatever, there is a faster way.  CTRL+F5.  (Start without Debugging from the Debug menu).  The package will execute in a cmd window using dtexec.  Radically better runtimes (YMMV but mine is an order of magnitude faster) and also far less system resource consumption (memory utilization especially).  

This is a huge timesaver for me.  

Getting Started with Git

This is my third post in my git series.  In this post I'll give you my tips for getting started with git if you've never used it before.  


Migrating Subversion to Git Part 2

This is a continuation from my last post, Migrating Subversion to Git Part 1.  I'm continuing my list of git features that set it apart from other VCSs.  These are the core reasons why we migrated.  
Ability to Obliterate
Invariably some idiot will accidentally check in a HUGE avi file of their kids playing t-ball into your subversion trunk.  Once it is there it can never truly be removed.  You can delete the file, but it will always be there in history and can be retrieved if you pull that given revision number into your workspace.  That makes for an unnecessarily large repo.  
This can be problematic for other reasons.  Say someone accidentally checks in some prod passwords or an excel file with everyone's salary.  In git you can actually do an "obliterate".  There are multiple ways to rewrite history..."filter-branch" is one way.  I don't want to bore you with details but every commit is "replayable" with git so you can alter history by replaying every commit and altering it via script to suit your needs.  
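A sketch of one obliterate recipe using filter-branch (scratch repo, file names invented; newer git steers you toward the separate git-filter-repo tool for this, hence the squelch variable):

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init repo
cd repo
git config user.email you@example.com
git config user.name "You"

echo code > app.txt
git add app.txt
git commit -m "real work"

# The accidental check-in we want gone from ALL history.
echo "pretend this is a huge video" > tball.avi
git add tball.avi
git commit -m "oops"

echo more >> app.txt
git commit -am "more work"

# Replay every commit with the offending path stripped from the index.
export FILTER_BRANCH_SQUELCH_WARNING=1
git filter-branch --index-filter \
  'git rm --cached --ignore-unmatch tball.avi' HEAD

# Gone from every revision, not just the tip.
git log --oneline -- tball.avi
```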
Maybe your experience is better than mine but it seems like source code repositories never get backed up properly.  I've had to reconstruct 2 TFS repos in my career because no one bothered to back up the underlying SQL Server databases.  Each time we lost all history and had to merge developers' local copies to determine what the current trunk looked like.  Luckily we had all releases' source code as an artifact of our build process so we could still support our customers just fine.  I've also had to recover a svn repository once without a backup.  Not fun.  
git has no backup process (that I'm aware of) because it really isn't needed.  When you do a git clone/fetch/pull you get the full history of the entire repo.  You may think that would be huge, but it's actually much smaller than the svn equivalent.  git is totally distributed, remember...meaning you really don't need a replica or a backup.  
Ability to work with multiple repos on same project at once
If you download svn you can set up local repos and commit to those without even having a server.  Same with git.  Or you can "svn checkout" to your local folder and commit back to the server.  But in svn you can't commit locally and remotely at the same time without at least doing a "svn relocate" locally.  Here is where git really shines.  I can do a clone from a public repository (say codeplex) and then make a bunch of local changes.  I can then push them to my company's git server if I want to integrate that code into a company project.  My local repo can then fetch from the public repo as much as it wants and then integrate locally and push back to my company's server.  In git you simply change the "remote".  Very cool.  And this leads to...
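Switching between a public upstream and a company server is just a matter of adding a second remote.  A sketch with two throwaway bare repos standing in for the servers (all names invented):

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init --bare public.git      # stand-in for e.g. a codeplex repo
git init --bare company.git     # stand-in for the company git server

git clone public.git project    # "origin" is now the public repo
cd project
git config user.email you@example.com
git config user.name "You"
branch=$(git symbolic-ref --short HEAD)

echo v1 > code.txt
git add code.txt
git commit -m "local change"
git push origin "$branch"

# Add the company server as a second remote and push there too.
git remote add company ../company.git
git push company "$branch"

# From here on: fetch from either, push to either.
git fetch origin
git fetch company
git remote -v
```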
Local Commits
I like to commit every time I save a file if it is work that I am uncomfortable with or is risky.  After each save I can then run my unit tests.  With svn this is difficult because each commit is being sent to the server and I may not want those commits visible to the world.  (People tell me I commit too much and it's annoying).   Furthermore, if I'm working at the park without connectivity I'm kinda stuck again.  In svn the best you can do is save a bunch of patch files which is obnoxious.  
With git I can commit locally as much as I want and as long as I don't "git push" no one else will see them.  
Kind of.  
The problem is, when you do eventually "git push" you will push all of those small commits.  I have a trick to fix that in the next post, Nifty Git Features.  
Spartan GUI Tools
You may think it is odd that I would classify "spartan gui tools" as a reason to migrate to git.  The fact is, for what most developers use git for day-to-day (fetch/merge/commit/push), gitgui is perfect.  A few mouse clicks is all it takes.  Compare that to GitEye.  Its interface is so busy that I get a headache using it.  
gitgui is the "default" git GUI that you get when you install msysgit on Windows.  There are lots of other choices for git front-ends.  
GitEye is my favorite after gitgui.  It is absolutely great for the more complex, esoteric git commands and I do use it regularly.  I'll cover a few cases in Nifty Git Features.  But for day-to-day use nothing beats the gitgui interface.  
Amending Commits
Purists will tell you that you should NEVER amend a previous commit even if you could.  In most VCSs you can't.  In git you can.  git commit --amend.  I use this regularly.  How often have you checked-in some code and realized that you missed a file?  That requires a second commit and now your commits are not atomic with your "project".  Amending commits is perfectly safe if you follow some basic common sense rules.  Never amend a commit once it is pushed to a remote.  If you do you could have others committing on top of your now-amended commit.  That leads to confusion and disaster.  But I see no problem with a quick amended commit if it is safe and will ameliorate later confusion.  
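The forgot-a-file case looks like this in a scratch repo (file names invented); the amend rewrites the tip commit in place, so do it only before pushing:

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init demo
cd demo
git config user.email you@example.com
git config user.name "You"

echo main > main.txt
echo helper > helper.txt
git add main.txt
git commit -m "add feature"         # oops -- forgot helper.txt

# Fold the missed file into the previous commit instead of making
# a second, non-atomic commit.
git add helper.txt
git commit --amend --no-edit

git show --stat HEAD
```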
History Visualization
Visualizing the history of changes is important in a large software project.  If you have multiple supported releases for multiple customers it is helpful to have a good history visualizer to figure out where bugs were injected and every release branch that will need a hotfix.  I hate to paint with broad strokes, but TortoiseSVN's history visualizer is not that great (IMHO).  My biggest complaint is its slowness over large repositories.  It's also a centralized VCS so you must be connected to the svn server.  gitk is the git history visualizer and I find it faster and easier to understand.  My only complaint is that the SHA1 IDs for every commit waste a lot of screen real estate.  Otherwise gitk wins hands-down.  
How to actually migrate a subversion repo to Git?
The answer is "it depends" on your requirements.  Do you need all branches?  All history?  Just HEAD?  Obviously the less you migrate, the less potential for goofy errors.  We tried lots of tools including git-svn, which I'll talk about in a future post.  It works great as a git client for a subversion backend, but I think it is less-than-adequate for a svn-to-git migration.  
The tool we settled on was SubGit.  You start by creating a writeable git mirror of your subversion repository.  You can then use any tool you choose.  If you commit to git, the change will be bridged to svn, and vice-versa.  If you have strict initial rules that dictate that svn is the "official" repo for your organization, then you can rest assured that you can iron out any migration bugs without losing source code changes.  
Over the last few posts I've covered the features of git that make it a superior VCS IMHO.  In the next few posts I'll cover how to get started with git as well as some important differences with Subversion that may get you into trouble if you are not careful.

Migrating Subversion to Git Part 1

I assisted one of my clients in migrating a subversion (svn) repository to git (and gitlab).  Over the next few posts I'm going to cover the experience.  If you've never used git or a distributed VCS you might want to read these posts.  I'll cover why we moved to git, how to get git up and running quickly (even if it's just for your own professional development), and some major headaches you'll likely have with git if you are coming from svn or even tfs.  

What is git?

I'm not going to cover the basics of git.  I assume you know the basics already.  If you don't you should read Pro Git, which is free.  

Why learn git?

Let's say you have svn where you work and have no plans to migrate.  You think it works great for your needs.  The fact is, the rest of the world seems to be migrating towards git and other distributed VCSs.  If you ever want to work on a CodePlex project in the future you really need to learn git.  The same holds true for just about every open source project...git is the de facto standard.  Even if you don't care about open source, learning git will help you be a more productive developer, just as having a cursory knowledge of Oracle helps any SQL Server guy understand Microsoft's offerings, and shortcomings, in the process.  

Why migrate to git?

My client had a huge Subversion (svn) repository and it just wasn't working anymore.  I had used git before for things like CodePlex and Drupal, but had never actively done team development with it.  It seemed "ok" to me, but subversion always suited my needs and I felt no compelling reason to learn git past a cursory examination.  But everyone raves about it, and while I am certainly not an expert, if others are loving it and having success with it, it would probably be better for my client's needs as well.  I proposed migrating one of the smaller repositories to git as a proof-of-concept.  

This blog post and the next will list the problems and annoyances with svn that we were hoping to solve with git.  

Truly Distributed Version Control

My client was off-shoring more work and it was a nightmare maintaining a svn repo in both India and the US.  Git is truly totally distributed.  Now, at some point you'll probably need to have a true "master" version of the truth.  This is very easy to do.  We instructed our India team to do a "git push" to the US repo regularly.  The US replica would be the sole version of the truth and source of all releases, builds, and release branches.  

Fast (especially branching)

Everybody says that branching is "cheap" in central VCSs like svn or TFS.  Yes, the "branch" process is "cheap" (files are not actually copied; rather, a pointer to the original plus any "patch deltas" is all that is persisted).  But the real problem is the time it takes to switch a workspace from one branch to another.  This involves copying all of the project's files into a second directory.  Internal to svn the process is fast; in reality, for the developer, this was averaging about 10 minutes for our largest project.  Plus a lot of local disk space.  TFS works the same way.  This means that developers resist branching when they should be branching.  In git I have yet to see the branch process take longer than a second, even on really large repos.  

Work Totally Disconnected

How many times have you done a "svn up" and then tried to do some development at the park or on a plane?  It works fine until you need to compare your current version with some version from a couple of releases ago.  You can't do it in svn.  You can in git.  When you do a "git pull" (or fetch or clone) you get the whole history of everything locally.  If you do off-shoring and have spotty WAN links, this is a great way to keep your team productive.  
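Because the whole history is local, comparing against an old release needs no server at all.  A sketch (tag and file names invented):

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init demo
cd demo
git config user.email you@example.com
git config user.name "You"

echo v1 > app.txt
git add app.txt
git commit -m "release 1.0"
git tag v1.0

echo v2 > app.txt
git commit -am "release 2.0"
git tag v2.0

echo wip >> app.txt
git commit -am "current work"

# Both of these hit only the local repo -- fine on a plane.
git diff v1.0 v2.0 -- app.txt
git log --oneline v1.0..HEAD
```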

Branches are logically branches, not subfolders

If you have a project or build process that recursively processes folders, svn branching gets in the way because the branch is really another folder on your filesystem.  If you branch at the wrong place your recursive processing will break.  For instance, there are two generally accepted branching strategies with Subversion..."branches under the project" or "projects under the branches".  The graphic at right denotes the former, which is the far more prevalent branching strategy in my experience.  This is depicting the "calc" project which contains the "trunk" code (and each organization uses "trunk" a little differently) as well as a branches folder with subfolders for each and every branch.  If you were to do a "svn up" from the root "calc" folder, and you were branching properly, you'd have a LOT of files transferred locally, and you probably don't care about those older branches.  

Logically to the noob developer when you need to merge from "my-calc-branch" to "trunk" it appears as though we are merging files.  That's logically wrong.  It's all the same file, we are simply merging branched versions.  Subversion makes this conceptually muddy which is why people aren't branching as much as they should be.  

So, how does git do it?  See the graphic at right from gitlab.   Each folder is truly a subfolder.  There are no folders that are really branches like in subversion.  

Branching in git is handled "on a third plane", or a Z axis.  That third axis is equivalent to the branch.  When you look at your project's files in something like gitlab you'll even note that the branches are represented by a separate drop down that changes the file lists.  Logically this makes so much more sense to me.  A branch really isn't a folder and it shouldn't be represented as one.  

When you do a "git fetch" (or pull or clone) you actually get every branch that is available and you can browse them using gitgui (or whatever tool you want to use) just like you would with gitlab.  

Real Shelving (Stashing)

Ever have this happen?  You are diligently working on some task, with a bunch of source code files checked out and in a state of disarray, when someone approaches you needing something critical fixed NOW.  Yes, I know that a good kanban-er will simply reply, "add it to the backlog...I'm at my WIP limit", but the fact is life is never that easy.  Basically the requirement is to "save" your WIP somewhere else, revert it, and go work on the new issue.  

In TFS you have Shelf Sets (shelving).  Awesome feature.  The shelf process saves off your changes to the server, reverts your code, and when you are ready you simply unshelve your changes, merge if you have to, and continue working.  You can even share shelf sets with others.  The biggest problem is shelf sets don't travel across branches.  

In svn this can be tricky.  If you are doing your development in an isolated branch that contains only your changes, then you have no problem (and you really should be doing this...but again, branching seems to be under-utilized).  You simply go to a new folder structure, do a "svn checkout" of the given branch, and do your work there.   

But unfortunately, too many people don't take the time to branch their dev work locally when using svn because it takes too long (see above).  Now you have a problem because you need to revert your changes, but you don't want to lose them.  Your best bet in svn is a patch file but the patch file is not pushed to the server like TFS.  Don't lose it.  

This whole process seems way too difficult.  In git you have the "git stash" command.  Conceptually it works just like a branch but gives you some extra commands.  You could also simply use another local branch, because remember that in git branching really is cheap and quick.  
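The interrupt-and-resume cycle with git stash looks like this in a scratch repo (names invented; "git stash push -m" is the current spelling of "git stash save"):

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init demo
cd demo
git config user.email you@example.com
git config user.name "You"
echo stable > app.txt
git add app.txt
git commit -m "stable"

# Half-finished work in progress...
echo half-done >> app.txt

# ...interrupted: stash it, which also reverts the working copy.
git stash push -m "FeatureX in progress"
! grep -q half-done app.txt          # working copy is clean again

# Handle the urgent request on a clean tree.
echo hotfix > fix.txt
git add fix.txt
git commit -m "urgent fix"

# Pop the WIP back on top and carry on where you left off.
git stash pop
grep half-done app.txt
```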

I'll cover more reasons we moved to git in the next post, Migrating Subversion to Git Part 2.  


Pavlovian and Operant Conditioning

In my last post, Post-restore Process, I outlined a process my DBAs use to ensure a restored database is always setup properly (TRUSTWORTHY, ENABLE_BROKER, sync logins/passwords, etc).  The actual proc is ridiculously simple and there's nothing new there as far as TSQL goes.  The point is that the process is so patently obvious (to me), yet I see so many environments where things are not sync'd properly after a restore.  And this is at every client engagement.  So, why did I post something so obvious?  The hardest part of implementing a process change like this is the TRAINING.  Your DBAs have probably been following the same processes for years and are resistant to change. 

So how do you get behavior modification quickly and cheaply in Corporate America?  Use Pavlovian and operant conditioning.  I knew when I rolled out PostRestore at my company that our 30-year veteran DBAs would be resistant to a process change.  I've seen this behavior at many companies.  Just mention "agile" at a legacy software company and you'll hear the moans and groans, yet people will be receptive to the underlying ideas if they are not presented as being "agile".  That is resistance to change.  When trying to effect change in the past I tried training sessions, being overly nice, hootin' and hollerin', documentation, wikis, blogs...nothing worked.  So when I rolled out PostRestore a number of years ago for my last company, I decided to try a new approach.  

Pavlovian vs Operant Conditioning?

You've likely heard of Pavlov's dogs.  Pavlov rang a bell and followed it with a treat.  Pretty soon if Pavlov just rang the bell the dogs began salivating in anticipation of the treat.  Operant conditioning is reinforcing good behavior, or punishing bad behavior, to effect change, likely also by providing or removing a treat.  BF Skinner is most associated with operant conditioning.  The main difference is that Pavlovian responses are automatic, reflexive actions outside of the subject's control (salivation), while operant conditioning applies a reward or punishment ex post facto.  

How to Use Conditioning to Your Advantage

Grab a bag of candy.  It should be your favorite candy (might as well be something in this for you too).  It should be one type of candy, maybe even one brand.  A grab bag of miscellaneous Halloween candy may work, but not as quickly nor as well.  Your candy can be Smarties, Snickers, Skittles, pretzels, whatever.  But pick one.  We want our people to associate a behavior with this candy.  I used Snickers bite-sized candies (my favorite).  

When I most recently presented PostRestore (a number of years ago actually) I wanted to use Pavlovian and operant conditioning to get the DBAs to use this immediately.  There were too many non-prod envs that weren't properly set up, causing delays and hard-to-dissect bugs.  I passed around a Snickers to every participant every time I executed the proc during my demo.  Now, really, how hard is it to execute a stored proc after you restore a database?  It didn't take long before the subjects, er DBAs, were getting antsy.  They clearly felt I was wasting their time.  No matter, the goal is not training, rather behavior modification.  I then had a handful of the DBAs each try executing the proc for the group by coming up to my laptop and pressing F5 while I projected the screen to the group and provided positive reinforcement...and I gave each subject a Snickers for being a good assistant.  As people were leaving I gave them another Snickers as I reminded them to, "remember to always run PostRestore."  By now I was getting strange looks from the DBAs...why so much candy?  

This is a bit Pavlovian and a bit operant.  But I was just getting started.  I knew that invariably I would still have to deal with support issues for a few weeks because Service Broker was disabled (which was the hardest to diagnose due to its asynchronous, non-GUI nature).  The cause was always, of course, PostRestore not being run.  This meant I still had some behavior to modify.  I would then walk over to the DBA and tell him he didn't earn his Snickers today.  The first few times I did this I got a blank stare, followed quickly by that unmistakable lightbulb going off over his head...indicating he just remembered he forgot to run PostRestore.  This was followed by profuse apologies and guarantees that this would never happen again.  As I would leave I would toss a piece of candy at him.  This is obviously operant conditioning.  

So I had operant conditioning covered, but why not go a little Pavlovian too?  Right after my first training session I arrived at the office a bit early and placed a bowl of Snickers right in the middle of the DBA bullpen.  No note attached.  My hope was the visual image of Snickers sitting in the middle of the room would subconsciously help at least one DBA to run PostRestore.  I would pop over every morning for a few weeks and top off the bowl.  

Within a month, using the Pavlovian and operant conditioning steps I mentioned above, I never had to troubleshoot a problem due to a disabled broker on a restored db.  NEVER.  

That's behavior modification.  It didn't require any hootin' and hollerin'.  It didn't require a line item on an annual review.  It didn't require nasty emails.  It only required a little effort on my part and a sawbuck's-worth of Snickers.  

You may find this story funny, you may find it degrading...but you will find that a little conditioning will effect behavior modification.  


Here's a few reasons I know conditioning works.  

  1. We recently got a new batch of interns in our department and they were sitting through Basic Training with the DBAs.  I happened to be in the room for another meeting that was boring me to tears.  I overheard one of the DBAs say, "and every time you restore a database to refresh an environment make sure you run PostRestore."  I then saw him subconsciously reach over to the candy bowl and grab a piece of candy.  I did a spit-take with my iced coffee.  This occurred 3 years AFTER the initial training session.  
  2. I swear our DBAs have all gained weight.  
  3. One day I was passing through the DBA team room and I swiped a piece of candy from the bowl.  One of the smart alecky DBAs yelled after me, "Hey, if you're not gonna stay and help us restore these environments don't bother taking our candy!"  I just chuckled and kept walking.  

Post-restore Process

If you are involved in a lot of development work with a lot of environments then you probably get frustrated when you are given a new environment and it isn't set up correctly.  For instance, the CM guy (or DBA) forgets to sync the sql logins and db users.  Or, someone forgets to enable service broker.  Or, you use the CLR (sloppily, I might add) and your restored db isn't TRUSTWORTHY.  In all honesty, it's tough to blame the DBA or CM Guy.  We devs love to change our requirements constantly and turn on every whiz bang new feature that Microsoft gives us.  This causes the DBAs fits when creating repeatable environment builds.  I once consulted at a place that had a 3 page document on miscellaneous "tasks" that had to be performed in a newly restored database before it could be handed over as a non-prod environment.  

That's just crazy.  And it's difficult to maintain a document like that because at any given moment you could have multiple build documents based on how many releases and new features you need to support (like Service Broker).  Or you may have customers at various releases that need different processes executed.  

The solution is simple.  

  1. Create a PostRestore stored procedure.
  2. Compile it in your db.  
  3. Train your DBAs and CM guys to ALWAYS run this proc whenever a db is restored.  Any db.  

What are the benefits of doing this?  First of all, you can distill that 3 page restore manual into a simple stored proc call for them.  That's a VERY good example of DevOps.  Second, you have the ability to "version control" your db restore process.  If you decide you need to ENABLE_BROKER in the next release of your software you simply make that change to your PostRestore proc and check it in with that release's code.  Now your DBA/CM Guy doesn't even need to know this is a new "feature" you are enabling/deploying. 
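My actual proc isn't shown here, but a minimal sketch might look something like this (the specific settings and the example user name are illustrative assumptions; tailor the body to whatever your app actually requires):

```sql
-- Hypothetical PostRestore sketch, not the author's actual code.
-- Each release, check in whatever settings that release depends on.
CREATE PROCEDURE dbo.PostRestore
AS
BEGIN
    DECLARE @db sysname = DB_NAME(), @sql nvarchar(max);

    -- Re-enable options that a RESTORE silently resets
    SET @sql = N'ALTER DATABASE ' + QUOTENAME(@db) + N' SET TRUSTWORTHY ON;';
    EXEC (@sql);

    SET @sql = N'ALTER DATABASE ' + QUOTENAME(@db)
             + N' SET ENABLE_BROKER WITH ROLLBACK IMMEDIATE;';
    EXEC (@sql);

    -- Re-map db users orphaned by restoring onto a different server
    -- ('MyAppUser' is an example; loop over sys.database_principals in practice)
    EXEC sp_change_users_login 'Auto_Fix', 'MyAppUser';
END
```

The point isn't the specific statements, it's that the whole 3 page manual collapses into "run dbo.PostRestore" and the body is version controlled with the release.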

It would be really great if Microsoft gave us a way to run this automatically whenever a db is restored.  I've tried server triggers and custom events and have not found a better way to do this, short of creating a SQL Agent job that runs every minute looking for a newly restored db and then running the command...and that seems insane.  Somehow this seems like it should be a part of a contained database.  


NoSQL Evaluation Summary Post

Over the last few months I've been summarizing my experiences with NoSQL products.  I have been evaluating these products as alternatives to SQL Server to save some licensing costs for one of my clients.  I tried to be as objective as possible as I built my proofs of concept on these products.  In every case I'm happy to say that the POCs have been a success but I believe my client will be staying with SQL Server for now.  Some of these POCs and posts I have been working on for over a year.  

Here are all of the posts in the series:

Will MS Licensing Drive Customers to NoSQL?:  A summary of why I'm doing these POCs and what I hope to accomplish.  

Are some new SQL Server features a response to the NoSQL movement?:  Covers some interesting new features of recent SQL Server releases that appear to be directly challenging some of the features of various NoSQL solutions. 

What exactly is wrong with SQL and RDBMSs?:  What makes these products so enticing to developers and management?

Tutorial D, D, Rel and Their Relation to NoSQL:  What is driving the NoSQL movement...is it that SQL is just too darn hard to use and not really expressive enough for relational problems?  I explore some shortcomings of SQL in this post.  

Data Models and Data Organization Methods:  compares how data is persisted and modeled in these non-relational models.  Don't try to shoehorn a relational design into a NoSQL solution.  

Eventual Consistency or ACIDs and BASEs:  You have to give up ACID properties (some of them anyway) when you adopt a NoSQL solution.  In their place you get BASE properties.  I cover those in this post.  

Handling Conflicts with Eventual Consistency and Distributed Systems:  One of the key tenets of BASE is "eventual consistency".  This freaks out ACID people.  In this post I'll cover why this isn't so scary if you understand how it works under the covers, and some limitations.  

The CAP Theorem:  Consistency, Availability, and Partition Tolerance...you can have 2 of these 3 with any data management platform...RDBMS included.  We'll go over which of these you may be able to sacrifice and then which NoSQL products will support your requirements.  

Sharding:  Sharding is one of the key methods used by NoSQL products to handle CAP and BASE issues.  Sharding is more than horizontal partitioning across nodes.  I cover sharding in this post.  

SAP HANA Evaluation:  This is the first POC I did and it was wildly successful.  The SQL syntax is much like TSQL and it really is blazing fast.  I'll show you how to setup your own HANA evaluation instance in the cloud.  

Graph Datastores:  covers what a graph datastore is and when you might opt to use one.  

Hekaton:  briefly goes over this new feature in SQL 2014 and how it is good for some NoSQL applications.  

Windows Azure Table Service:  this is a schema-less key-value store that Microsoft offers in the cloud.  

MapReduce for the RDBMS Guy:  I go over the minimum that every good relational data architect needs to know about MapReduce.  

Parallel Data Warehouse as a NoSQL Alternative:  this is another specialized offering from Microsoft that is very NoSQL-like.  It essentially gives you parallelized query execution on multiple nodes.  

MongoDB and CouchDB:  this briefly covers my evaluation of schema-less document stores.  

HBase: is a step-up from Hadoop, IMO, for cases where you expect a lot of random reads.  

Querying NoSQL with Hive and Pig:  I found HiveQL to be very easy to learn if you understand SQL.  This makes MapReduce SO MUCH EASIER to use when you need to get up to speed quickly.  

More NoSQL Solutions: is a brief write-up on some other products we evaluated.  We found quickly that they did not meet our requirements so we did not pursue them in earnest, but it's good to know they are available.  

Apache Flume:  is how you "stream" data into your Hadoop instance.  This is also the easiest way to load test data for POCs.  

MySQL and PostgreSQL as NoSQL alternatives:  if you are just looking to save on licensing fees you may want to consider an open source alternative to your RDBMS vendor.  Or, consider forking the source code and making these alternatives fit your exact requirements...if you are THAT bold.  

On the Future of NoSQL:  I make some predictions about the bright future for NoSQL by briefly looking at the history of data management that led us to the ubiquity of relational data management and SQL.  

On the Future of NoSQL

In this NoSQL post I want to share my predictions for the future of NoSQL.  But to understand my predictions you must understand the history of data management.  This will be REAL brief.  It kinda all starts with Codasyl.  Codasyl (Conference on Data Systems Languages) was a consortium started around 1959 to develop a standard data programming language that could run anywhere and was easy to learn.  The first attempt gave us COBOL (COmmon Business Oriented Language).  Within 10 years COBOL evolved to include a DML and DDL language that, though not relational (more like a network model), worked well.  At some point there was a realization that a specific higher-level declarative "data" language was needed.  This was the early 1970s and IBM was looking for a data language for their relational System R and implemented and championed SQL.  That was a VERY concise history lesson.  Codasyl is mostly dead today due to the advent of relational models and "SQL" but it does have some niches.  Healthcare (one of my areas of expertise) is one.  In my mind this means that either a) businesses are too cheap to rewrite these apps into something more modern/relational or b) they just work and relational/RDBMSs maybe aren't the best solution.  

But why did relational managers take off while network data managers died out?  The biggest reason is that most business problems are better modeled relationally.  Next, relational vendors had all adopted SQL. Network/hierarchy models had disparate, fragmented languages.  Suddenly things were cross-vendor-capable in the relational world, and even when vendor SQLs weren't perfectly compatible (even today PL/SQL is definitely different from TSQL) at least the learning curve wasn't very steep (for instance, JOIN syntax is a little different among vendors, but semantically the same).  That cannot be said for hierarchical and network managers.  They each had their own language.  Some looked Algol-ish, others looked like Fortran, etc.  No standards = less adoption.  

The lesson is that low-level user interfaces for data manipulation are not a good idea if you want market adoption.  

SQL, for all its faults, is easy to learn.  Not so for network and hierarchy managers.  MUMPS (a hierarchical system that is older than me and still used in healthcare today) has a horrendous language (called M) to learn.  

If you've never written a "query" without a declarative language, like SQL, on an old-time database manager...well, you're missing out on some real fun.  Imagine writing code to open a "file" on a tape and implementing a "nested loops" construct to join it to another data file.  You might have heard of this before...it's called ISAM (or, later, VSAM, which was a little better) and it is up to you to understand the physical retrieval of the data versus just declaring what you want (like SQL).  Worrying about physical retrieval of data is fraught with danger.  Lots of bugs if you are a novice and don't understand hashing and indexing.  
These physical data retrieval issues were soon identified and rectified after ISAM was first released.  It didn't take long before IBM realized that a "query optimizer" embedded in the product could determine for the business user how best to tackle a query.  And then ISAM got popular.  
To repeat...the lesson is that low-level user interfaces for data manipulation are not a good idea if you want market penetration.  Higher-level languages brought relational databases to the masses.  You didn't need to be a programmer to use them.  The same thing will happen with NoSQL.  That's my prediction.  When the process of writing MapReduce jobs is no more complicated than writing a HiveQL statement, then NoSQL products will really take off.  
We seem to be repeating history a bit and re-learning the lessons of the past.  Writing MapReduce seems to me to be equivalent to writing ISAM code.  It's tedious, can perform poorly if your knowledge of how data persistence works isn't good, and is prone to error.  This isn't the first time our industry has retrogressed in data management.  15 years ago XML was being touted as the app that would finally kill the relational database.  Obviously "tag bloat" is probably the biggest reason that never occurred, but another reason is simply the effort and learning curve needed to query XML.  It's getting better, but it's still not easy enough for dumb guys like me.  And look at how many methods exist to query XML...XQuery, XQL, XPath, XSLT.  That's not user-friendly...and user-friendly is how you win business.  
So what NoSQL needs, IMHO, is a standards-based approach to querying that the average business person can use.  Something like HiveQL.  And a lot of the vendors understand this too and are developing really good tools that are close enough to SQL that the average business person will enjoy learning them.  
I did.  
But the problem is there is no standard, so each vendor's query language can be very different.  The common denominator is that they all try to model themselves on SQL.  
I've already written about HiveQL as an example of a declarative language.  Another example is CQL, which is Cassandra's query language.  I did not get a chance to work with CQL closely, but the DML syntax, at least, is almost exactly like ANSI SQL.  There are of course extensions to the language to handle those items unique to Cassandra and NoSQL that you won't see in a relational language, such as BATCH (makes multiple DML statements work atomically), CREATE KEYSPACE (roughly a database/schema definition), and CONSISTENCY (concurrency isolation level control).  There's the rub...those "statements" that are specific to non-relational products need to be standardized.  A KEYSPACE in one product is a DOCUMENT in another.  
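To give a flavor of how close CQL sits to ANSI SQL, here's a hedged sketch (the keyspace, table, and data are made up, and syntax details vary by Cassandra version; run this in cqlsh, not a relational engine):

```sql
-- KEYSPACE is the Cassandra-specific part; everything after looks like SQL
CREATE KEYSPACE demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE demo.users (
  user_id  uuid PRIMARY KEY,
  name     text,
  email    text
);

-- Plain ANSI-looking DML...
INSERT INTO demo.users (user_id, name, email)
VALUES (uuid(), 'Dave', 'dave@example.com');

-- ...plus NoSQL-specific extensions like BATCH for atomic multi-statement DML
BEGIN BATCH
  INSERT INTO demo.users (user_id, name) VALUES (uuid(), 'Alice');
  INSERT INTO demo.users (user_id, name) VALUES (uuid(), 'Bob');
APPLY BATCH;
```

A TSQL person can read that cold, which is exactly the point: the learning curve is in the extensions, not the core language.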

As of late 2011 Couchbase even has a "SQL" dialect originally called UnQL (pronounced "uncle").  They have since renamed it to N1QL (pronounced "nickel").  It is still in beta but shows promise.  Again, we dismissed Couchbase due to requirements, otherwise I think it would've been interesting to learn nickel (and hopefully I wouldn't have to cry "uncle" trying to get it to work).  N1QL also provides a series of DDL commands, just like ANSI SQL.  Couchbase has an excellent tutorial with sample documents and how to query them effectively.  

There's a trend here.  More complete higher-level languages will win more converts to NoSQL.  In fact, if we could somehow get a proposed NoSQL language standard, like Codasyl championed, we may see mass adoption.  Writing MapReduce or curl or Java just won't cut it.  
Lastly, unlike most of my RDBMS colleagues, I do believe there is a need for NoSQL products on both a logical and physical level. On the relational theory side, there are some things that just can't be modeled well relationally.  Graph databases fill one such void.  Think about the web of connections in your LinkedIn profile.  It's difficult to model these relationships relationally.  This is probably because most modelers (myself included) feel more at home with physical modeling than logical modeling.  The physical model for something like LinkedIn, if done in a purely relational database manager, would be horrible.  In a graph db things are much easier.  Modeling hierarchies and transitive closure-type relationships is conceptually difficult for most people in a relational model.  Why not just stick with a product that is actually hierarchical?  
Regarding implementation/physical rationales for NoSQL, as we've seen in these posts, there are cases where the physical persistence implementation by the major RDBMS vendors is not always optimal.  There are excellent use cases for columnstores for instance.  Or totally in-memory RDBMSs (SAP HANA).  Or RDBMSs that natively support sharding while hiding the implementation details (Oracle grid stuff).  If there weren't, then we wouldn't see these features being bolted-on to the RDBMS vendors' products with every new release.    
I think the future of NoSQL is bright and I'm looking forward to working with these products and better understanding the technologies and theories behind them.