DaveWentzel.com            All Things Data

April 2014

SSIS Subpackages and Version Control

I use subpackages in SSIS for just about everything.  But I've noticed that I'm in the minority on this.  Instead, I see these huge monolithic SSIS/DTS packages that attempt to do a lot of different things.  Many people have written much about why you should consider moving more logic from your main package(s) to subpackages.  I'll briefly rehash that here.  One reason for using subpackages I rarely see mentioned is that it really helps with version control.  

Common Reasons for Subpackages

  1. Separation of concerns.  It is best to keep your "extract" logic separate from your "transform" logic.  This is just good fundamental design.  If you would normally use a separate function or module in your code, consider using a new SSIS package. 
  2. It aids unit testing, which is already extremely difficult with SSIS.  

One Big Pitfall with Subpackages

Just like with any other modern programming language, subpackages/modules/functions have certain scoping and lifetime issues that you need to account for.  A child package has access to everything declared in the parent package.  The problem is that child packages cannot alter certain variables in parent packages.  Also, child packages cannot pass messages back to the parent other than a simple SUCCESS or FAILURE.  These issues don't come up often, and with some creativity you can bypass them by storing "state" somewhere else instead of passing it around.  

Version Control

The most overlooked reason for using subpackages is version control.  SSIS is horrible when it comes to version control.  If you use any of the source control plug-ins for Visual Studio you'll immediately notice that even to VIEW the properties of a task requires the .dtsx file to be checked out.  How ridiculous is that?  And even simple changes like changing a default package variable value can result in many esoteric changes occurring to the underlying .dtsx file (which is XML).  So a visual diff is rendered worthless if you wanted to do something like a "svn blame" (determine who broke what, when, using a visual compare).  

SSIS Team Development Best Practices

  1. Avoid having multiple team members work on the same dtsx packages during a given release/sprint.  Due to the visual merge/diff issues just mentioned you WILL have a difficult time merging discrete changes to different branches.  Plan your development accordingly.  In fact, don't even bother to try to merge some changes to other branches.  It won't work.  For merging purposes, treat your .dtsx files as binary objects, not as source code.  
  2. Use BIDS Helper and its SmartDiff utility.  It will help you separate the really key changes you have made from mere formatting changes and internal SSIS versioning noise within the XML.  I can't say enough good things about this.  
  3. Your developers should make small, discrete changes per commit and avoid mixing layout changes with functional changes.  You want to limit how much change is done per commit.  This is why subpackages are also so critical.  If you must go in and reorganize the control flow layout, just make sure you do that in a separate commit.  This makes finding logic bugs easier later.  
  4. Every package has various Version properties.  Depending on your build tool and your environment these may be autopopulated for you.  Regardless, I find it helps to populate these values whenever I make a change to a package.  If you automate this, or at least try to enforce this rule on your team, you have a nice version control mechanism embedded into the file.  I know that embedding versioning info in source code is generally frowned upon (we can get that information from the VCS repo), but I think it makes sense with dtsx XML files that have the inherent versioning problems I've mentioned thus far in this blog post.  
  5. It's also interesting to note that if you use the project-based deployment mechanism (.ispac) using Integration Services Catalogs you get some additional versioning information that can be helpful when troubleshooting.  You can see this by right clicking your Project under Integration Services Catalogs and selecting "Versions...".  In the screenshot I can see that Version: was deployed at '4/1/2014 at 10:38AM'.  

You have just read "SSIS Subpackages and Version Control" on davewentzel.com. If you found this useful please feel free to subscribe to the RSS feed.  

Upgrading to Drupal 7 on WAMP

Upgrading Drupal 6 to Drupal 7 can be a real PITA.  As with a lot of open source software...the initial price is right, but support can be a nightmare.  Upgrading Drupal is a case in point if running on WAMP (Windows, Apache, MySQL, PHP) like I am.  I found no definitive place that spells out EXACTLY how to do this with all of the caveats and gotchas adequately covered.  My intention is not to pollute google's search results with another post for noobs ("First, download this, then run that"), rather to hit the things that aren't so obvious.  I wanted the upgrade to go fast...I use a lot of contrib modules and didn't have the time to do a manual upgrade.  I wanted to use Drush.  
First off, if you can, consider migrating to LAMP first.  You'll find much better resources for Linux than Windows on the interwebs...at least when it comes to Drupal.  Since all of my other websites are on Linux/Apache I probably should've done this first.  Drush definitely works better on Linux.  Every upgrade problem I have run into was due either to a module not working properly after upgrading...or drush.  There is even a bug where you can lose your entire db if drush is not properly configured (more on that below).  
It's been 5 years since I migrated to Drupal 6 from SharePoint 2003 and it was time to freshen up this site.  I still want to change the theme but that will come later.  My last theme upgrade to this site was 4 years ago.  My homepage still displays a rotating banner of my oldest kid on his first day of 1st Grade.  He's in 5th now.  
Getting Ready
  1. Go read this.  But note that it is not entirely accurate, at least not on WAMP.  And it assumes you know quite a bit about drush, php, and Drupal.  
  2. Install Drush 5.x (NOT 6...it is not compatible).  If you installed 6 you'll need to uninstall it, which can be difficult depending on whether you installed it with pear, git, or the msi installer (recommended).  drush --version should be >= 5.4 and < 6.  
  3. Create a brand new VirtualRoot site on your server and a blank, brand new MySQL db.  
  4. drush dl drush_sup  (drush_sup is short for site-upgrade...like "rd" and "rmdir") .  You can read about site-upgrade here.  Make sure you run this from your drush installation folder, not your site folder. 
  5. drush cc drush (clear the cache) 
  6. cd /d "your_site_folder" (basically, where your Apache VirtualRoot points to)
  7. drush sup (so, install it in your drush folder, but run it from your website folder...otherwise, lots of errors).  This command is merely going to report to you how much of a PITA a D6 to D7 upgrade is going to be.  You are looking for entries like "NO 7.x RELEASES".  When you see that then you need to figure out what functionality won't work for your site on Drupal 7.  I had a whole bunch of these and I did a successful test upgrade anyway.  It's always worth giving it a try and continuing to the next step.  
  8. Now you need to create drush site aliases (sa for short) for the FROM and TO sites.  You always migrate (copy, really) a D6 to a D7 site.  This is why we created the shell db and shell site above.  First run drush sa (from your D6 site) to see your aliases.  Your new site won't be aliased there.  Let's fix that.   First run drush site-alias @self --full .  You want to doublecheck that you are "bootstrapped" to the right drupal directory.  You should see your uri and root.  Just confirm that it is right.  
  9. drush status will tell you where your alias files reside.  If you don't have one then go to \drush\examples and you'll see examples.drushrc.php.  Copy this to \drush as aliases.drushrc.php.  
  10. Add this code, changing as needed for your env.  
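The alias code this step originally referred to isn't reproduced here, so below is a minimal, hypothetical sketch of what the entries might look like.  Everything in it is a placeholder (the dw6 root/uri, and the dw7 root/uri for the blank D7 site created earlier); only the @dw7 alias name matters because it is used in the upgrade commands later in this post.

```shell
# Hypothetical alias file -- all roots and uris below are placeholders
# for your environment.  Write it into your \drush folder.
cat > aliases.drushrc.php <<'PHP'
<?php
// the existing D6 site you are upgrading FROM
$aliases['dw6'] = array(
  'root' => 'C:/wamp/www/davewentzel.com',
  'uri'  => 'http://davewentzel.com',
);
// the blank D7 site you are migrating TO
$aliases['dw7'] = array(
  'root' => 'C:/wamp/www/dw7',
  'uri'  => 'http://dw7.local',
);
PHP
```

After saving this, drush sa should list both @dw6 and @dw7.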
Now install Drupal 7  
By now you have a blank Apache site and drush installed.  Now we need Drupal 7 in that blank site.  
  1. cd /d <root of your new blank Apache site>
  2. drush status.  You'll see some Drush and php data but nothing on drupal yet.  
  3. drush sa @self should return nothing.  You don't have anything in this site yet.  
  4. drush dl drupal.  This will download the latest drupal 7 stable release (as of this writing).  It will place the folder structure under your site root.  
  5. cd <new downloaded folder>
  6. drush site-install standard --account-name=admin --account-pass=blah --db-url=mysql://root@localhost:blah/db_name
  7. do NOT get cute with the account password.  PHP has restrictions on the characters.  Make it a simple pwd for now.  
  8. In your browser, navigate to your new site.  No need to customize it yet.  
  9. drush status and just confirm everything
The Upgrade
  1. cd to your d6 site folder again
  2. drush status and doublecheck everything
  3. drush sa should show @dw7...let's check it with
  4. drush sa @self  doublecheck ...I know we doublecheck a lot but there is a bug in sup where your entire database is dropped if you don't get the syntax perfect.  (Yeah, check your backup now too)
  5. After saving the php file go back to your cmd window (in the root of your bootstrapped site) and run drush sa again.  Your new site should be on the list.  If you get an error then you didn't format your php array properly.  Now run drush sa @newalias.  
  6. drush @self site-upgrade --prompt-all @dw7
  7. choose 1...re-use the existing code  (although 2 may work too, I didn't test it)
You should be able to figure out the rest from here.  I did not have to deviate from the documentation from this point.  Hope this helps someone.  



One Last Git Gotcha: The Merge Pattern

In general I love git for all of the reasons I've outlined in the previous posts in this series.  There are some gotchas that can be confusing for developers moving from a centralized VCS.  None of these are a huge deal.  So far.  Today I'll post some cases where we've experienced disasters using git and some lessons learned that may be valuable for others.  Sometimes the strengths of a product ("distributed" source control and merge workflows) can also be its weaknesses if guardrails are not put in place.  

Have an authoritative repo

The distributed nature of git is generally viewed as a blessing.  A developer can pull and commit to any of many possible repos in an organization and in theory will "eventually" have a complete copy of the source code.  In reality I've found that doesn't work.  It is necessary to have an "authoritative repo" that has controlled access and is the basis for official builds of release(d) software.  The authoritative repo (ar) should only allow pushes from a limited number of people to ensure we don't accidentally merge bad code into it that is difficult to undo.  

Consider disallowing git push --force

To start with, never allow your ar to receive a git push --force.  The --force overwrites the structure and sequence of commits and greatly increases the chance that previous commits are lost.  By default git allows --force, but it can be turned off with git config --system receive.denyNonFastForwards true.  An ar should always deny this.  
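A minimal sketch of locking this down (the repo path here is a throwaway; on a real server you would run the config command inside your actual bare repo):

```shell
# Illustrative: create a bare "authoritative" repo and deny
# non-fast-forward (i.e. forced) pushes to it.
ar="$(mktemp -d)/ar.git"
git init --bare "$ar"
git -C "$ar" config receive.denyNonFastForwards true
```

Any git push --force against this repo will now be rejected with a "non-fast-forward" error.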

It might be valuable to allow this on other repos however.  On local repos there are good use cases for this.  But on any repo where others have had the ability to run a git pull/fetch you WILL cause grief for those developers if you do a git push --force.  Effectively their subsequent git push will have a different HEAD and cause confusion and possibly lost commits.  

Noobs can badly screw up merges if they don't understand git

Assume the following scenario:  

  1. git pull to "get latest" (let's assume you are now at "Version 123")
  2. work and commit changes to FileA to the local repo for the next few days.  
  3. You are ready to integrate your work into the public repo.  You do a
    1. git pull (assume the public repo is at Version 130), which attempts to merge any other changes from Version 124-130 into the local repo for FileA, FileB, and FileC.  
    2. It takes you a few hours to handle the merge conflicts in FileA...FileB and FileC you have not changed so you safely IGNORE those changes.  You are now ready to git push.   
  4. Do another git pull (public repo at Version 131), which attempts another merge.  
  5. git commit
  6. git push

This is known as the "merge workflow" and it is the most popular way to use git in a multi-committer environment.  

At Step 5 you will see that git stages FileA (the conflict you resolved) along with FileB and FileC as changed, even though you never edited B or C.  

If we were using svn we would perform a svn commit on FileA ONLY since we made no changes to B and C...those were made by others.  This is because svn is a "file-based VCS".  git is a "snapshot-based" VCS so we want to commit A,B,C even though we didn't change B or C.  Why?  Because we wish to commit the state of the snapshot and don't concern ourselves with the fact that we are committing files we didn't change.  

That's counter-intuitive to svn experts.  If you use git gui or git bash this is all done for you and difficult to screw up.  You simply see staged changes that you commit...there is no option to "remove" a change from the changeset.  But if you use other GUI tools you may see options to remove changes you haven't made...just like you see in svn.  Resist the urge to overthink this.  Just commit and push the snapshot as it exists, as you tested it.  In svn it is common for a developer to scan all the files being committed and NOT commit the ones he knows he didn't change, because they were already committed by others.  If you do that in git you are effectively hitting the UNDO button on every change between Version 123 and current for the files you did not change.  
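The scenario can be reproduced with throwaway local repos.  This sketch (all repo, user, and file names are invented, and it skips the FileA conflict for brevity) shows the key point: your pull merges a coworker's FileB change into your repo, and your push must carry that change back up as part of the snapshot.

```shell
# Throwaway demo: a pull merges someone else's FileB change into your
# local repo, and your push sends the whole snapshot (FileA AND FileB) up.
set -e
tmp=$(mktemp -d); export HOME="$tmp"; cd "$tmp"   # sandbox git's global config
git config --global init.defaultBranch master
git config --global user.email demo@example.com
git config --global user.name  demo
git config --global pull.rebase false             # the classic merge workflow

git init --bare public.git                        # the shared repo
git clone -q public.git you
( cd you && echo base >FileA && echo base >FileB \
  && git add . && git commit -qm "Version 123" && git push -q origin master )

git clone -q public.git coworker                  # a coworker changes FileB...
( cd coworker && echo coworker-edit >FileB \
  && git commit -qam "Version 130" && git push -q origin master )

cd you                                            # ...while you only touch FileA
echo your-edit >FileA
git commit -qam "my FileA work"
git pull -q --no-edit origin master               # merges the FileB change in
git push -q origin master                         # push the snapshot: A and B
```

A fresh clone of public.git now contains both your FileA edit and the coworker's FileB edit; had you stripped FileB out of the final commit, their change would have been undone.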

The "lost" changes still exist in git, just in a different commit that you'll need to re-merge...usually after you notice bugs and missing features after a few builds.  This behavior is totally different from any other VCS I've ever worked with, including ClearCase, VSS, TFS, SVN, and Perforce.  Perhaps Mercurial or Bazaar or other VCSs I'm not aware of work similarly to git...but this causes a lot of grief for devs that try to make git work like svn.  

Solutions to overcome this merge problem:

  • Education works best.  Tell your devs to not overthink it.  Push the snapshot that you tested locally.  Let git manage what to push.  We also run a CIT process on the non-authoritative repos and check for discarded merges (by running unit tests) before we allow pushes to the ar.  
  • git pull often.  Very often.
  • Try to avoid having multiple devs working on the same files/sections of code.  Easier said than done. 
  • Avoid merging, which is contrary to git best practices, I know.  Instead, use rebase as much as possible.  Specifically, we ensure all commits are rebased locally so newly pushed commits always sit on top of other devs' commits on HEAD.  Merging is much reduced then, and all of the smaller, local commits are not propagated to the public repo, where they would add confusion for people scanning the git logs (which is why I like squashing commits, which is really just a form of rebase).  
  • Do what github does.  Limit who may commit to an ar.  Here is the github workflow...called the "gatekeeper model" or "maintainer model":
    • only a "maintainer" may push to the important branches on the ar.  The maintainer should be intimately aware of the source code AND the idiosyncrasies of git merging (such as the issue I've described above).
    • users clone the ar at anytime and can push to the non-ar at will.  When all tests pass on the non-ar...
    • the dev issues a "pull request" to the maintainer.  The maintainer pulls the changes into the ar from the dev's repo (or, likely, the non-ar).  
    • The maintainer can review the pull request and ensure that the merge is clean.  If not the maintainer can fix it, or more likely send it back to the dev for rework.  
    • Now if anything breaks in the ar it is the maintainer's problem...not that of a git noob who just discarded a bunch of commits.   
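The rebase habit from the bullets above can be sketched with throwaway repos (all names invented): the local commit is replayed on top of the other dev's commit, so the push is a fast-forward and the shared log stays linear with no merge commits.

```shell
# Throwaway demo of rebase-before-push: local work is replayed on top of
# another dev's commit so the shared history stays linear (no merge commits).
set -e
tmp=$(mktemp -d); export HOME="$tmp"; cd "$tmp"   # sandbox git's global config
git config --global init.defaultBranch master
git config --global user.email demo@example.com
git config --global user.name  demo

git init --bare origin.git
git clone -q origin.git dev1
( cd dev1 && echo base >a.txt && git add . \
  && git commit -qm "base" && git push -q origin master )

git clone -q origin.git dev2                      # dev2 starts from "base"...
( cd dev1 && echo more >>a.txt \
  && git commit -qam "dev1 work" && git push -q origin master )

cd dev2                                           # ...and diverges locally
echo hi >b.txt; git add .; git commit -qm "dev2 work"
git fetch -q origin
git rebase -q origin/master                       # replay on top of dev1's work
git push -q origin master                         # fast-forward push, no merge
```

git log on the shared repo now shows a straight line of commits, with dev2's work sitting on top of dev1's.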

If you ever have a symptom where it appears as though git "lost your work" it is probably due to discarded merges.  

General rule with git:  If you scratch your head wondering why git is trying to merge and push more files than you know you've edited, you should probably just push the files anyway!  That seems counter-intuitive but it works if you remember that git is snapshot-based vs file-based.  (Perhaps the better rule is to seek help from others first.)  

Said differently, when you do a git pull on top of your work you MUST then push those changes back up or they are undone/lost.  
