DaveWentzel.com            All Things Data

Data Architecture

What do you use for your work machine?

It used to be that companies didn't offer laptops unless the employee was a road warrior.  Then laptops became ubiquitous and desktops/workstations were no longer given out.  What was old is new again...I see a trend back to workstations for developer-class machines, with a low-end laptop/netbook also issued for traveling, home, and conference room use.  There are many reasons, but primarily it's cost-prohibitive to get a 64-bit laptop with enough RAM to develop modern web-based apps against.  A workstation is cheaper and better equipped.  The netbook is then issued mainly to remote in to the workstation.  I really like this setup since I don't need to power up/down, which takes forever when you run Windows 7, VS, and SQL.  In fact, I run PeppermintOS on the netbook, which boots in under 5 seconds and can RDP directly to my workstation over our VPN software.  It's nice to be able to quickly connect at home at night and get some quick work done without having to power up, fire up VS, etc.  

Using PeppermintOS it's also possible to recycle older laptops that can't run Win7, so they can still RDP to the workstation and handle basic web and email.  The tree-hugger in me likes this.  

An alternative to workstations...some people use VMs and then connect to those VMs with the netbook/laptop.  This is a good solution, but the libertarian in me prefers a dedicated workstation.  However, the ability to save a VM image after I have it configured to my needs is appealing.  Companies seem to be getting "cheap" with the VM hosts/hypervisors...using tricks like memory ballooning and CPU overcommitment to squeeze more VMs onto a physical host.  I don't like this trend when it's done blindly, without concern for performance, productivity, and end-user need, so I prefer the dedicated workstation.  

What do you use? What is your preference?

Data deduplication for the data architect

Data deduplication is definitely one of the hot IT buzzwords right now.  Data deduplication is a data compression method for removing redundant data from persisted storage.  The typical example is a large pdf attached to an email and sent to 20 people, resulting in 20 copies of the file on the Exchange Server.  Data deduplication techniques save one copy of the file with pointers to the others.  There are many, many other methods of doing this.
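Conceptually it's just single-instance storage plus pointers.  A minimal sketch in T-SQL terms (the table and column names here are hypothetical, purely to illustrate the idea) might look like this: attachments are stored once, keyed by a content hash, and each mailbox row simply points to the shared copy.

    -- Hypothetical sketch: single-instance storage of attachments keyed by a content hash.
    CREATE TABLE dbo.AttachmentBlob (
        ContentHash  VARBINARY(32)  NOT NULL PRIMARY KEY,  -- e.g. SHA-256 of the file bytes
        FileBytes    VARBINARY(MAX) NOT NULL
    );

    CREATE TABLE dbo.MailboxAttachment (
        MailboxID    INT            NOT NULL,
        MessageID    INT            NOT NULL,
        FileName     NVARCHAR(260)  NOT NULL,
        ContentHash  VARBINARY(32)  NOT NULL
            REFERENCES dbo.AttachmentBlob (ContentHash),   -- pointer to the single stored copy
        PRIMARY KEY (MailboxID, MessageID, FileName)
    );

    -- 20 recipients of the same pdf = 20 small pointer rows, but only 1 blob row.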

What is this NoSQL thing?

You've probably seen NoSQL being bandied about on the web and in the IT press.  Whenever we see a "new" technology trend get repeated press coverage, we know we need to bone up on that hot new thing.  First of all, there is a NoSQL RDBMS that runs on Unix, but that isn't what is being hyped right now.  The "NoSQL" being hyped is really just a concept: we can store data without using an RDBMS, the SQL language, or even relational modeling!  

What is it?

What is a Data Modeler?

It seems like there are a lot of job postings for data architects that require "data modeling" skills.  They then usually ask for x years of experience with ERWin.  I think this is a very superficial look at what data modeling really is.  This is my list of data modeling skills, in order from most to least important:

How to save a *lot* of money with SQL Server Express

Many years ago I worked for a company that wanted to spend ZERO money on SQL Server CALs.  Unfortunately, the COTS package they wanted to deploy would likely require at least 20 concurrent users.  I wrote a program that sat on the SQL Server, intercepting requests and forwarding them to SQL, then doing the same thing in reverse for the responses.  I never did quite get it to work perfectly, but imagine how much money that could have saved on CALs!!!  (Granted, that's a support nightmare waiting to happen.)  Microsoft has since said that such multiplexing arrangements do not alleviate the need for CALs (darn).  

Many people believe software licensing costs are only going to rise.  I'm not in that camp.  First, we have open-source projects.  Second, vendors are actually "giving away" scaled-down versions of their software in an effort to hook you in and get you to upgrade to the costlier editions.  From an economics perspective, technology is one of the few sectors of the economy in the last 50 years where prices continue to drop in real (inflation-adjusted) terms while quality and productivity improve.  Many economists have argued that the productivity boom and low-inflation environment of the 1990s was driven largely by the technology sector, and the argument seemed borne out when the dot-com bubble burst at the turn of the century.  

Back to SQL Server and saving money.  On my old blog my article Express vs Standard was both popular and controversial.  Using totally legal tricks in SQL Express, I showed that many SQL instances/databases did not warrant the expense of a Standard Edition license and that Express was sufficient.  The biggest barrier to adopting Express in many cases is the 4GB database size limit.  You'll be happy to know that the limit has increased to 10GB in SQL 2008 R2.  I also cover some cool tricks using synonyms to eliminate location affinity in your db, which entirely (and legally) removes the database size limitations of Express.  Some then argue that the cache limitations of Express will cripple a production deployment.  I cover some ways to overcome this too, again legally.  
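I can't reproduce the whole article here, but the synonym idea is roughly this (the database and table names below are invented for illustration): move large tables into additional databases, each staying under the Express size cap, and use synonyms so application code keeps querying one logical name.

    -- Illustrative only: archive data lives in a second Express database,
    -- keeping each individual database under the Express size limit.
    CREATE DATABASE SalesArchive;
    GO

    -- Application code continues to reference dbo.OrderHistory in the main db;
    -- the synonym removes the "location affinity" by pointing elsewhere.
    CREATE SYNONYM dbo.OrderHistory
        FOR SalesArchive.dbo.OrderHistory;
    GO

    SELECT TOP (10) *
    FROM dbo.OrderHistory;   -- resolves to SalesArchive.dbo.OrderHistory at runtime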

Go read the article.  

Doing Database CIT via TFS

Continuous Integration Testing (CIT) is an agile concept where you are constantly testing your code checkins to ensure they build, compile, and pass their unit tests.  In smaller shops, notification of failures is usually just an email to an SCM, who then tracks down what broke using the version control system and notifies that person.  In larger shops a special CIT machine sits in the scrum room with a big red or green light indicating whether the builds are passing CIT.  When it goes red, someone usually steps up to determine the cause. 

In most shops this CIT concept is a no-brainer for the application code (tools like Apache Ant and CruiseControl really help with the automation and scheduling), but it is rarely done for the database scripts.  One reason is that most applications have a database build process that doesn't lend itself easily to CIT-style processing (for instance, requiring db-level object diffs to determine what needs to be applied).  My database build process is more evolutionary, so by definition it lends itself to CIT and change quite nicely.  

This post is the next in a series showing useful TFS tips for the data professional.  I'll cover how I created a very simple script that handles database CIT using TFS as the source control repository.  You could plug in a different VCS quite easily.  It uses vbscript and batch files, but PowerShell or a compiled application would work just as well; the concept is the important thing.   

I have the actual files available here for download.  Here is the process:

A controlling batch file (called RunCIT.bat) is scheduled using AT.exe or a similar scheduler; this handles the automation component.  Let's look at some of the interesting points of the batch file...

Lines 1 - 14...the batch file has a series of parameters that tell it which server to run the db scripts against, user/pwd, location of the logs, and the TFS connection string.  

It then builds a loop (line 19) and checks "branches.txt" (a csv file) for the list of branches we want to run CIT against.  In our environment we have at least 5 active code/release branches at any time, so I want to run CIT against all of them.  Our SCM can change this file whenever a new branch is created or deprecated.

As a side note, if you've ever tried to do looping in a batch file, you know it's a real PITA.  If you've ever tried to expand your DOS variables within that loop, you know it's almost impossible.  I found a nifty workaround using "SETLOCAL ENABLEDELAYEDEXPANSION" and an alternative variable naming scheme.  See Lines 3 and 27 for examples. 

Within the loop we then run a series of tf.exe commands to Get Latest for the source code for that branch.  This means that your build machine needs to have Visual Studio loaded on it (unless someone knows how to get the tf.exe commands to run without it).  

The scripts for the given branch are then executed, first against a "net new" database and then against a "ported" (upgraded) database, each set up via manipulate_db.v-b-s.  I've found that naming a vbscript file as anything other than .vbs and then calling it with cscript/wscript is an easy way to bypass UAC on newer Windows OSs.

Why run CIT against a "new" and "upgraded" database?

Unlike C# code, where I simply need to test that it compiles and my unit tests succeed, a db build is a little trickier because I need to be concerned with data history and upgrading data.  If I'm changing a UNIQUE INDEX from ColA,ColB to just ColA, I can't just CREATE INDEX WITH DROP_EXISTING, since it's possible an existing db will have violations.  I need to test that.  If I'm adding ColC to a table, I need to remember to add it to the CREATE TABLE statement (for new databases) and write an appropriate ALTER for upgraded dbs.  In each case the "new" and "upgraded" databases take different codepaths through the sql scripts. 
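My actual scripts aren't shown here, but a hedged sketch of what those two codepaths tend to look like (dbo.Widget and the index names are invented for the example) is:

    -- Illustrative upgrade script: new databases get the full CREATE TABLE,
    -- existing (ported) databases get an ALTER that preserves their data.
    IF OBJECT_ID('dbo.Widget', 'U') IS NULL
    BEGIN
        CREATE TABLE dbo.Widget (
            ColA INT         NOT NULL,
            ColB INT         NOT NULL,
            ColC VARCHAR(50) NULL          -- new column is part of the baseline DDL
        );
    END
    ELSE IF COL_LENGTH('dbo.Widget', 'ColC') IS NULL
    BEGIN
        ALTER TABLE dbo.Widget ADD ColC VARCHAR(50) NULL;   -- upgrade path
    END
    GO

    -- Narrowing a unique index from (ColA, ColB) to (ColA): an upgraded customer db
    -- may contain duplicates, so the script must prove otherwise before rebuilding.
    IF EXISTS (SELECT ColA FROM dbo.Widget GROUP BY ColA HAVING COUNT(*) > 1)
        RAISERROR('Duplicate ColA values exist; cannot narrow the unique index.', 16, 1);
    ELSE
    BEGIN
        IF EXISTS (SELECT 1 FROM sys.indexes
                   WHERE name = 'UX_Widget_ColA_ColB'
                     AND object_id = OBJECT_ID('dbo.Widget'))
            DROP INDEX UX_Widget_ColA_ColB ON dbo.Widget;
        CREATE UNIQUE INDEX UX_Widget_ColA ON dbo.Widget (ColA);
    END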

So your upgraded or "ported" db should probably be a customer db with some substantial data.  You can even run against multiple upgraded dbs, perhaps one from each version of your software that you may have to upgrade. 



But the key is Line 73, GenerateDiscrepancyReport.vbs.  This compares the net-new vs ported dbs and reports on missing (or schematically different) tables, views, columns, and indexes (different cluster keys, columns), and it will even report data differences if you have defined system or master data tables that you ship with your db.  This is handled via 2 stored procs (here and here). 
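The linked procs aren't reproduced here, but the heart of a schema discrepancy check is simple enough to sketch.  Assuming placeholder database names NetNew and Ported, comparing column metadata in both directions looks something like this:

    -- Columns present in the net-new build but missing (or typed differently) in the ported db.
    SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE
    FROM   NetNew.INFORMATION_SCHEMA.COLUMNS
    EXCEPT
    SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE
    FROM   Ported.INFORMATION_SCHEMA.COLUMNS;

    -- And the reverse direction: anything in the ported db that the net-new build doesn't have.
    SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE
    FROM   Ported.INFORMATION_SCHEMA.COLUMNS
    EXCEPT
    SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE
    FROM   NetNew.INFORMATION_SCHEMA.COLUMNS;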

Email is sent on success and failure with the logs attached. 

Happy Testing!!

Data Virtualization

For some reason "data virtualization" is one of the new hot buzzwords in my industry.  Data virtualization is really just combining a bunch of data sources (databases, unstructured data, DWs, weblogs, whatever) into a single virtual data tier that provides access to applications.  Haven't we been doing this for years?  I remember when linked servers came out for SQL Server.  I thought they were the coolest thing at the time.  I could now connect to Oracle from my SQL Server and manipulate the data in the more familiar TSQL, in real time, without the need for an ETL process.
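For anyone who hasn't used them, the linked server approach looks something like this (the linked server name ORA_FINANCE and the Oracle table are made up for illustration): query the remote data with a four-part name, or push the whole query to Oracle with OPENQUERY.

    -- Four-part naming: SQL Server fetches the remote rows and joins them locally.
    SELECT c.CustomerName, o.INVOICE_TOTAL
    FROM   dbo.Customer AS c
    JOIN   ORA_FINANCE..AR.INVOICES AS o     -- linked server name is hypothetical
           ON o.CUSTOMER_ID = c.CustomerID;

    -- OPENQUERY: the query text runs on the Oracle side, only the results come back.
    SELECT *
    FROM OPENQUERY(ORA_FINANCE,
         'SELECT CUSTOMER_ID, SUM(INVOICE_TOTAL) AS TOTAL
          FROM AR.INVOICES
          GROUP BY CUSTOMER_ID');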

column-oriented databases

I briefly wrote about these a few weeks ago in this blog post.  I believe Michael Stonebraker is credited with "inventing" this technology, although it's really been around since the 1970s.  I was tasked with evaluating column-oriented databases after management read about them in a trade journal and thought they might be applicable for our fast-growing analytics system.  They're possibly right: fast data access is the cornerstone.  

What are these things?

Whatever it is, I'm against it

I'm told at work that I'm averse to new technologies.  Well, here's what me and Groucho have to say about that.  

