Said differently, if I see 500 lines of R or python code, about 10% of it (50 lines) will actually be data wrangling code that can’t be done in SQL or through direct calls to manipulate the model. i.e., this is code that really MUST be done in a vectorized language.
PRO TIP: SQL is a more scalable, faster method to do ETL than any vectorized language. To get your python/pandas/R/dataframe code to scale, try to do as much in the data tier as you can.
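To make that concrete, here is a minimal sketch of the difference between wrangling on the client and pushing the same work into the data tier. The sales table, its columns, and the Postgres connection string are all hypothetical, not from any real engagement.

```python
# Hypothetical example: push the GROUP BY into the database instead of pandas.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@host/db")  # hypothetical connection

# Anti-pattern: pull every row over the wire, then wrangle on the client.
raw = pd.read_sql("SELECT region, sale_date, amount FROM sales",
                  engine, parse_dates=["sale_date"])
raw["month"] = raw["sale_date"].dt.to_period("M")
monthly = raw.groupby(["region", "month"])["amount"].sum().reset_index()

# Better: let the database do the set-based work and return only the result.
monthly = pd.read_sql(
    """
    SELECT region,
           date_trunc('month', sale_date) AS month,
           SUM(amount)                    AS amount
    FROM   sales
    GROUP  BY region, date_trunc('month', sale_date)
    """,
    engine,
)
```

The second version returns only the aggregated rows, so the client-side code never has to hold the raw data in memory.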
What these customers generally need is someone who can read (and understand) R/python/SAS and can do general process improvement.
That’s it. That’s Use Case 1.
The above is the most common use case.
Data scientists, in general, are not trained to be SQL developers. They are really, really good at set theory, and manipulating a pandas dataframe is semantically identical (and syntactically similar) to SQL…they just aren’t comfortable writing SQL. And any SQL developer is likely scared off by R and python. They shouldn’t be; these skills are easy to learn.
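That semantic overlap is easy to see side by side. Here is a tiny, hypothetical example (the orders data and the threshold are made up) of the same query expressed in SQL and in pandas:

```python
import pandas as pd

# Hypothetical orders data.
orders = pd.DataFrame({
    "customer": ["a", "a", "b", "c", "c", "c"],
    "status":   ["shipped", "shipped", "returned", "shipped", "shipped", "shipped"],
    "total":    [20.0, 35.0, 15.0, 10.0, 55.0, 40.0],
})

# SQL:  SELECT customer, COUNT(*) AS n, SUM(total) AS revenue
#       FROM   orders
#       WHERE  status = 'shipped'
#       GROUP  BY customer
#       HAVING SUM(total) > 50
#       ORDER  BY revenue DESC;

# pandas: the same WHERE / GROUP BY / HAVING / ORDER BY pipeline.
result = (
    orders[orders["status"] == "shipped"]                 # WHERE
    .groupby("customer")                                   # GROUP BY
    .agg(n=("total", "size"), revenue=("total", "sum"))    # COUNT(*), SUM()
    .query("revenue > 50")                                  # HAVING
    .sort_values("revenue", ascending=False)                # ORDER BY
    .reset_index()
)
print(result)
```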
It’s generally easier to have a SQL developer learn R than it is to have an R developer learn SQL. Vectorized-language developers seem to fall back to bad habits quickly. A SQL developer tends to look holistically at the code and figure out how to express it better in a declarative language.
These are fun engagements for all parties. Everyone learns and everyone is happy. I’ve taken SAS processes that ran for 48 hours and compacted them down to just a few minutes by pre-processing the data frame with SQL.
I actually had a client where one of their data scientists ran all of the “production” analytics on his laptop using SAS. If he went on vacation, reports were delayed. They tried putting his code on a VM but could never get all of the SAS dependencies to work.
His laptop was getting old.
Management was getting concerned.
He went on vacation once and they actually wrapped his laptop in foam padding and duct tape.
I’m not kidding.
This isn’t the first time I’ve seen this.
Data scientists tend not to make great developers. They tend to run their analytics on underpowered laptops where they are also running Slack, Outlook, watching youtube, etc. They don’t always have their code checked into git, preferring instead to back up their code to a network share.
It’s often difficult to get R or python loaded and dependencies resolved without a lot of trial and error.
Microsoft provides an excellent generic solution called the DSVM (Data Science Virtual Machine). It comes with a lot of the latest “data science” tooling installed. A data scientist can spin this up, do their analytics, and if they break an installation they just destroy the VM and build a new one.
This is a great solution if you use Azure. But the general idea is perfect. Show a data scientist how to use Docker and put their tooling there. It’s easy to take a dockerfile, replicate it for a prod environment, and give it the additional compute and memory it needs to scale.
The hard part is getting that RStudio code to run in an automated, headless fashion. But that’s part of the solution.
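As a rough sketch of what that can look like (the base image version, package list, and script name here are assumptions, not any client’s actual setup), a dockerfile for a headless R job can be very small:

```dockerfile
# Hypothetical sketch: versions, packages, and script name are assumptions.
FROM rocker/r-ver:4.3.1

# Bake the package dependencies into the image so they resolve once, at build
# time, instead of on someone's laptop.
RUN R -e "install.packages(c('dplyr', 'DBI', 'odbc'), repos = 'https://cloud.r-project.org')"

WORKDIR /app
COPY forecast.R /app/

# Rscript runs the analysis headless; no RStudio session is required.
CMD ["Rscript", "forecast.R"]
```

Once that image builds locally, whoever runs production can build the same dockerfile on a bigger host and schedule it like any other container.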
Usually this is enough to get a process to scale from a laptop/dev environment to a production environment. But not always.
Generally this starts when the data scientist uses R or python on a subset of the data; then, when they try to process the data frame against the production data set, they get out-of-memory errors.
Vectorized languages like SAS, R, and python tend not to scale up. And many times the ML algorithm chosen can’t really be scaled OUT either. If you use R, you have the option of MS Machine Learning Server (née Revo ScaleR), which handles this much better, but it’s pricey and requires some code refactoring. If you use python/pandas you have more of a problem.
Generally, the solution is something like Hadoop/HDFS with R Server or pySpark. But again, this requires code refactoring. For example, python data scientists LOVE pandas, but a pandas data frame won’t distribute its work across a Spark cluster. You really need to replace your pandas data frames with Spark-native data frames. This requires expertise to help your data scientists understand the new processing paradigm. And even then they may need to learn Scala and ditch pySpark entirely.
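Here is a hedged sketch of what that refactor looks like in pySpark (the file path and column names are invented): the aggregation stays in a Spark-native DataFrame instead of being collected into pandas on the driver.

```python
# Hypothetical pySpark sketch of the refactor described above.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("scale-out-example").getOrCreate()

# Anti-pattern: read with Spark, then immediately pull the whole data set onto
# the driver as a pandas frame (this is where the out-of-memory errors return).
# pdf = spark.read.parquet("/data/claims").toPandas()
# summary = pdf.groupby("provider_id")["paid_amount"].sum()

# Spark-native: the work stays distributed across the cluster.
claims = spark.read.parquet("/data/claims")
summary = (
    claims
    .groupBy("provider_id")
    .agg(F.sum("paid_amount").alias("paid_amount"))
)
summary.write.mode("overwrite").parquet("/data/claims_by_provider")
```

Nothing in the Spark-native version ever materializes the full data set on one machine, which is exactly the behavior the pandas version can’t give you.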
Spark code is faster and more versatile in Scala than in pySpark. You’ll quickly find that simple things like data partitioning are implemented much better in Scala on Spark than with pySpark.
These are just a few case studies I’ve seen recently where I’ve helped data science teams scale their current processes. None of this is particularly difficult work. It’s all just process improvement and teaching your staff how to try new things.
Thanks for reading. If you found this interesting please subscribe to my blog.