“Help us scale our data science team”
At least four customers have asked me this exact question.
Generally I see mid-size companies that have some data scientists on staff but are struggling with their predictive analytics, either because:
- they need their algorithms to work predictively on data coming through their systems and what they have now is reactive analytics coming from “yesterday’s” data (velocity problem)
- they need help scaling their algorithms based on volumes of data (volume problem)
This Case Study will have a lot of generalizations. But I don’t think I’ve OVER generalized. YMMV.
When I take a look at their R, python, or SAS code, here is what I generally see:
- connect to database and read out table or view to a csv file
- load and join multiple csv files into a data frame
- perform other data wrangling (née ETL) tasks to build model features
- run data frame through the model and generate new data (predictions)
- save new data to csv
- ask DBA to load csv data back into database
Said differently, if I see 500 lines of R or python code, only about 10% (50 lines) will actually be data wrangling that can’t be done in SQL, or direct calls to manipulate the model. In other words, only that 10% is code that really MUST be written in a vectorized language.
PRO TIP: SQL is a more scalable, faster method to do ETL than any vectorized language. To get your python/pandas/R/dataframe code to scale, try to do as much in the data tier as you can.
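To make the tip concrete, here is a minimal sketch of pushing the join and aggregation into the data tier instead of exporting tables to csv and joining them in a data frame. It uses Python's built-in sqlite3 module with an in-memory database purely for illustration. The table names (customers, orders), the columns, and the build_features helper are all hypothetical, not taken from any client's code.

```python
import sqlite3


def demo_connection():
    """Build a tiny in-memory database (hypothetical schema, illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                             customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
        INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.5), (12, 2, 7.0);
        """
    )
    return conn


def build_features(conn):
    """Join and aggregate in the database; return one feature row per customer."""
    sql = """
        SELECT c.customer_id,
               c.region,
               COUNT(o.order_id)            AS order_count,
               COALESCE(SUM(o.amount), 0.0) AS total_spend
        FROM customers AS c
        LEFT JOIN orders AS o ON o.customer_id = c.customer_id
        GROUP BY c.customer_id, c.region
        ORDER BY c.customer_id
    """
    # Only the finished feature rows leave the database, not the raw tables.
    return conn.execute(sql).fetchall()


features = build_features(demo_connection())
```

If the next step needs a data frame, the same query text could be handed to pandas.read_sql_query so that only the already-joined feature rows are loaded, with no intermediate csv files.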
What these customers generally need is someone who can read (and understand) R/python/SAS and can do general process improvement.
That’s it. That’s Use Case 1.
Use Case 1: Process Optimization
The above is the most common use case.
Data scientists, in general, are not trained to be SQL developers. They are really, really good at Set Theory, and manipulating a pandas dataframe is semantically identical (and syntactically similar) to SQL…they just aren’t comfortable doing it. And most SQL developers are likely to be scared off by R and python. They shouldn’t be; these skills are easy to learn.
It’s generally easier to have a SQL developer learn R than it is to have an R developer learn SQL. Vectorized language developers seem to want to fall back to bad habits quickly. A SQL developer tends to look holistically at the code and figure out how to express it better in a declarative language.
These are fun engagements for all parties. Everyone learns and everyone is happy. I’ve taken SAS processes that ran for 48 hours and cut them down to just a few minutes by pre-processing the data frame using SQL.
Use Case 2: “Our data scientists can’t run the process on their laptops anymore”
I actually had a client where one of their data scientists ran all of the “production” analytics on his laptop using SAS. If he went on vacation, reports were delayed. They tried putting his code on a VM but could never get all of the SAS dependencies to work.
His laptop was getting old.
Management was getting concerned.
He went on vacation once and they actually wrapped his laptop in foam padding and duct tape.
I’m not kidding.
This isn’t the first time I’ve seen this.
Data scientists tend not to make great developers. They tend to run their analytics on underpowered laptops where they are also running Slack and Outlook, watching YouTube, etc. They don’t always have their code checked into git, instead preferring to back up their code to a network share.
It’s often difficult to get R or python loaded and dependencies resolved without a lot of trial and error.
Microsoft provides an excellent generic solution called the DSVM (Data Science Virtual Machine). It comes with a lot of the latest “data science” tooling installed. A data scientist can spin this up, do their analytics, and if they break an installation they just destroy the VM and build a new one.
This is a great solution if you use Azure. But the general idea works anywhere. Show a data scientist how to use Docker and put their tooling there. It’s easy to take a Dockerfile and replicate it for a prod environment and give it the additional compute and memory it needs to scale.
The hard part is getting that RStudio code to run in an automated, headless fashion. But that’s part of the solution.
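As a rough illustration of what "headless" means in practice, here is a minimal Python sketch of a scoring job that takes everything it needs from command-line arguments, logs instead of printing to an interactive console, and returns an exit code a scheduler or container runtime can act on. The same shape applies to an R script run outside RStudio. The score function, the --input/--output flags, and the id/amount columns are hypothetical placeholders, and the csv input is only a stand-in for whatever the real job reads.

```python
import argparse
import csv
import logging


def score(row):
    # Hypothetical stand-in for the real model call.
    return float(row["amount"]) * 2.0


def run(input_path, output_path):
    """Read feature rows, score each one, write predictions. Returns rows written."""
    written = 0
    with open(input_path, newline="") as src, \
            open(output_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.writer(dst)
        writer.writerow(["id", "prediction"])
        for row in reader:
            writer.writerow([row["id"], score(row)])
            written += 1
    return written


def main(argv=None):
    """Parse arguments, run the job, and return a process exit code."""
    parser = argparse.ArgumentParser(description="Headless scoring job (sketch)")
    parser.add_argument("--input", required=True, help="path to feature rows")
    parser.add_argument("--output", required=True, help="path for predictions")
    args = parser.parse_args(argv)

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    try:
        count = run(args.input, args.output)
    except Exception:
        # Log the traceback and signal failure instead of dying silently.
        logging.exception("scoring run failed")
        return 1
    logging.info("wrote %d predictions to %s", count, args.output)
    return 0
```

In a real script you would end the file with the usual `if __name__ == "__main__":` guard calling `sys.exit(main())`, so the container's entry point gets a non-zero exit code when the run fails.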
Usually this is enough to get a process to scale from a laptop/dev environment to a production environment. But not always.
Use Case 3: “We have so much data that R or python can never scale”
Generally this starts when the data scientist develops in R or python on a subset of the data and then gets out-of-memory errors when processing the data frame on the production data set.
Vectorized languages like SAS, R, and python tend not to scale up. And many times the ML algorithm chosen can’t really be scaled OUT either. If you use R then you have the ability to use Microsoft Machine Learning Server (née Revo ScaleR), which can handle this much better. But it’s pricey and requires some code refactoring. If you use python/pandas you have more of a problem.
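One thing that may be worth checking before any larger refactoring: if the model scores each row independently, the scoring step does not need the whole data set in memory at once. The sketch below is an assumption-laden illustration of that idea, not a replacement for the cluster-based options. It reads feature rows from the database in fixed-size batches and writes predictions straight back to a table, which also avoids the csv round trip. It uses Python's built-in sqlite3 module, and the features and predictions tables, the score function, and score_in_batches are all hypothetical names.

```python
import sqlite3


def score(amount):
    # Hypothetical stand-in for a model that scores one row at a time.
    return amount * 2.0


def score_in_batches(conn, batch_size=1000):
    """Score the features table batch by batch; return the number of rows written."""
    reader = conn.cursor()
    writer = conn.cursor()
    reader.execute("SELECT id, amount FROM features ORDER BY id")
    total = 0
    while True:
        # Only batch_size rows are held in memory at any one time.
        batch = reader.fetchmany(batch_size)
        if not batch:
            break
        writer.executemany(
            "INSERT INTO predictions (id, prediction) VALUES (?, ?)",
            [(row_id, score(amount)) for row_id, amount in batch],
        )
        total += len(batch)
    # Commit once, after the read cursor is exhausted.
    conn.commit()
    return total


# Tiny in-memory demo with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE features (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE predictions (id INTEGER PRIMARY KEY, prediction REAL);
    """
)
conn.executemany("INSERT INTO features VALUES (?, ?)",
                 [(i, float(i)) for i in range(1, 6)])
written = score_in_batches(conn, batch_size=2)
```

This only helps when rows can be scored independently. A model that needs the full data set at once still calls for the kind of refactoring described here.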
Generally, the solution is something like Hadoop/HDFS and an R Server or PySpark. But again, this requires code refactoring. For example, python data scientists LOVE using pandas. But pandas works horribly on Spark: a pandas data frame lives in memory on a single machine, so it can’t be distributed across the cluster. You really need to remove your pandas data frames and use Spark-native data frames. This requires expertise to help your data scientists understand this new processing paradigm. But even then they may need to learn Scala and ditch PySpark entirely.
Spark code is faster and more versatile using Scala than PySpark. You’ll quickly find that simple things like data partitioning are implemented much better in Scala on Spark than with PySpark.
These are just a few case studies I’ve seen recently where I help data science teams to scale their current processes. None of this is particularly difficult work. It’s all just process improvement and teaching your staff how to try new things.
Thanks for reading. If you found this interesting please subscribe to my blog.