Databricks is “managed Spark” that, prior to the start of 2018, was hosted exclusively on AWS. Spark is an Apache project that eliminates some of the shortcomings of Hadoop/MapReduce. It’s basically a Big Data processing engine. You write your code in a language like Scala, Python, or even SparkSQL.
Traditionally, learning Spark was cumbersome. Before you could begin ingesting your data or learning `pyspark`, you needed to configure a Spark (Hadoop) cluster.
That’s not entirely true: Spark can run in “standalone” mode on a single node. You don’t get any of the goodness of DAGs and distributed data, but it’s a good start.
I did a quick post on “What is Spark?” if it’s new to you.
HDInsight on Azure radically simplified the Spark setup learning curve. All you did was spin up an HDI cluster, wait about 40 minutes, and you were ready to log in.
But unfortunately you were then presented with a bash shell to run `spark-shell` (Scala), or you could start a Jupyter notebook. Developers, especially data scientists, liked the usability of the Jupyter notebook experience. Invariably, everyone wrote their code there.
But what happens when that code needs to be scheduled to run off-hours? And what if you only need your Spark cluster for a few hours a night for that batch processing? You needed to copy the relevant code from the `.ipynb` file into a shell script so it could be scheduled with `pyspark` (or whatever).
Databricks to the Rescue
Databricks solved most of these problems by managing the Spark cluster for you. You simply spin up Databricks, log in to its notebook experience (which is similar to Jupyter), and begin writing your code. It takes about 5 minutes to spin up a Databricks instance. Notebook code can also be “scheduled,” which cuts down on the rework needed to get your Spark code into a shell script.
The big problem: Databricks only ran on AWS. Not anymore. Around the end of 2017, Databricks introduced an Azure offering.
Here’s the next problem: while Azure Databricks is now a thing, there are no Azure-specific demos. Every demo I’ve seen, even the ones on Microsoft’s own site, uses Amazon S3 to store datasets. This makes setup and data ingestion a bit difficult.
So I wrote my own demos that I use for customers.
- Creating a Databricks instance in Azure just in case you’ve never done it.
- Navigating the Databricks Workspace. I’ll show you how to navigate your workspace and the Spark cluster UI.
I have created a basic notebook demo using Python. With a working cluster you can download and install my demo and begin doing basic data discovery. I show you how data scientists approach new datasets they’ve never seen before by exploring some real data that I have in Azure.
We’ll also explore the sample datasets Databricks provides for you. You’ll learn how to explore your data in `sparksql`, display graphs, and save datasets to an optimized format.
The second demo explores something more real-world…investigating customer churn at a telco. Click the link to learn more about it.
For this demo I use a Kaggle dataset. You will likely need to upload the dataset (it’s small) to WASB or ADLS (instructions are in the git repo). I discuss how to use both, but my suggestion is to use WASB. All documentation is inline in the notebook; just download it and re-upload it to your Databricks environment.
We again start just like a typical data scientist would begin. We explore and learn the “shape” of the data by “inferring” the schema and loading a small subset to visualize. I’ll show you how to switch between `pyspark` and `sparksql` whenever you want to change environments.
We’ll then use a classifier algorithm to predict whether a customer will churn. We’ll identify the label (what we are trying to predict) and we’ll decide which columns make the best features for the predictions.
How did our algo perform? We’ll look at the confusion matrix and accuracy to see how predictive our model really is.
I’d really like to create some demos around how to partition large datasets in Spark/Databricks. Right now performance can be pitiful on YUGE datasets that are persisted to HDFS/WASB/ADLS with a different partition scheme than the query patterns require. This involves reading and understanding DAG query plans. I find these topics to be fascinating.
Thanks for reading. If you found this interesting please subscribe to my blog.