Azure Databricks Demos

A great Azure managed Spark offering, now with a few good demos

Overview

Databricks is “managed Spark” that, prior to the start of 2018, was hosted exclusively on AWS. Spark is an Apache project that eliminates some of the shortcomings of Hadoop/MapReduce; it’s basically a Big Data processing engine. You write your code in a language like Scala or Python, or even in Spark SQL.

Traditionally, learning Spark was cumbersome. Before you could begin ingesting your data or learning PySpark, you needed to configure a Spark (Hadoop) cluster.

That’s not entirely true. Spark can run in local mode on a single node. You don’t get the goodness of distributed data and parallel execution across machines, but it’s a good start.
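If you just want to kick the tires, here’s a minimal sketch of a local-mode session (this assumes you’ve pip-installed the pyspark package; nothing here is Databricks-specific):

```python
# Minimal local-mode Spark session: one process, one machine, no cluster manager.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")          # run in-process, using all local cores
    .appName("local-demo")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()

spark.stop()
```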

I did a quick post on “What is Spark?” if it’s new to you.

HDInsight on Azure radically simplified the Spark setup learning curve. All you did was spin up an HDI cluster, wait about 40 minutes, and you were ready to log in.

But unfortunately you were then presented with a bash shell to run pyspark or spark-shell (Scala), or you could start a Jupyter notebook. Developers, especially data scientists, liked the usability of the Jupyter notebook experience. Invariably, everyone wrote their code there.

But what happens when that code needs to be scheduled to run off-hours? And what if you only need your Spark cluster for a few hours a night for that batch processing? You needed to copy the relevant code out of the .ipynb file into a script that could be scheduled and submitted with spark-submit (or similar).

Databricks to the Rescue

Databricks solved most of these problems by managing the Spark cluster for you. You simply spin up Databricks, log in to its notebook experience (which is similar to Jupyter), and begin writing your code. It takes about 5 minutes to spin up a Databricks instance. Notebook code can also be scheduled, which cuts down on the rework needed to get your Spark code into a shell script.

The big problem…Databricks only ran on AWS. Not anymore. Around the end of 2017, Databricks introduced an Azure offering.

Here’s the next problem…while Azure Databricks is now a thing, there are no Azure-specific demos. Every demo I’ve seen, even the ones on Microsoft’s own site, uses Amazon S3 to store datasets. That makes setup and data ingestion a bit difficult.

So I wrote my own demos that I use for customers.

Here’s a link to my git repo.

The Demos

First, here’s a quick overview to help you determine if Databricks/Spark is right for you.

Working Code

I have created a basic notebook demo using Python. With a working cluster you can download and install my demo and begin doing basic data discovery. I show you how data scientists approach new datasets they’ve never seen before by exploring some real data that I have in Azure. You’ll learn about the SparkContext, SQLContext, and SparkSession.
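In a Databricks Python notebook those entry points are already created for you, so a cell like this sketch (the object names `spark`, `sc`, and `sqlContext` are the ones Databricks pre-defines) is enough to poke at them:

```python
# Databricks pre-creates these objects in every Python notebook; you don't build your own.
print(spark.version)    # SparkSession -> the modern, unified entry point
print(sc.appName)       # SparkContext -> the lower-level RDD entry point
print(sqlContext)       # SQLContext   -> kept around for older code
```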

We’ll also explore the sample datasets Databricks provides for you. You’ll learn how to explore your data in Python and Spark SQL, display graphs, and save datasets to an optimized Parquet format.
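Here’s a hedged sketch of that flow (the sample path, view name, and output path are illustrative; browse /databricks-datasets/ to see what’s actually there):

```python
# The Databricks workspace ships with sample data under /databricks-datasets/.
display(dbutils.fs.ls("/databricks-datasets/"))

# Read one of the sample CSVs, letting Spark infer the schema (path is illustrative).
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/databricks-datasets/samples/population-vs-price/data_geo.csv"))

df.printSchema()
display(df.limit(10))            # display() renders tables and charts in Databricks

# Flip to SQL by registering a temp view.
df.createOrReplaceTempView("geo")
display(spark.sql("SELECT * FROM geo LIMIT 10"))

# Persist to an optimized columnar format.
df.write.mode("overwrite").parquet("/tmp/demos/geo_parquet")
```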

Customer Churn

The second demo explores something more real-world…investigating customer churn at a telco. Click the link to learn more about it.

Customer Churn

For this demo I use a Kaggle dataset. You will likely need to upload the dataset (it’s small) to WASB or ADLS (instructions are in the git repo). I discuss how to use both, but my suggestion is to use WASB. All documentation is inline in the notebook; just download it and re-upload it to your Databricks environment.
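Here’s roughly what pointing the notebook at WASB looks like (the storage account, key, container, and file names below are placeholders, not what’s in my repo):

```python
# Point Spark at the blob storage account holding the uploaded Kaggle CSV.
# Everything in angle brackets is a placeholder.
spark.conf.set(
    "fs.azure.account.key.<storage-account>.blob.core.windows.net",
    "<storage-account-key>")

churn_path = "wasbs://<container>@<storage-account>.blob.core.windows.net/churn.csv"
```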

We again start just like a typical data scientist would. We explore and learn the “shape” of the data by inferring the schema and loading a small subset to visualize. I’ll show you how to switch from Python to Spark SQL whenever you want to change environments.
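Something along these lines (it picks up `churn_path` from the WASB sketch above; the view name is arbitrary):

```python
# Let Spark infer the schema from the CSV, then eyeball a small slice.
churn = (spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv(churn_path))

churn.printSchema()
display(churn.limit(20))

# Register a temp view so the same data is queryable from SQL.
churn.createOrReplaceTempView("churn")
display(spark.sql("SELECT * FROM churn LIMIT 20"))

# Or switch the whole cell to SQL with the %sql magic in its own notebook cell:
# %sql
# SELECT * FROM churn LIMIT 20
```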

We’ll then use a classifier algorithm to predict whether a customer will churn. We’ll identify the label (what we are trying to predict) and we’ll decide which columns make the best features for the predictions.

How did our algo perform? We’ll look at the confusion matrix and accuracy to see how predictive our model really is.
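Sketched out, the modeling steps look roughly like this (it reuses the `churn` DataFrame from above; the column names `Churn`, `tenure`, and `MonthlyCharges` follow the usual Kaggle telco dataset and are assumptions here, and logistic regression is just one reasonable classifier choice):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Label: encode the Yes/No churn column as 0/1.
label_indexer = StringIndexer(inputCol="Churn", outputCol="label")

# Features: a couple of numeric columns that plausibly drive churn.
assembler = VectorAssembler(inputCols=["tenure", "MonthlyCharges"],
                            outputCol="features")

lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[label_indexer, assembler, lr])

train, test = churn.randomSplit([0.7, 0.3], seed=42)
model = pipeline.fit(train)
predictions = model.transform(test)

# Overall accuracy.
evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="accuracy")
print("accuracy:", evaluator.evaluate(predictions))

# A quick confusion matrix: actual label vs. predicted label counts.
display(predictions.groupBy("label", "prediction").count())
```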

Next Steps

I’d really like to create some demos around how to partition large datasets in Spark/Databricks. Right now performance can be pitiful on YUGE datasets that are persisted to HDFS/WASB/ADLS with a different partition scheme than the query patterns require. This involves reading and understanding DAG query plans. I find these topics to be fascinating.
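To give a flavor of where that demo would go, here’s a hedged sketch of partitioning on write so the storage layout matches the query pattern (`big_df`, the mount path, and the `event_date` column are all illustrative):

```python
out_path = "/mnt/demo/events_partitioned"

# Write one folder per distinct event_date under out_path.
(big_df.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet(out_path))

# A filter on the partition column lets Spark prune whole folders instead of
# scanning everything; explain() shows the plan it will actually execute.
pruned = spark.read.parquet(out_path).where("event_date = '2018-01-01'")
pruned.explain()
```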

Good luck


Thanks for reading. If you found this interesting, please subscribe to my blog.