It takes about 5 minutes to spin up a Databricks cluster. Notebook code can also be "scheduled" as a job, which cuts down on the rework needed to get your Spark code into a shell script.
The big problem…Databricks only ran on AWS. Not anymore. Toward the end of 2017, Databricks introduced an Azure offering.
Here’s the next problem…while Azure Databricks is now a thing, there are no Azure-specific demos. Every demo I’ve seen, even the ones on Microsoft’s own site, uses Amazon S3 to store datasets. That makes basic setup and data ingestion a bit difficult.
So I wrote my own demos that I use for customers.
First, here’s a quick overview to help you determine if Databricks/Spark is right for you.
I have created a basic notebook demo using Python. With a working cluster you can download and install my demo and begin doing basic data discovery. I show you how data scientists approach new datasets they’ve never seen before by exploring some real data that I have in Azure. You’ll learn about the SparkContext, SQLContext, and SparkSession.
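To give you a feel for those objects before you open the notebook, here is a minimal sketch of how they relate to each other. In a Databricks Python notebook, `spark`, `sc`, and `sqlContext` are already created for you; the builder call below is only needed if you run the code outside Databricks.

```python
# In a Databricks notebook these objects are pre-created; the builder is a
# fallback for running the same code elsewhere.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intro-demo").getOrCreate()

# The SparkContext (and the older SQLContext) hang off the session.
sc = spark.sparkContext
print(sc.version)                      # Spark version of the cluster
print(spark.catalog.listDatabases())   # databases visible to this session
```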
We’ll also explore the sample datasets Databricks provides for you. You’ll learn how to explore your data in Python and SparkSQL, display graphs, and save datasets to an optimized Parquet format.
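Here’s a rough sketch of that workflow. The sample file path under `/databricks-datasets/` is an assumption (browse what your workspace actually has with `%fs ls /databricks-datasets/`); the rest is standard DataFrame and `display()` usage.

```python
# Read one of the Databricks-provided sample datasets (path assumed).
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/databricks-datasets/samples/population-vs-price/data_geo.csv"))

df.printSchema()
display(df)          # Databricks renders this as a table or chart

# Persist an optimized copy as Parquet for faster subsequent reads.
df.write.mode("overwrite").parquet("/tmp/demo/data_geo_parquet")
```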
The second demo explores something more real-world…investigating customer churn at a telco. Click the link to learn more about it.
For this demo I use a Kaggle dataset. You will likely need to upload the dataset (it’s small) to WASB or ADLS (instructions are in the git repo). I discuss how to use both, but my suggestion is to use WASB. All documentation is inline in the notebook; just download it and re-upload it to your Databricks environment.
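If you go the WASB route, reading the file looks roughly like this. The storage account, container, secret scope, and file name below are placeholders for your own values, and I’m assuming you’ve stored the storage key in a Databricks secret scope rather than pasting it into the notebook.

```python
# Sketch of reading the (small) Kaggle churn CSV straight from WASB.
storage_account = "mystorageacct"   # hypothetical storage account name
container = "demo"                  # hypothetical container name

spark.conf.set(
    "fs.azure.account.key." + storage_account + ".blob.core.windows.net",
    dbutils.secrets.get(scope="demo", key="storage-key"))  # assumes a secret scope

churn_path = ("wasbs://" + container + "@" + storage_account +
              ".blob.core.windows.net/churn.csv")
churn = (spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv(churn_path))
```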
We again start just like a typical data scientist would begin. We explore and learn the “shape” of the data by “inferring” the schema and loading a small subset to visualize. I’ll show you how to switch from Python to SparkSQL whenever you want to change environments.
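The switch is simpler than it sounds: register the DataFrame as a temp view and then use the `%sql` magic in the next cell. The `Churn` column name in the SQL example is an assumption based on the common Kaggle telco schema.

```python
# Take a quick look at a small slice, then expose the data to SparkSQL.
display(churn.limit(10))                 # small sample, rendered as a table
churn.createOrReplaceTempView("churn")   # now queryable from a %sql cell

# In the next notebook cell, switch languages with the %sql magic:
#   %sql
#   SELECT Churn, COUNT(*) AS customers FROM churn GROUP BY Churn
```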
We’ll then use a classifier algorithm to predict whether a customer will churn. We’ll identify the label (what we are trying to predict) and decide which columns make the best features for the predictions.
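As a rough sketch of what that looks like in spark.ml: index the label, assemble the feature columns into a vector, and fit a classifier. The column names (`Churn`, `tenure`, `MonthlyCharges`, `TotalCharges`) and the choice of logistic regression are assumptions for illustration; the notebook works through the actual columns in the dataset.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

label_indexer = StringIndexer(inputCol="Churn", outputCol="label")
assembler = VectorAssembler(
    inputCols=["tenure", "MonthlyCharges", "TotalCharges"],  # assumed numeric
    outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")

train, test = churn.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[label_indexer, assembler, lr]).fit(train)
predictions = model.transform(test)
```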
How did our algo perform? We’ll look at the confusion matrix and accuracy to see how predictive our model really is.
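A quick way to get both, continuing the sketch above (the `predictions` DataFrame and column names come from that hypothetical pipeline):

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
print("accuracy:", evaluator.evaluate(predictions))

# A crosstab of actual vs. predicted labels serves as the confusion matrix.
display(predictions.crosstab("label", "prediction"))
```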
I’d really like to create some demos around how to partition large datasets in Spark/Databricks. Right now performance can be pitiful on YUGE datasets that are persisted to HDFS/WASB/ADLS and use a different partition scheme than the query patterns require. This involves reading and understanding DAG query plans. I find these topics to be fascinating.
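The short version of what such a demo would cover: write the data partitioned by the column your queries filter on, then check the physical plan with `explain()` to confirm partitions are actually being pruned. The paths and column names here are placeholders.

```python
# Hypothetical large dataset with a partition scheme that doesn't match queries.
big = spark.read.parquet("/mnt/data/events")

# Rewrite it partitioned by the column the queries filter on most often.
(big.repartition("event_date")
    .write.mode("overwrite")
    .partitionBy("event_date")
    .parquet("/mnt/data/events_by_date"))

# Queries filtering on the partition column can now prune partitions;
# explain() shows the physical plan (the DAG), including PartitionFilters.
(spark.read.parquet("/mnt/data/events_by_date")
      .filter("event_date = '2018-01-01'")
      .explain(True))
```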
Good luck
Thanks for reading. If you found this interesting please subscribe to my blog.
Dave Wentzel