Databricks natively stores its notebook files as DBC files by default, a closed, binary format. A .dbc file has the nice benefit of being self-contained: one .dbc file can hold an entire folder of notebooks and supporting files. But other than that, .dbc files are frankly obnoxious.
- I can’t view them outside of the Databricks notebook/workspace UI
- The native nbviewer in GitHub doesn’t recognize them (nbviewer is what lets GitHub render ipynb files, i.e. Jupyter notebook files)
- They are binary, which makes git diffs impossible
- By contrast, ipynb files are JSON. I posit that they aren’t really human-readable either, but I can at least do a modicum of diffing/merging if I need to.
The git integration in the Databricks UI is passable, but lacking. One example: each notebook must be saved as a separate commit, even though any given feature or bug fix may span multiple notebooks.
So, what’s the solution?
I’ve never seen this published before, but I was poking around the Databricks CLI and noticed it can actually do all of this for you. Here are a few sample commands that demonstrate some ways of handling the notebook lifecycle using the CLI:
    pip3 install --upgrade databricks-cli
    databricks --version

    # let's use a personal access token
    # databricks|user settings|access tokens
    # https://eastus2.azuredatabricks.net
    databricks configure --token

    databricks workspace -h
    databricks workspace ls /Users/email@example.com
    databricks workspace ls /Users/firstname.lastname@example.org/OpenHack

    # default format is SOURCE, also the only(???) format for export_dir
    databricks workspace export_dir /Users/email@example.com/OpenHack .
    databricks workspace export --format JUPYTER /Users/firstname.lastname@example.org/OpenHack/01_useful_functions .
All we really need to do is download our notebooks with a little sh script and then call git commit, etc., as needed.
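Here is a minimal sketch of what that little script might look like. It assumes the CLI has already been configured with `databricks configure --token` and that you run it from inside a git repository. The function name `sync_notebooks`, the local folder, and the commit message are my own placeholders, not anything the CLI provides.

```shell
#!/bin/sh
# Sketch: export a workspace folder and commit every changed notebook at once.
# Assumes the databricks CLI is configured and the current directory is a git repo.

# Usage: sync_notebooks WORKSPACE_DIR LOCAL_DIR [COMMIT_MESSAGE]
# (sync_notebooks is a hypothetical helper name, not part of the CLI)
sync_notebooks() {
    workspace_dir="$1"
    local_dir="$2"
    message="${3:-Sync notebooks from Databricks}"

    mkdir -p "$local_dir" || return 1

    # export_dir writes notebooks in SOURCE format; -o overwrites local copies
    databricks workspace export_dir -o "$workspace_dir" "$local_dir" || return 1

    # Stage everything that changed under the export folder
    git add -A "$local_dir" || return 1

    # git diff --cached --quiet exits 0 when nothing is staged
    if git diff --cached --quiet; then
        echo "No notebook changes to commit."
    else
        git commit -m "$message"
    fi
}

# Example call (the path and message are placeholders):
# sync_notebooks /Users/email@example.com/OpenHack ./notebooks "Export OpenHack notebooks"
```

Because the whole folder is exported before anything is staged, a change that spans several notebooks lands in a single commit, which is exactly what the UI integration won’t let me do.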
Dave Wentzel CONTENT