What is Azure Databricks?

Azure Databricks is a data analytics platform that provides powerful computing capability, and the power comes from the Apache Spark cluster. In addition, Azure Databricks provides a collaborative platform for data engineers to share the clusters and workspaces, which yields higher productivity. Azure Databricks plays a major role in Azure Synapse, Data Lake, Azure Data Factory, etc., in the modern data warehouse architecture and integrates well with these resources.

Data engineers and data architects work together with data and develop the data pipeline for data ingestion with data processing. All data engineers work in a sandbox environment, and when they have verified the data ingestion process, the data pipeline is ready to be moved to Dev/Staging and Production.

Manually moving the data pipeline to staging/production environments via Azure portal will potentially introduce the difference in environments and add a tedious task to repeat manual processes in multiple environments. Automated deployment with service principal credentials is the only solution to move all your work to higher environments. There will be no privilege to configure via the Azure portal as a user. As data engineers complete the data pipeline, Cloud automation engineers will use IaC (Infrastructure as Code) to deploy all Azure resources and configure them via the automation pipeline. That includes all data related to Azure resources and Azure Databricks.

Data engineers work in Databricks with their user account, and it works very well integrating Azure Databricks with Azure key vault using key vault secret scope. All the secrets are persisted in key vault, and Databricks can get the secret value directly via linked service. Databricks uses user credentials to go against Keyvault to get the secret values. This does not work with service principal (SPN) access from Azure Databricks to the key vault. This functionality is requested but not yet there as per this GitHub issue.

Passionate about data? Check out our open data careers and apply to join our quickly growing team today!

Let’s Look at a Scenario

The data team has given automation engineers two requirements:

  • Deploy an Azure Databricks, a cluster, a dbc archive file which contains multiple notebooks in a single compressed file (for more information on dbc file, read here), secret scope, and trigger a post-deployment script.
  • Create a key vault secret scope local to Azure Databricks so the data ingestion process will have secret scope local to Databricks.

Azure Databricks is an Azure native resource, but any configurations within that workspace is not native to Azure. Azure Databricks can be deployed with Hashicorp Terraform code. For Databricks workspace-related artifacts, the Databricks provider needs to be added. For creating a cluster, use this implementation. If you are only uploading a single notebook file for creating a notebook, then use Terraform implementation like this. If not, there is an example below to use Databricks CLI to upload multiple notebook files as a single dbc archive file. The link to my GitHub repo for complete code is at the end of this blog post.

Terraform implementation

terraform {
  required_providers {
    azurerm = "~> 2.78.0"
    azuread = "~> 1.6.0"
    databricks = {
      source = "databrickslabs/databricks"
      version = "0.3.7"

  backend "azurerm" {
    resource_group_name  = "tf_backend_rg"
    storage_account_name = "tfbkndsapoc"
    container_name       = "tfstcont"
    key                  = "data-pipe.tfstate"

provider "azurerm" {
  features {}

provider "azuread" {

data "azurerm_client_config" "current" {

// Create Resource Group
resource "azurerm_resource_group" "rgroup" {
  name     = var.resource_group_name
  location = var.location

// Create Databricks
resource "azurerm_databricks_workspace" "databricks" {
  name                          = var.databricks_name
  location                      = azurerm_resource_group.rgroup.location
  resource_group_name           = azurerm_resource_group.rgroup.name
  sku                           = "premium"

// Databricks Provider
provider "databricks" {
  azure_workspace_resource_id = azurerm_databricks_workspace.databricks.id
  azure_client_id             = var.client_id
  azure_client_secret         = var.client_secret
  azure_tenant_id             = var.tenant_id

resource "databricks_cluster" "databricks_cluster" {
  depends_on              = [azurerm_databricks_workspace.databricks]
  cluster_name            = var.databricks_cluster_name
  spark_version           = "8.2.x-scala2.12"
  node_type_id            = "Standard_DS3_v2"
  driver_node_type_id     = "Standard_DS3_v2"
  autotermination_minutes = 15
  num_workers             = 5
  spark_env_vars          = {
    "PYSPARK_PYTHON" : "/databricks/python3/bin/python3"
  spark_conf = {
    "spark.databricks.cluster.profile" : "serverless",
    "spark.databricks.repl.allowedLanguages": "sql,python,r"
  custom_tags = {
    "ResourceClass" = "Serverless"

GitHub Actions workflow with Databricks CLI implementation

    needs: [terraform]
    name: 'Databricks Artifacts Deployment'
    runs-on: ubuntu-latest

    - uses: actions/checkout@v2.3.4
    - name: Set up Python 3.0
      uses: actions/setup-python@v2
        python-version: 3.0

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip

    - name: Download Databricks CLI
      id: databricks_cli
      shell: pwsh
      run: |
        pip install databricks-cli
        pip install databricks-cli --upgrade

    - name: Azure Login
      uses: azure/login@v1
        creds: ${{ secrets.AZURE_CREDENTIALS }}
    - name: Databricks management
      id: api_call_databricks_manage
      shell: bash
      run: |
        # Set DataBricks AAD token env
        export DATABRICKS_AAD_TOKEN=$(curl -X GET -d "grant_type=client_credentials&client_id=${{ env.ARM_CLIENT_ID }}&resource=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d&client_secret=${{ env.ARM_CLIENT_SECRET }}" https://login.microsoftonline.com/${{ env.ARM_TENANT_ID }}/oauth2/token | jq -r ".access_token")

        # Log into Databricks with SPN
        databricks_workspace_url="https://${{ steps.get_databricks_url.outputs.DATABRICKS_URL }}/?o=${{ steps.get_databricks_url.outputs.DATABRICKS_ID }}"
        databricks configure --aad-token --host $databricks_workspace_url

        # Check if workspace notebook already exists
        export DB_WKSP=$(databricks workspace ls /${{ env.TF_VAR_databricks_notebook_name }})
        if [[ "$DB_WKSP" != *"RESOURCE_DOES_NOT_EXIST"* ]];
          databricks workspace delete /${{ env.TF_VAR_databricks_notebook_name }} -r

        # Import DBC archive to Databricks Workspace
        databricks workspace import Databricks/${{ env.databricks_dbc_name }} /${{ env.TF_VAR_databricks_notebook_name }} -f DBC -l PYTHON

While the above example shows how to leverage Databricks CLI to do automation operations within Databricks, Terraform also provides richer capabilities with Databricks providers. Here is an example of how to add ‘service principal’ to Databricks ‘admins’ group in workspace using Terraform. This is essential for Databricks API to work when connecting as a service principal.

Databricks Creating Cluster
Databricks cluster deployed via Terraform
Jobs Deployed via Terraform
No Jobs have been deployed via Terraform
Databricks CLI
 Job deployed using Databricks CLI in GitHub Actions workflow
Deployment with Databricks
Job triggered via Databricks CLI in GitHub Actions workflow

Not just Terraform and Databricks CLI, but also Databricks API provides similar options to access Databricks artifacts and manage them. For example, to access the clusters in the Databricks:

  • To access clusters, first, authenticate if you are a workspace user via automation or using service principal.
  • If your service principal is already part of the workspaces admins group, use this API to get the clusters list.
  • If the service principal (SPN) is not part of the workspace, use this API that uses access and management tokens.
  • If you would rather add the service principal to Databricks admins workspace group, use this API (same as Terraform option above to add the SPN).

The secret scope in Databricks can be created using Terraform or using Databricks CLI or using Databricks API!

Databricks with other Azure resources have pretty good documentation, and for automating deployments, these options are essential: learn and use the best option that suits the needs!

Here is the link to my GitHub repo for complete code on using Terraform, Databricks CLI in GitHub Actions! In addition, you can find a bonus learning how to deploy synapse, ADLS, etc., as part of modern data warehouse deployment, which I will cover in my next blog post.

Until then, happy automating!

From C# Developer to DevOps Engineer

Over the last couple of years, I’ve become a DevOps Engineer after having been primarily a C# developer. Instead of primarily C# and SQL, I was now working almost exclusively with JSON, YAML, and PowerShell. While I was very familiar with Visual Studio 2013/2015/2017 and its excellent support for the .NET work I did over the years, I found the experience for building DevOps solutions to be underwhelming. At the time, the Intellisense for Azure Resource Manager (ARM) or Terraform templates, GitLab or Azure DevOps pipelines, and PowerShell was either non-existent or incomplete. In addition, Visual Studio was quite the resource hog when I wasn’t needing all the extras it provides.

Enter Visual Studio (VS) Code

Now, I had downloaded VS Code soon after it was released with the intent to use it at some point, to say I had. However, after seeing Visual Studio Code used in some ARM template videos where snippets were used, I decided to try it out. Like most Integrated Development Environments (IDE), VS Code isn’t truly ready to go right after installation. It’s taken me some time to build up my configuration to where I am today, and I’m still learning about new features and extensions that can improve my productivity. I want to share some of my preferences.

I want to point out a couple of things. First, I’ve been working primarily with GitLab Enterprise, Azure DevOps Services, and the Azure US Government Cloud. Some of these extensions are purely focused on those platforms. Second, I use the Visual Studio Code – Insiders release rather than the regular Visual Studio Code version. I have both installed, but I like having the newest stuff as soon as I can. For this post, that shouldn’t be an issue.


As long as there’s a decent dark color theme, I’m content. The bright/light themes give me headaches over time. VS Code’s default dark theme, Dark+, fits the bill for me.

One of the themes I didn’t know I needed before I stumbled across them was icon themes. I used to have the standard, generic folder and file icons, the Minimal theme in VS Code. That made it difficult to differentiate between PowerShell scripts, ARM templates, and other file types at a glance. There are a few included templates, but I’m using the VSCode Icons Theme. It’s one of the better options, but I’m contemplating making a custom one as this one doesn’t have an icon for Terraform variables files (.tfvars), and I’d like a different icon for YAML files. If the included themes aren’t suitable for you, there are several options for both types of themes and Product Icons themes through the marketplace.

Figure 1 – VS Code’s Minimal icon theme


Workspaces are a collection of folders that are a “collection of one or more folders are opened in a VS Code window.” A workspace file is created that contains a list of the folders and any settings for VS Code and extensions. I’ve only recently started using workspaces because I wanted to have settings configured for different projects.

Extensions in Visual Studio Code provide enhancements to improve productivity. Extensions include code snippets, new language support, debuggers, formatters, and more. I have nearly 60 installed (this includes several Microsoft pre-installs). We will focus on a handful that I rely on regularly.

Workspace Code Configuration
Figure 2 – VS Code Workspace configuration. Also shows the choice of Azure Cloud referenced in the Azure Account extension section below.

Azure Account

The Azure Account extension provides login support for other Azure extensions. By itself, it’s not flashy, but there are a few dozen other Azure extensions that can use the logged-on account from one to reference Azure resources targeted by the others. This extension has a setting, Azure Cloud, that was the main reason I started adopting Workspaces. The default is the commercial version, AzureCloud. I’ve changed it at the user level to AzureUSGoverment, but some of my recent projects use AzureCloud. I’ve set the workspace setting for those.

Azure Resource Manager (ARM) Tools

This extension will make your ARM template tasks much more manageable! It provides an extensive collection of code snippets to scaffolding out many different Azure resources. Schema support provides template autocompletion and linting-like validation. A template navigation pane makes finding resources in a larger template easy. There is also support for parameter files, linked templates, and more.

HashiCorp Terraform

Terraform is an offering of HashiCorp. They’ve provided an extension that supports Terraform files (.tf and .tfvars), including syntax highlighting. While there are only a few snippets included, the autocompletion when defining new blocks (i.e., resources, data) is quite extensive.

Figure 3 – Terraform autocompletion

GitLens – Git Supercharged

GitLens is full of features that make tracking changes in code easily accessible. I installed this extension for the “Current Line Blame” feature that shows who changed the current line last, when they changed it and more. In addition, there are sidebar views for branches, remotes, commits, and file history that I use regularly. There are several other features that I either don’t use or even wasn’t aware of until writing this post, as well this is an excellent tool for Git repo users.
GitLens Line Blame

MSBuild Project Tools

I had a recent project that contained a relatively large MSBuild deployment package that needed to be updated to work with the changes made to migrate the application to Azure. I haven’t worked with MSBuild in several years. When I did, I didn’t have all the syntax and keywords committed to memory. This extension provides some essential support, including element completion and syntax highlighting. It did make the project a little easier to modify.

PowerShell Preview

I’ve become a bit of a PowerShell fan. I had been introduced to it when I was working with SharePoint, but since I’ve been doing DevOps work in conjunction with Azure, I’ve started enjoying writing scripts. The less-than-ideal support for PowerShell (at the time, at least) in Visual Studio 20xx was the main reason I gave VS Code a shot. This extension (or the stable PowerShell extension) provides the excellent IntelliSense, code snippets, and syntax highlighting you’d expect. However, it also has “Go to Definition” and “Find References” features that I relied on when writing C#. In addition, it incorporates linting/code analysis with PowerShell Script Analyzer, which helps you develop clean code that follows best practices.

PowerShell Preview

Powershell (stable)

Wrapping Up

I have far more than these extensions installed, but these are the ones I use the most when doing DevOps work. Some of the others either haven’t been used enough yet, aren’t helpful for a DevOps Engineer, or weren’t interesting enough to list for the sake of brevity.

However, I’ve created a Gist on my GitHub that contains the complete list of extensions I have installed if that’s of interest. Visual Studio Code is an amazing tool that, along with the proper configuration and extensions, has increased my productivity as a DevOps Engineer.

A DoD client requested support with automated file transfers. The client has files placed in a common folder that can be accessed by the standard File Transfer Protocol (FTP). Given the FTP server’s connection information, the client requested the files to be moved to an Amazon Web Services (AWS) S3 bucket where their analysis tools are configured to use.

Automating the download and upload process would save users time by allowing for a scheduled process to transfer data files. This can be achieved using a combination of AWS Lambda and EC2 services. AWS Lambda provides a plethora of triggering and scheduling options and the power to create EC2 instances. By creating an EC2 example, a program or script can avoid Lambdas’ limitations and perform programmatic tasking such as downloading and uploading. Additionally, this can be done using Terraform to allow for deployment in any AWS space.

Writing a Script to Do the Work

Create a Script that can log in to the FTP server, fetch/download files, and copy them to an S3 bucket before using Terraform or AWS console. This can be done effectively with Python’s built-in FTPlib and the AWS boto3 API library. There are various libraries and examples online to show how to set up a Python script to download files from an FTP server and use the boto3 library to copy them to S3.

Consider writing the script that file size will play a significant role in how FTPlib and Boto3’s copy functions work. Anything over 5GB will need to be chunked from the FTP Server and use the multiple file upload methods for the AWS API.

Creating an Instance with Our Script Loaded

Amazon provides Amazon Managed Images (AMI) to start up a basic instance. The provided Linux x86 AMI is the perfect starting place for creating a custom instance and eventually custom AMI.

With Terraform, creating an instance is like creating any other module, requiring Identity and Access Management (IAM) permissions, security group settings, and other configuration settings. The following shows the necessary items needed to make an EC2 instance with a key-pair, permissions to write to s3, install Python3.8 and libraries, and copy the script to do the file transferring into the ec2-user directory.

First, generating a key-pair, a private key, and a public key is used to prove identity when connecting to an instance. The benefit of creating the key-pair in the AWS Console is access to the generated .pem file. Having a local copy will allow for connecting to the instance via the command line, while great for debugging, but not great for deployment. Terraform can be generated and store a key-pair in its memory to avoid passing sensitive information.

# Generate a ssh key that lives in terraform
# https://registry.terraform.io/providers/hashicorp/tls/latest/docs/resources/private_key
resource "tls_private_key" "instance_private_key" {
  algorithm = "RSA"
  rsa_bits  = 4096

resource "aws_key_pair" "instance_key_pair" {
  key_name   = "${var.key_name}"
  public_key = "${tls_private_key.instance_private_key.public_key_openssh}"


To set up the secrSetup, which is the security group to run the instance in, open up the ports for Secure Shell (SSH) and Secure Copy Protocol (SCP) to copy the script file(s) to the instance. A security group acts as a virtual firewall for your EC2 instances to control incoming and outgoing traffic. Then, open other ports for ingress and egress as needed, i.e. 443 for HTTP traffic. The security group will require the vpc_id for your project. This is the Visual Private Cloud (VPC) that the instance will be running. The security group should match up with your VPC settings.

resource "aws_security_group" "instance_sg" {
  name   = "allow-all-sg"
  vpc_id = "${var.vpc_id}"
  ingress {
    description = "ftp port"
    cidr_blocks = [""]
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"

The IAM policy, for instance, will require PutObject access to the S3 bucket. The Terraform module will need the S3 bucket as an environment variable, and a profile instance is created. If creating the IAM policy in the AWS Console, a profile instance is automatically created, but it has to be explicitly defined in Terraform.

#iam instance profile setup
resource "aws_iam_role" "instance_s3_access_iam_role" {
  name               = "instance_s3_access_iam_role"
  assume_role_policy = <<EOF
  "Version": "2012-10-17",
  "Statement": [
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      "Effect": "Allow",
      "Sid": ""
resource "aws_iam_policy" "iam_policy_for_ftp_to_s3_instance" {
  name = "ftp_to_s3_access_policy"

  policy = <<EOF
  "Version": "2012-10-17",
  "Statement": [
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
      "Resource": "arn:aws:s3:::${var.s3_bucket}"

resource "aws_iam_role_policy_attachment" "ftp_to_s3" {
  role       = aws_iam_role.instance_s3_access_iam_role.name
  policy_arn = aws_iam_policy.iam_policy_for_ftp_to_s3_instance.arn

resource "aws_iam_instance_profile" "ftp_to_s3_instance_profile" {
  name = "ftp_to_s3_instance_profile"
  role = "instance_s3_access_iam_role"

Defining the instance to start from and create the custom AMI from in the Terraform will need the following variables:

  • AMI – the AMI of the Linux x86 image
  • instance_type – the type of instance, i.e., t2.micro
  • subnet_id – the subnet string from which VPC the instance will run on
  • key-name – the name of the key, should match the key-pair name generated above or the one from the AWS console, could use a variable reference here too

Define the connection and provisioner attributes to copy the python script to do the file transferring to the ec2-user home folder. The connection will use the default ec2-user using the secure key and then copy over the python file. If using the key downloaded from AWS Console, use the following to point to the file private_key = “${file (“path/to/key-pair-file.pem”)}”.

Complete the instance setup with the correct Python version and library. The user_data attribute sends a bash script to install whatever is needed— in this case, updating Python to 3.8, installing the boto3, and paramiko libraries.

# Instance that we want to build out
resource "aws_instance" "ftp-to-s3-instance" {
  ami           = var.ami
  instance_type = var.instance_type
  subnet_id     = var.subnet_id
  key_name 	   = "${var.key_name}" #use your own key for testing
  security_groups      = ["${aws_security_group.instance_sg.id}"]
  iam_instance_profile = "${aws_iam_instance_profile.ftp_to_s3_instance_profile.id}"

  # Copies the python file to /home/ec2-user
  # depending on how the install of python works we may need to change this location
  connection {
    type        = "ssh"
    user        = "ec2-user"
    host        = "${element(aws_instance.ftp-to-s3-instance.*.public_ip, 0)}"
    private_key = "${tls_private_key.instance_private_key.private_key_pem}"

  provisioner "file" {
    source      = "${path.module}/ftp_to_s3.py"
    destination = "/home/ec2-user/ftp_to_s3.py"

  user_data = <<EOF
sudo amazon-linux-extras install python3.8
python3.8 -m pip install -U pip
pip3.8 --version
pip3.8 install boto3 
pip3.8 install paramiko 


The last step is to create the custom AMI. This will allow our Lambda to duplicate and make as many of these instances as need.

resource "aws_ami_from_instance" "ftp-to-s3-ami" {
  name               = "ftp-to-s3_ami"
  description        = "ftp transfer to s3 bucket python 3.8 script"
  source_instance_id = "${aws_instance.ftp-to-s3-instance.id}"

  depends_on = [aws_instance.ftp-to-s3-instance]

  tags = {
    Name = "ftp-to-s3-ami"

Creating Instances on the Fly in Lambda

Using a Lambda function that can be triggered in various ways is a straightforward way to invoke EC2 instances. The following python code show passing in environment variables to be used in an EC2 instance as both environment variables in the instance and arguments passed to the Python script. The variables needed in the python script for this example are as followed:

  • FTP_HOST – the URL of the FTP server
  • FTP_PATH – the path to the files on the URL server
  • FTP_USERNAME, FTP_PASSWORD, FTP_AUTH – to be used for any authentication for the FTP SERVER
  • S3_BUCKET_NAME – the name of the bucket for the files
  • S3_PATH – the folder or path files should be downloaded to in the S3 bucket
  • Files_to_download – for this purpose, a python list of dictionary objects with filename and size to downloaded.

For this example, the logic for checking for duplicate files is down before the Lambda invoking the instance for transferring is called. This allows the script in the instance to remain singularly focused on downloading and uploading. It is important to note that the files_to_download variable is converted to a string, and the quotes are made into double-quotes. Not doing this will make the single quotes disappear when passing to the EC2 instance.

The init_script variable will use the passed-in event variables to set up the environment variables and python script arguments. Just like when creating the instance, the user_data script is run by the instance’s root user. The root user will need to use the ec2-user’s python to run our script with the following bash command: PYTHONUSERBASE=/home/ec2-user/.local python3.8 /home/ec2-user/ftp_to_s3.py {s3_path} {files_to_download}.

# convert to string with double quotes so it knows its a string
    files_to_download = ",".join(map('"{0}"'.format, files_to_download))
    vars = {
        "FTP_HOST": event["ftp_url"],
        "FTP_PATH": event["ftp_path"],
        "FTP_USERNAME": event["username"],
        "FTP_PASSWORD": event["password"],
        "FTP_AUTH_KEY": event["auth_key"],
        "S3_BUCKET_NAME": event["s3_bucket"],
        "files_to_download": files_to_download,
        "S3_PATH": event["s3_path"],

    init_script = """#!/bin/bash
                /bin/echo "**************************"
                /bin/echo "* Running FTP to S3.     *"
                /bin/echo "**************************"
                export S3_BUCKET_NAME={S3_BUCKET_NAME}
                export PRODUCTS_TABLE={PRODUCTS_TABLE}
                export FTP_HOST={FTP_HOST}
                export FTP_USERNAME={FTP_USERNAME}
                export FTP_PASSWORD={FTP_PASSWORD}
                PYTHONUSERBASE=/home/ec2-user/.local python3.8 /home/ec2-user/ftp_to_s3.py {s3_path} {files_to_download}
                shutdown now -h""".format(

Invoke the instance with the boto3 library providing the parameters for the custom image AMI, Instance type, key-pair, subnet, and instance profile, all defined by Terraform environment variables. Optionally, set the Volume size to 50GB from the default 8GB for larger files.

instance = ec2.run_instances(
        IamInstanceProfile={"Arn": INSTANCE_PROFILE},
        BlockDeviceMappings=[{"DeviceName": "/dev/xvda", "Ebs": {"VolumeSize": 50}}],


After deploying to AWS, Terraform will have created a Lambda that invokes an EC2 instance running the script passed to it during its creation. Triggering the Lambda function to invoke the custom instance can be done from a DynamoDB Stream update, scheduled timer, or even another Lambda function. This provides flexibility on how and when the instance is called.

Ultimately, this solution provides a flexible means of downloading files from an FTP server. Changes to the Lambda invoking the instance could include separating the file list to create several more minor instances to run simultaneously, moving more files faster to the AWS S3 bucket. This greatly depends on the client’s needs and the cost of operating the AWS services.

Changes can also be made to the script downloading the files. One option would be to use more robust FTP libraries than the built-in provided python library. Larger files may require more effort as FTP servers can timeout when network latency and file sizes come into play. Python’s FTPlib does not auto-reconnect, nor does it keep track of incomplete file downloads.