What is Azure Databricks?

Azure Databricks is a data analytics platform whose computing power comes from Apache Spark clusters. In addition, Azure Databricks provides a collaborative platform where data engineers share clusters and workspaces, which yields higher productivity. Azure Databricks plays a major role in the modern data warehouse architecture alongside Azure Synapse, Azure Data Lake, Azure Data Factory, and other services, and it integrates well with these resources.

Data engineers and data architects work together to develop the data pipelines for data ingestion and processing. Data engineers work in a sandbox environment, and once they have verified the data ingestion process, the data pipeline is ready to be promoted to Dev/Staging and Production.

Manually moving the data pipeline to staging/production environments via the Azure portal can introduce differences between environments and turns promotion into a tedious manual process repeated in each environment. Automated deployment with service principal credentials is the only practical way to move the work to higher environments, where users typically have no privilege to configure resources through the Azure portal. As data engineers complete the data pipeline, cloud automation engineers use Infrastructure as Code (IaC) to deploy all Azure resources and configure them via the automation pipeline. That includes all data-related Azure resources as well as Azure Databricks.

Data engineers work in Databricks with their user accounts, and integrating Azure Databricks with Azure Key Vault through a Key Vault-backed secret scope works very well. All secrets are persisted in Key Vault, and Databricks retrieves the secret values directly from the scope, using the user's credentials against Key Vault. This does not work with service principal (SPN) access from Azure Databricks to the key vault; that functionality has been requested but is not yet available, as per this GitHub issue.
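
For example, once a Key Vault-backed secret scope exists, a notebook can read a secret through dbutils; the scope and key names below are placeholders:

# Read a secret from a Key Vault-backed secret scope (scope and key names are placeholders)
storage_key = dbutils.secrets.get(scope="kv-secret-scope", key="storage-account-key")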

Let’s Look at a Scenario

The data team has given automation engineers two requirements:

  • Deploy an Azure Databricks workspace, a cluster, a dbc archive file that bundles multiple notebooks into a single compressed file (for more information on dbc files, read here), and a secret scope, and trigger a post-deployment script.
  • Create a key vault secret scope local to Azure Databricks so the data ingestion process has its secrets available locally in Databricks.

Azure Databricks is an Azure-native resource, but any configuration within the workspace is not native to Azure. Azure Databricks itself can be deployed with HashiCorp Terraform code, and for Databricks workspace-related artifacts the Databricks provider needs to be added. For creating a cluster, use this implementation. If you are only uploading a single notebook file, use a Terraform implementation like this (a sketch follows the Terraform block below). Otherwise, there is an example below of using the Databricks CLI to upload multiple notebook files as a single dbc archive. The link to my GitHub repo with the complete code is at the end of this blog post.

Terraform implementation

terraform {
  required_providers {
    azurerm = "~> 2.78.0"
    azuread = "~> 1.6.0"
    databricks = {
      source = "databrickslabs/databricks"
      version = "0.3.7"
    }
  }

  backend "azurerm" {
    resource_group_name  = "tf_backend_rg"
    storage_account_name = "tfbkndsapoc"
    container_name       = "tfstcont"
    key                  = "data-pipe.tfstate"
  }
}

provider "azurerm" {
  features {}
}

provider "azuread" {
}

data "azurerm_client_config" "current" {
}

// Create Resource Group
resource "azurerm_resource_group" "rgroup" {
  name     = var.resource_group_name
  location = var.location
}

// Create Databricks
resource "azurerm_databricks_workspace" "databricks" {
  name                          = var.databricks_name
  location                      = azurerm_resource_group.rgroup.location
  resource_group_name           = azurerm_resource_group.rgroup.name
  sku                           = "premium"
}

// Databricks Provider
provider "databricks" {
  azure_workspace_resource_id = azurerm_databricks_workspace.databricks.id
  azure_client_id             = var.client_id
  azure_client_secret         = var.client_secret
  azure_tenant_id             = var.tenant_id
}

resource "databricks_cluster" "databricks_cluster" {
  depends_on              = [azurerm_databricks_workspace.databricks]
  cluster_name            = var.databricks_cluster_name
  spark_version           = "8.2.x-scala2.12"
  node_type_id            = "Standard_DS3_v2"
  driver_node_type_id     = "Standard_DS3_v2"
  autotermination_minutes = 15
  num_workers             = 5
  spark_env_vars          = {
    "PYSPARK_PYTHON" : "/databricks/python3/bin/python3"
  }
  spark_conf = {
    "spark.databricks.cluster.profile" : "serverless",
    "spark.databricks.repl.allowedLanguages": "sql,python,r"
  }
  custom_tags = {
    "ResourceClass" = "Serverless"
  }
}
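
As mentioned above, when only a single notebook file needs to be uploaded, the provider's databricks_notebook resource is enough. A minimal sketch, assuming a local notebook file (the paths and names below are illustrative, not from the repo):

// Upload a single notebook file to the workspace (paths and names are placeholders)
resource "databricks_notebook" "ingest_notebook" {
  source   = "${path.module}/notebooks/ingest.py"
  path     = "/Shared/ingest"
  language = "PYTHON"
}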

GitHub Actions workflow with Databricks CLI implementation

deploydatabricksartifacts:
    needs: [terraform]
    name: 'Databricks Artifacts Deployment'
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2.3.4
   
    - name: Set up Python 3.8
      uses: actions/setup-python@v2
      with:
        python-version: '3.8'

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip

    - name: Download Databricks CLI
      id: databricks_cli
      shell: pwsh
      run: |
        pip install databricks-cli
        pip install databricks-cli --upgrade

    - name: Azure Login
      uses: azure/login@v1
      with:
        creds: ${{ secrets.AZURE_CREDENTIALS }}
   
    - name: Databricks management
      id: api_call_databricks_manage
      shell: bash
      run: |
        # Set DataBricks AAD token env
        export DATABRICKS_AAD_TOKEN=$(curl -X POST -d "grant_type=client_credentials&client_id=${{ env.ARM_CLIENT_ID }}&resource=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d&client_secret=${{ env.ARM_CLIENT_SECRET }}" https://login.microsoftonline.com/${{ env.ARM_TENANT_ID }}/oauth2/token | jq -r ".access_token")

        # Log into Databricks with SPN
        databricks_workspace_url="https://${{ steps.get_databricks_url.outputs.DATABRICKS_URL }}/?o=${{ steps.get_databricks_url.outputs.DATABRICKS_ID }}"
        databricks configure --aad-token --host $databricks_workspace_url

        # Check if workspace notebook already exists
        export DB_WKSP=$(databricks workspace ls /${{ env.TF_VAR_databricks_notebook_name }} 2>&1)
        if [[ "$DB_WKSP" != *"RESOURCE_DOES_NOT_EXIST"* ]];
        then
          databricks workspace delete /${{ env.TF_VAR_databricks_notebook_name }} -r
        fi

        # Import DBC archive to Databricks Workspace
        databricks workspace import Databricks/${{ env.databricks_dbc_name }} /${{ env.TF_VAR_databricks_notebook_name }} -f DBC -l PYTHON

While the above example shows how to leverage the Databricks CLI to automate operations within Databricks, Terraform provides richer capabilities through the Databricks provider. Here is an example of how to add a service principal to the Databricks 'admins' group in the workspace using Terraform, which is essential for the Databricks API to work when connecting as a service principal.
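
A minimal sketch of that pattern, assuming the Databricks provider configured above and a provider version that includes the service principal resources (resource names are illustrative):

// Look up the built-in workspace admins group
data "databricks_group" "admins" {
  display_name = "admins"
}

// Register the deployment service principal in the workspace
resource "databricks_service_principal" "automation_spn" {
  application_id = var.client_id
}

// Add the service principal to the admins group
resource "databricks_group_member" "spn_admin" {
  group_id  = data.databricks_group.admins.id
  member_id = databricks_service_principal.automation_spn.id
}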

Screenshots: the Databricks cluster deployed via Terraform; no jobs deployed via Terraform; a job deployed using the Databricks CLI in the GitHub Actions workflow; and the job triggered via the Databricks CLI in the GitHub Actions workflow.

Not just Terraform and the Databricks CLI: the Databricks REST API also provides similar options to access and manage Databricks artifacts. For example, to access the clusters in Databricks:

  • To access clusters, first authenticate, either as a workspace user via automation or as a service principal.
  • If your service principal is already part of the workspace admins group, use this API to get the clusters list (see the sketch after this list).
  • If the service principal (SPN) is not part of the workspace, use this API that uses access and management tokens.
  • If you would rather add the service principal to the Databricks workspace admins group, use this API (the same as the Terraform option above to add the SPN).
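
As a sketch of the clusters list call, with an AAD access token already acquired as in the workflow above (the workspace URL is a placeholder):

# List clusters via the Databricks REST API using an AAD access token (workspace URL is a placeholder)
curl -s -H "Authorization: Bearer $DATABRICKS_AAD_TOKEN" \
  https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/list | jq -r '.clusters[].cluster_name'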

The secret scope in Databricks can be created using Terraform, the Databricks CLI, or the Databricks API!
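
For instance, with the Databricks CLI (authenticated with an AAD token, since Key Vault-backed scopes require it; the vault name and resource ID below are placeholders), the scope from the scenario could be created roughly like this:

# Create a Key Vault-backed secret scope (vault name and resource ID are placeholders)
databricks secrets create-scope --scope kv-secret-scope \
  --scope-backend-type AZURE_KEYVAULT \
  --resource-id "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<kv-name>" \
  --dns-name "https://<kv-name>.vault.azure.net/"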

Databricks, like other Azure resources, has pretty good documentation, and these options are essential for automating deployments: learn them and use the one that best suits your needs!

Here is the link to my GitHub repo with the complete code for using Terraform and the Databricks CLI in GitHub Actions! As a bonus, you can also find how to deploy Synapse, ADLS, and more as part of a modern data warehouse deployment, which I will cover in my next blog post.

Until then, happy automating!

Deployment Scripts in ARM Templates

Have you been in a situation where an Azure Resource Manager (ARM) template with Azure Blueprints is the only option available for Infrastructure as Code (IaC) deployment? Have you realized that certain operations during deployment are not possible with an ARM template? Have you been in a situation where PowerShell or the Azure CLI could get the job done, but there was no way to inject that script into the ARM template? Maybe you have felt ARM-disabled at times like these. If you answered yes to at least one of these situations, this blog is for you. Let’s dive into a cool feature, ‘Deployment Scripts in ARM Templates,’ which helped me overcome these hurdles!

Azure managed disks by default have Server-Side Encryption (SSE) with a Platform-Managed Key (PMK), identified as SSE + PMK. I had the requirement to encrypt VMs (Windows or Linux) with either SSE or Azure Disk Encryption (ADE) using a Customer-Managed Key (CMK). CMK provides an additional security layer, as the customer manages the keys and can rotate them periodically. Both types of encryption require an RSA-based Key Vault key. While SSE and ADE with CMK can be implemented using PowerShell or the Azure CLI, I only had an ARM template deployment option. An ARM template can create key vault secrets but cannot create key vault keys. For SSE encryption, a ‘disk encryption set‘ also needs to be created. In the automated deployment, the key vault key and disk encryption set must already exist so that the virtual machine deployment can consume the key vault key to encrypt the VM and OS/data disks.

The following picture shows default encryption on a VM managed disk – SSE with PMK (by default):

VM Disk

Choosing an encryption type is based on customer requirements. Azure offers SSE + PMK as a default feature, which provides encryption at rest. Either SSE + CMK or ADE + CMK can be applied on top of the default encryption. For more information on SSE and ADE, please read this great blog post. It explains the fundamental differences between these types of encryption, when to choose one over the other, and the caveats to watch out for after encryption is applied.

Microsoft explains the difference between SSE and ADE by stating that Azure Disk Encryption leverages the BitLocker feature of Windows to encrypt managed disks with Customer-Managed Keys within the guest VM. Server-Side Encryption with Customer-Managed Keys improves on ADE by enabling you to use any OS type and images for your VMs by encrypting data in the Storage service.

The following screenshot shows Managed disk SSE + CMK encryption implemented via ARM template at the time of VM creation using Disk Encryption Set:

SSE + CMK encryption

The following screenshot shows Managed disk ADE + CMK encryption implemented via ARM Custom Extension:

ADE + CMK

The following screenshot shows ARM Custom Extension for ADE + CMK encryption:

ARM Custom Extension

The following screenshot shows how Key Vault secret is used by writing ‘Encrypted BEK’ for ADE + CMK encryption:

Key Vault secret

While I was learning more about disk encryption, I was still contemplating disk encryption options in the ARM template. Then the Microsoft ARM template team announced a cool new feature: deployment scripts! It was announced at the MS Build event, which you can view here for more information. Deployment scripts are a new ARM resource that can run PowerShell and Azure CLI scripts from within the ARM template! The feature came just in time to resolve my ARM deployment impediments. The implementation creates a storage account, copies the script from the ARM template into a file share, and runs an Azure container instance to execute the script. A user-assigned managed identity is created, given permission to the resource group, and added to the key vault access policies; the script executes as this identity. At this time, no system-assigned identity is supported. The deployment scripts resource is available in both Azure public and Azure Government, wherever Azure container instances are available. Both the storage account and the container instance are deleted after the deployment scripts resource deploys successfully.

The following screenshot is the deployment scripts ARM template resource:

Deployment Scripts ARM template
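
For reference, a stripped-down sketch of a Microsoft.Resources/deploymentScripts resource is shown below; the name, identity, script URI, and arguments are placeholders rather than the exact values used in the repo:

{
  "type": "Microsoft.Resources/deploymentScripts",
  "apiVersion": "2020-10-01",
  "name": "createKvKeyAndDiskEncryptionSet",
  "location": "[resourceGroup().location]",
  "kind": "AzurePowerShell",
  "identity": {
    "type": "UserAssigned",
    "userAssignedIdentities": {
      "[resourceId('Microsoft.ManagedIdentity/userAssignedIdentities', parameters('identityName'))]": {}
    }
  },
  "properties": {
    "azPowerShellVersion": "5.0",
    "primaryScriptUri": "[parameters('scriptUri')]",
    "arguments": "[concat('-keyVaultName ', parameters('keyVaultName'))]",
    "timeout": "PT30M",
    "cleanupPreference": "OnSuccess",
    "retentionInterval": "P1D"
  }
}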

The following screenshot shows the container instance and storage account resources used by deployment scripts; they are deleted after the deployment script executes successfully. The user-assigned managed identity is created in the ARM template to execute the deployment script. The deployment script creates the Key Vault, the keys in the key vault, and the Disk Encryption Set.

Container Instance
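
A minimal sketch of the kind of PowerShell such a deployment script might run, assuming illustrative resource names (this is not the exact script from the repo):

# Create the RSA key vault key used for disk encryption (names are placeholders)
$key = Add-AzKeyVaultKey -VaultName 'my-kv' -Name 'disk-kek' -Destination 'Software' -Size 2048

# Create the disk encryption set that points at the key
$desConfig = New-AzDiskEncryptionSetConfig -Location 'eastus2' `
  -SourceVaultId (Get-AzKeyVault -VaultName 'my-kv').ResourceId `
  -KeyUrl $key.Key.Kid -IdentityType SystemAssigned
$des = $desConfig | New-AzDiskEncryptionSet -ResourceGroupName 'deploy-script-rg' -Name 'my-des'

# Grant the disk encryption set's identity access to the key
Set-AzKeyVaultAccessPolicy -VaultName 'my-kv' -ObjectId $des.Identity.PrincipalId `
  -PermissionsToKeys wrapkey, unwrapkey, get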

Additional features in deployment scripts resource:

  • The scripts can be either embedded inline or referenced from a location accessible from the deployment.
  • The output from the deployment scripts resource can be consumed/referenced by other resources in the deployment.
  • An existing storage account can be referenced, and that storage account will be used instead of creating a temporary one.
  • Any task that can be done via PowerShell or the Azure CLI can be done in deployment scripts; for example, a storage account can be encrypted with CMK via PowerShell in a deployment script.

Security considerations:

  • Deployment scripts create a Standard_LRS storage account, so if a geo-redundancy security policy is enabled, the policy may flag it. Because the storage account is deleted once the deployment scripts finish, it will not remain non-compliant.
  • Storage account access must be public when using an existing storage account for the deployment script. This allows the script to be copied to the file share and gives the container instance permission to execute the script.
  • The Key Vault firewall needs to be turned off for the deployment scripts to be able to perform operations on key vault keys and certificates, but the firewall can be re-enabled from PowerShell after the work is done.

The following screenshot shows the ARM template Key Vault resource with no firewall restriction:

ARM Template

In the following screenshot, I turn the key vault firewall back on in the PowerShell script after the tasks are done:

Enable Firewall access
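
A one-liner along these lines restores the firewall once the script work is complete (the vault name is a placeholder):

# Re-enable the key vault firewall after the deployment script has finished (vault name is a placeholder)
Update-AzKeyVaultNetworkRuleSet -VaultName 'my-kv' -DefaultAction Deny -Bypass AzureServices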

ARM template for SSE + CMK disk encryption:

SSE + CMK is applied when the disk-encryption-type parameter is set to ‘SSE’. If it is ‘ADE’ then no SSE is applied.

SSE + CMK is applied
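
As a rough illustration of the SSE path (the parameter names are mine, not necessarily those in the repo), the VM's managed disk references the disk encryption set like this:

"osDisk": {
  "createOption": "FromImage",
  "managedDisk": {
    "storageAccountType": "Premium_LRS",
    "diskEncryptionSet": {
      "id": "[resourceId('Microsoft.Compute/diskEncryptionSets', parameters('diskEncryptionSetName'))]"
    }
  }
}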

ARM Template for ADE + CMK Disk Encryption using VM Extension

ARM Template for ADE
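
For reference, the AzureDiskEncryption VM extension for a Windows VM generally takes the following shape; the parameter names are placeholders, and Linux VMs use the AzureDiskEncryptionForLinux handler instead:

{
  "type": "Microsoft.Compute/virtualMachines/extensions",
  "apiVersion": "2020-06-01",
  "name": "[concat(parameters('vmName'), '/AzureDiskEncryption')]",
  "location": "[resourceGroup().location]",
  "dependsOn": [
    "[resourceId('Microsoft.Compute/virtualMachines', parameters('vmName'))]"
  ],
  "properties": {
    "publisher": "Microsoft.Azure.Security",
    "type": "AzureDiskEncryption",
    "typeHandlerVersion": "2.2",
    "autoUpgradeMinorVersion": true,
    "settings": {
      "EncryptionOperation": "EnableEncryption",
      "KeyVaultURL": "[parameters('keyVaultUrl')]",
      "KeyVaultResourceId": "[parameters('keyVaultResourceId')]",
      "KeyEncryptionKeyURL": "[parameters('keyEncryptionKeyUrl')]",
      "KekVaultResourceId": "[parameters('keyVaultResourceId')]",
      "KeyEncryptionAlgorithm": "RSA-OAEP",
      "VolumeType": "All"
    }
  }
}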

In the code example, the newly created Azure key vault key is saved as a secret so that it can be accessed from an ARM template. ARM cannot access key vault keys directly, but it can access secrets using a key vault reference in the template parameter, like below.

Template Reference parameter
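
That kind of reference in a parameters file typically looks like this (the vault ID and secret name are placeholders):

"keyEncryptionKeyUrl": {
  "reference": {
    "keyVault": {
      "id": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<kv-name>"
    },
    "secretName": "kek-url"
  }
}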

Instead of using secrets, another approach is to write the key vault key (KeK ID) as deployment scripts output and consume that output in the VM ARM template.

The following screenshot shows how to write output from deployment scripts resource in keyvault.json:

Write up output

The following screenshot shows how to consume the output written in deployment scripts resource in vm.json for KeK URL:

Consume the output
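
In outline, the two sides of that hand-off look like this; the output name and deployment scripts resource name are mine, not necessarily those in the repo. Inside the deployment script, PowerShell publishes the value:

# Publish the KeK URL as a deployment script output (output name is a placeholder)
$DeploymentScriptOutputs = @{}
$DeploymentScriptOutputs['keKUrl'] = $key.Key.Kid

The consuming template can then read it back with a reference() expression such as:

"[reference(resourceId('Microsoft.Resources/deploymentScripts', 'createKvKeyAndDiskEncryptionSet'), '2020-10-01').outputs.keKUrl]"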

For full access to the code for using ‘deployment scripts’ in an ARM template to create key vault keys and a disk encryption set, and to encrypt VM disks with SSE or ADE, please follow the link to get the code.

Once downloaded, follow these steps to test in your environment (Azure Cloud or Azure Government). Update the parameters before deploying:

Connect-AzAccount -Tenant "000000-000-0000-0000" -SubscriptionId "000000-000-0000-0000" -EnvironmentName AzureCloud

# Create Resource Group
$rg = New-AzResourceGroup -Name "deploy-script-rg" -Location "eastus2"

# Create Key Vault with ‘deployment scripts’
New-AzResourceGroupDeployment -ResourceGroupName $rg.ResourceGroupName -TemplateFile "C:\deploymentScriptsARM\keyvault.json" -TemplateParameterFile "C:\deploymentScriptsARM\keyvault.parameters.json"

# Create Virtual Machine with disk encryption
New-AzResourceGroupDeployment -ResourceGroupName $rg.ResourceGroupName -TemplateFile "C:\deploymentScriptsARM\vm.json" -TemplateParameterFile "C:\deploymentScriptsARM\vm.parameters.json"

Happy ARM Templating for IaC!!!