When running production workloads that leverage enterprise services, like DNS, we expect these services to be reliable, available, and responsive. It is assumed that when a client requests an enterprise service, it is logged and takes the most efficient and lowest latency route. In addition, a proper response is expected, successful or fail. What about when that enterprise service cannot handle the number of incoming requests and/or fails to send a proper response? In this blog, I walk through a DNS situation where Azure Databricks bombards DNS servers in attempts to authenticate with Azure AD to access data over an Azure Datalake Storage Gen2 datastore mount. I then provide a solution that, hopefully, allows for a more dynamic DNS implementation in your environment.

Note:This blog post compliments the below-linked article, which was used as a guide to create the init script and implement custom DNS routing.

Configure custom DNS settings using dnsmasq
https://docs.microsoft.com/en-us/azure/databricks/kb/cloud/custom-dns-routing

Note:The Failed Job Message in this blog post is directly related to using Azure Datalake Storage Gen2 (ADLS) as a persistent datastore mount per the configuration guidance detailed below.

Access Azure Data Lake Storage Gen2 or Blob Storage using OAuth 2.0 with an Azure service principal
https://docs.databricks.com/data/data-sources/azure/azure-storage.html#access-azure-data-lake-storage-gen2-or-blob-storage-using-oauth-20-with-an-azure-service-principal

Situation

An enterprise DNS implementation that funnels all requests to a pool of DNS servers for public, private, and Azure Private Link domain/hostname resolution. DNS settings are enforced through virtual network custom DNS settings (DHCP) and other configuration management tools and techniques. Azure Service Endpoint traffic is also routed to only authorized virtual network subnets. Due to strict security policies that enforce logging and traceability, your typical Azure Databricks dedicated subnet DNS traffic can only take advantage of certain Azure DNS optimization and performant routing features, with node-level customization, that is. DNS whitelisting is enabled on the domain “login.microsoftonline.com” to reduce the amount of logging data generated and increase DNS Server performance. This domain is used (heavily) for authentication with Azure Active Directory (AAD) when accessing Azure resources or any Relying Party Trusts using AAD as an Identity Provider.

Assumptions

We assume a managed Azure Databricks Workspace with virtual network integration (public & private subnets).

Azure Databricks cluster nodes (Ubuntu) receive their DNS configuration directly from the virtual network through DHCP. Only the first three (3) DNS servers in the resolv.conf are utilized, even though more than three (3) can be configured.

The Azure Databricks Workspace uses the ADLS Gen2 persistent mounted datastore with a Service Principal and OAuth 2.0 to process data and generates over ten (10) million DNS requests daily. Jobs fail sporadically with the Failed Job Message below. The failures are not reproducible on-demand since domain/hostname resolution succeeds more often than not using the enterprise DNS servers.

Failed Job Message

Job aborted.
Caused by: Job aborted due to stage failure.
Caused by: FileReadException: Error while reading file dbfs://.
Caused by: AbfsRestOperationException: HTTP Error -1; url=’https://login.microsoftonline.com//oauth2/token’ AzureADAuthenticator.getTokenCall threw java.net.UnknownHostException : login.microsoftonline.com
Caused by: AzureADAuthenticator.HttpException: HTTP Error -1; url=’https://login.microsoftonline.com//oauth2/token’ AzureADAuthenticator.getTokenCall threw java.net.UnknownHostException : login.microsoftonline.com

Troubleshooting

A few failed corrective action attempts.

  1. Utilize the host node VIP (168.63.129.16) as a primary, secondary, or tertiary DNS resolver configured in the Azure Virtual Network custom DNS settings.
    a. Outcome: Azure Private Endpoint resolution failure. For this situation, all DNS requests are sent to a pool of DNS servers, as stated above. The Databricks Workspace integrated virtual network is not associated with any Azure Private DNS Zones, so when using this private access method Private Endpoint domain/hostname is not resolvable (properly routed) by 168.63.129.16. Instead, the public address is returned, and traffic is denied at the nearest firewall.
  2. Trying to use the Configure custom DNS settings using dnsmasq Microsoft provided content as is (linked above).
    a. Firewalls will need to allow access from the Azure Databricks subnet(s) to “archive.ubuntu.com”. This is required to install and update Ubuntu packages (dnsmasq), unless your organization has implemented other init scripts to adjust these default settings.
    b. Modify the init script itself to:
Modify Init Script

Solution

Since DNS requests for “login.microsoftonline.com” are already being whitelisted (not logged), the preferred solution is to route this traffic directly, using dnsmasq via a cluster level init script, to the Azure host node virtual IP address (168.63.129.16). The VIP address is accessible by all of the Azure Databricks cluster nodes for certain services and not subject to network security group rules except by service tag, per the article linked below. With this DNS configuration, DNS requests/responses for “login.microsoftonline.com” no longer depend on enterprise DNS or subject to the routing and processing rules, delays, and latency introduced with custom/complex enterprise DNS implementations. The sporadic “java.net.UnknownHostException” for “login.microsoftonline.com” no longer exists on these Azure Databricks clusters.

What is IP address 168.63.129.16?
https://docs.microsoft.com/en-us/azure/virtual-network/what-is-ip-address-168-63-129-16

Azure Databricks Notebook Cell Diff Content to Create Init Script

Azure Databricks Notebook

Final Azure Databricks Notebook Cell Content to Create Init Script

Azure Databricks Cell Content

What is Azure Databricks?

Azure Databricks is a data analytics platform that provides powerful computing capability, and the power comes from the Apache Spark cluster. In addition, Azure Databricks provides a collaborative platform for data engineers to share the clusters and workspaces, which yields higher productivity. Azure Databricks plays a major role in Azure Synapse, Data Lake, Azure Data Factory, etc., in the modern data warehouse architecture and integrates well with these resources.

Data engineers and data architects work together with data and develop the data pipeline for data ingestion with data processing. All data engineers work in a sandbox environment, and when they have verified the data ingestion process, the data pipeline is ready to be moved to Dev/Staging and Production.

Manually moving the data pipeline to staging/production environments via Azure portal will potentially introduce the difference in environments and add a tedious task to repeat manual processes in multiple environments. Automated deployment with service principal credentials is the only solution to move all your work to higher environments. There will be no privilege to configure via the Azure portal as a user. As data engineers complete the data pipeline, Cloud automation engineers will use IaC (Infrastructure as Code) to deploy all Azure resources and configure them via the automation pipeline. That includes all data related to Azure resources and Azure Databricks.

Data engineers work in Databricks with their user account, and it works very well integrating Azure Databricks with Azure key vault using key vault secret scope. All the secrets are persisted in key vault, and Databricks can get the secret value directly via linked service. Databricks uses user credentials to go against Keyvault to get the secret values. This does not work with service principal (SPN) access from Azure Databricks to the key vault. This functionality is requested but not yet there as per this GitHub issue.

JOIN OUR TEAM
Passionate about data? Check out our open data careers and apply to join our quickly growing team today!

Let’s Look at a Scenario

The data team has given automation engineers two requirements:

  • Deploy an Azure Databricks, a cluster, a dbc archive file which contains multiple notebooks in a single compressed file (for more information on dbc file, read here), secret scope, and trigger a post-deployment script.
  • Create a key vault secret scope local to Azure Databricks so the data ingestion process will have secret scope local to Databricks.

Azure Databricks is an Azure native resource, but any configurations within that workspace is not native to Azure. Azure Databricks can be deployed with Hashicorp Terraform code. For Databricks workspace-related artifacts, the Databricks provider needs to be added. For creating a cluster, use this implementation. If you are only uploading a single notebook file for creating a notebook, then use Terraform implementation like this. If not, there is an example below to use Databricks CLI to upload multiple notebook files as a single dbc archive file. The link to my GitHub repo for complete code is at the end of this blog post.

Terraform implementation

terraform {
  required_providers {
    azurerm = "~> 2.78.0"
    azuread = "~> 1.6.0"
    databricks = {
      source = "databrickslabs/databricks"
      version = "0.3.7"
    }
  }

  backend "azurerm" {
    resource_group_name  = "tf_backend_rg"
    storage_account_name = "tfbkndsapoc"
    container_name       = "tfstcont"
    key                  = "data-pipe.tfstate"
  }
}

provider "azurerm" {
  features {}
}

provider "azuread" {
}

data "azurerm_client_config" "current" {
}

// Create Resource Group
resource "azurerm_resource_group" "rgroup" {
  name     = var.resource_group_name
  location = var.location
}

// Create Databricks
resource "azurerm_databricks_workspace" "databricks" {
  name                          = var.databricks_name
  location                      = azurerm_resource_group.rgroup.location
  resource_group_name           = azurerm_resource_group.rgroup.name
  sku                           = "premium"
}

// Databricks Provider
provider "databricks" {
  azure_workspace_resource_id = azurerm_databricks_workspace.databricks.id
  azure_client_id             = var.client_id
  azure_client_secret         = var.client_secret
  azure_tenant_id             = var.tenant_id
}

resource "databricks_cluster" "databricks_cluster" {
  depends_on              = [azurerm_databricks_workspace.databricks]
  cluster_name            = var.databricks_cluster_name
  spark_version           = "8.2.x-scala2.12"
  node_type_id            = "Standard_DS3_v2"
  driver_node_type_id     = "Standard_DS3_v2"
  autotermination_minutes = 15
  num_workers             = 5
  spark_env_vars          = {
    "PYSPARK_PYTHON" : "/databricks/python3/bin/python3"
  }
  spark_conf = {
    "spark.databricks.cluster.profile" : "serverless",
    "spark.databricks.repl.allowedLanguages": "sql,python,r"
  }
  custom_tags = {
    "ResourceClass" = "Serverless"
  }
}

GitHub Actions workflow with Databricks CLI implementation

deploydatabricksartifacts:
    needs: [terraform]
    name: 'Databricks Artifacts Deployment'
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2.3.4
   
    - name: Set up Python 3.0
      uses: actions/setup-python@v2
      with:
        python-version: 3.0

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip

    - name: Download Databricks CLI
      id: databricks_cli
      shell: pwsh
      run: |
        pip install databricks-cli
        pip install databricks-cli --upgrade

    - name: Azure Login
      uses: azure/login@v1
      with:
        creds: ${{ secrets.AZURE_CREDENTIALS }}
   
    - name: Databricks management
      id: api_call_databricks_manage
      shell: bash
      run: |
        # Set DataBricks AAD token env
        export DATABRICKS_AAD_TOKEN=$(curl -X GET -d "grant_type=client_credentials&client_id=${{ env.ARM_CLIENT_ID }}&resource=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d&client_secret=${{ env.ARM_CLIENT_SECRET }}" https://login.microsoftonline.com/${{ env.ARM_TENANT_ID }}/oauth2/token | jq -r ".access_token")

        # Log into Databricks with SPN
        databricks_workspace_url="https://${{ steps.get_databricks_url.outputs.DATABRICKS_URL }}/?o=${{ steps.get_databricks_url.outputs.DATABRICKS_ID }}"
        databricks configure --aad-token --host $databricks_workspace_url

        # Check if workspace notebook already exists
        export DB_WKSP=$(databricks workspace ls /${{ env.TF_VAR_databricks_notebook_name }})
        if [[ "$DB_WKSP" != *"RESOURCE_DOES_NOT_EXIST"* ]];
        then
          databricks workspace delete /${{ env.TF_VAR_databricks_notebook_name }} -r
        fi

        # Import DBC archive to Databricks Workspace
        databricks workspace import Databricks/${{ env.databricks_dbc_name }} /${{ env.TF_VAR_databricks_notebook_name }} -f DBC -l PYTHON

While the above example shows how to leverage Databricks CLI to do automation operations within Databricks, Terraform also provides richer capabilities with Databricks providers. Here is an example of how to add ‘service principal’ to Databricks ‘admins’ group in workspace using Terraform. This is essential for Databricks API to work when connecting as a service principal.

Databricks Creating Cluster
Databricks cluster deployed via Terraform
Jobs Deployed via Terraform
No Jobs have been deployed via Terraform
Databricks CLI
 Job deployed using Databricks CLI in GitHub Actions workflow
Deployment with Databricks
Job triggered via Databricks CLI in GitHub Actions workflow

Not just Terraform and Databricks CLI, but also Databricks API provides similar options to access Databricks artifacts and manage them. For example, to access the clusters in the Databricks:

  • To access clusters, first, authenticate if you are a workspace user via automation or using service principal.
  • If your service principal is already part of the workspaces admins group, use this API to get the clusters list.
  • If the service principal (SPN) is not part of the workspace, use this API that uses access and management tokens.
  • If you would rather add the service principal to Databricks admins workspace group, use this API (same as Terraform option above to add the SPN).

The secret scope in Databricks can be created using Terraform or using Databricks CLI or using Databricks API!

Databricks with other Azure resources have pretty good documentation, and for automating deployments, these options are essential: learn and use the best option that suits the needs!

Here is the link to my GitHub repo for complete code on using Terraform, Databricks CLI in GitHub Actions! In addition, you can find a bonus learning how to deploy synapse, ADLS, etc., as part of modern data warehouse deployment, which I will cover in my next blog post.

Until then, happy automating!