Recently, I was involved in a project where we had requirements. A report needed to be built based on data from several sources, including Azure Cosmos DB and Azure SQL. This report was required to be stored in Azure, and data was subject to a retention policy where it would be unable to be deleted. For this use case, we decided to use Azure Data Factory to read the data, transform the data, and load the data into a CSV file within a blob container in an Azure Storage Account. Then, to prevent the deletion of data, we would create a retention policy on the blob container to enable immutable storage, giving us write-once, read-many capabilities.

Azure Cosmos Data Flow

We found that we needed a Data Flow within Azure Data Factory to perform logic such as joining across our data sources. We created an output sink to a Delimited Text output within an Azure Blob Storage Linked Service within our data flow.

Immutable storage policies divided into retention policies and legal holds can be enabled on a storage account to enforce write-once, read-many policies, which allow new documents to be added to the storage account. But, it prevents the modification and deletion of existing documents. In the case of a retention policy, individual blobs cannot be modified or deleted until the retention period has passed for a given blob. In the case of legal holds, no blobs can be modified or deleted until the legal hold is manually removed from the blob container. Legal holds can be used to fulfill various legal requirements, including keeping data under active investigation from being modified. You can read more about immutable storage in the official Microsoft documentation for Immutable storage for blobs.

Our project had legal requirements for retaining these created documents, so we found immutable storage policies to be a good fit. When we enabled Immutable Storage, either through a retention policy or a legal hold on the Blob Container, we encountered one of two errors:

Job failed due to reason: Spark job failed in one of the cluster nodes while writing data in one partition to sink, with following error message: This operation is not permitted as the blob is immutable due to a policy.

OR

Job failed due to reason: Spark job failed in one of the cluster nodes while writing data in one partition to sink, with the following error message: This operation is not permitted as the blob is immutable due to one or more legal holds.

Here, we will examine why this error occurs, and how we can resolve these errors.

Setup

To simplify our example, I will create a simple Azure Data Factory pipeline that uses a data flow to read from an Azure SQL server, performs a Select operation on that data, and then output a CSV file into an Azure Storage Blob Container.

Data Flow pipeline

I created an Azure SQL server to hold our source data. This server has a database named testdb, and a table named MOCK_DATA with a few columns of varying data types. I then inserted some mock data into the table.

SERVICE OFFERING: REHOST ON AZURE
Initiate your cloud journey on Microsoft technologies to accelerate your cloud transformation. Learn more about the benefits of rehosting.

I then created an Azure Storage Account to hold our final data, with a blob container named “output.” This is where I will store the final CSV files output by the Azure Data Factory.

Azure Storage Account

To make these blobs immutable, I added a legal hold to the output container:

Make blobs immutable

I then created a new Azure Data Factory:

Create and Open Azure Data Factory

Opening Azure Data Factory Studio, I created the following. A new Linked Service, associated with my Azure SQL Server:

Create New Linked Service associated with Azure SQL Server

A second Linked Service, associated with Azure Blob Storage:

Associate Linked Service with Azure Blob Storage

An Azure SQL Database Dataset, with the linked service associated to my SQL Database and the table source associated with my mock data table:

Azure SQL Dataset

An Azure Blob Storage Delimited Text Dataset, associated with my Blob Storage Linked Service, and with two parameters, Container and Blob, that are associated with the File path:

Azure Blob Storage Delimited Text Dataset

A Dataflow, which has the following configuration:

  • A parameter named filename
  • A Source associated with the mock data table dataset, with all default options.
  • A Select activity to simulate processing the data to select specific columns
  • A Sink activity with the following settings:
    • Dataset: The delimited text dataset
    • Settings tab:
      • File name option: Output to a single file
      • File name: $filename
      • Optimize: Single partition

Data Flow Activity and Configuration

A Pipeline with a Data Flow activity, associated with the Data Flow that was just built, with the following parameters:

  • Sink parameters:
    • Container: output
    • Blob: A dynamic content value with an empty string, ”
  • Other parameters:
    • filename: Data flow expression, “file1.csv”

Main Data Flow Parameters

When I run this pipeline, I get the following error:

Job failed due to reason: Spark job failed in one of the cluster nodes while writing data in one partition to sink, with the following error message: This operation is not permitted as the blob is immutable due to one or more legal holds.

Solution

Why do we get an error when running this pipeline? When the Spark data cluster associated with the Data Flow activity runs, it creates several temporary files within the Blob Container. It then attempts to append to these files and delete them after processing.

Error when running pipeline

This violates the write-once, read-many restrictions on the storage blob container so that the Data Factory job will fail.

To resolve this issue, I will need to create a temporary location for these files to be built and then copy that data from one storage account to another. First, I will create a new blob storage container with no retention restrictions, which I will call “temp”:

Temporary Blob Storage Container

Then, on the storage account, to clean up the temporary data, I will enable lifecycle management. I created a new lifecycle management rule on the storage account to delete old files by providing the following settings:

  • Rule scope: Limit blobs with filters
  • Blob type: Block blobs
  • Blob subtype: Base blobs
  • Base blobs rules:
    • Add a rule: If base blobs haven’t been modified in 1 day, then delete the blobs
  • Filter set:
    • Blob prefix: temp/*.csv

Base Blobs Subtype

Create the Filter Set

Next, back in Azure Data Factory, in my pipeline, I changed the dataflow sink parameters to use the temp blob container rather than the output blob container:

Change parameters in dataflow pipeline

The Copy Data activity does not work the same way as a Data Flow sink and does not create temporary files in its process. I created a new Copy Data activity in the pipeline, and attached the data flow’s anchor to the copy data activity:

Create New Data Activity in Pipeline

For the Source of the Copy Data activity, I selected the Delimited Text dataset and provided the following settings:

  • Container: temp
  • Blob: file1.csv

For the Sink of the Copy Data activity, I selected the same Delimited Text dataset and provided the following settings:

  • Container: output
  • Blob: file1.csv

Rerun the pipeline. The pipeline should run successfully, and your output file will exist in both the temp and output blob containers. After our lifecycle management policies find that the temporary blob is over a day old, the temporary blob will be deleted, while the output blob will remain. The retention policy will prevent the output blob from being deleted.

Run Pipeline with output and temp blob containers

Summary

The way the Data Flow activity in Azure Data Factory uses temporary files to build delimited text outputs, using a delimited text output as a Data Sink is not possible on an Azure Storage Blob Container with an immutable storage policy. A temporary storage output destination must exist as the output for the data flow, which is then copied over to the final, firm storage destination with a Copy Data activity to work around this restriction. The Copy Data activity is not processed the same way as the Data Flow sink. It does not create temporary files, so it is suitable for copying data from the temporary container to the output container. The temporary container is then cleaned regularly by creating a lifecycle management policy on the storage account.

What is Azure Databricks?

Azure Databricks is a data analytics platform that provides powerful computing capability, and the power comes from the Apache Spark cluster. In addition, Azure Databricks provides a collaborative platform for data engineers to share the clusters and workspaces, which yields higher productivity. Azure Databricks plays a major role in Azure Synapse, Data Lake, Azure Data Factory, etc., in the modern data warehouse architecture and integrates well with these resources.

Data engineers and data architects work together with data and develop the data pipeline for data ingestion with data processing. All data engineers work in a sandbox environment, and when they have verified the data ingestion process, the data pipeline is ready to be moved to Dev/Staging and Production.

Manually moving the data pipeline to staging/production environments via Azure portal will potentially introduce the difference in environments and add a tedious task to repeat manual processes in multiple environments. Automated deployment with service principal credentials is the only solution to move all your work to higher environments. There will be no privilege to configure via the Azure portal as a user. As data engineers complete the data pipeline, Cloud automation engineers will use IaC (Infrastructure as Code) to deploy all Azure resources and configure them via the automation pipeline. That includes all data related to Azure resources and Azure Databricks.

Data engineers work in Databricks with their user account, and it works very well integrating Azure Databricks with Azure key vault using key vault secret scope. All the secrets are persisted in key vault, and Databricks can get the secret value directly via linked service. Databricks uses user credentials to go against Keyvault to get the secret values. This does not work with service principal (SPN) access from Azure Databricks to the key vault. This functionality is requested but not yet there as per this GitHub issue.

JOIN OUR TEAM
Passionate about data? Check out our open data careers and apply to join our quickly growing team today!

Let’s Look at a Scenario

The data team has given automation engineers two requirements:

  • Deploy an Azure Databricks, a cluster, a dbc archive file which contains multiple notebooks in a single compressed file (for more information on dbc file, read here), secret scope, and trigger a post-deployment script.
  • Create a key vault secret scope local to Azure Databricks so the data ingestion process will have secret scope local to Databricks.

Azure Databricks is an Azure native resource, but any configurations within that workspace is not native to Azure. Azure Databricks can be deployed with Hashicorp Terraform code. For Databricks workspace-related artifacts, the Databricks provider needs to be added. For creating a cluster, use this implementation. If you are only uploading a single notebook file for creating a notebook, then use Terraform implementation like this. If not, there is an example below to use Databricks CLI to upload multiple notebook files as a single dbc archive file. The link to my GitHub repo for complete code is at the end of this blog post.

Terraform implementation

terraform {
  required_providers {
    azurerm = "~> 2.78.0"
    azuread = "~> 1.6.0"
    databricks = {
      source = "databrickslabs/databricks"
      version = "0.3.7"
    }
  }

  backend "azurerm" {
    resource_group_name  = "tf_backend_rg"
    storage_account_name = "tfbkndsapoc"
    container_name       = "tfstcont"
    key                  = "data-pipe.tfstate"
  }
}

provider "azurerm" {
  features {}
}

provider "azuread" {
}

data "azurerm_client_config" "current" {
}

// Create Resource Group
resource "azurerm_resource_group" "rgroup" {
  name     = var.resource_group_name
  location = var.location
}

// Create Databricks
resource "azurerm_databricks_workspace" "databricks" {
  name                          = var.databricks_name
  location                      = azurerm_resource_group.rgroup.location
  resource_group_name           = azurerm_resource_group.rgroup.name
  sku                           = "premium"
}

// Databricks Provider
provider "databricks" {
  azure_workspace_resource_id = azurerm_databricks_workspace.databricks.id
  azure_client_id             = var.client_id
  azure_client_secret         = var.client_secret
  azure_tenant_id             = var.tenant_id
}

resource "databricks_cluster" "databricks_cluster" {
  depends_on              = [azurerm_databricks_workspace.databricks]
  cluster_name            = var.databricks_cluster_name
  spark_version           = "8.2.x-scala2.12"
  node_type_id            = "Standard_DS3_v2"
  driver_node_type_id     = "Standard_DS3_v2"
  autotermination_minutes = 15
  num_workers             = 5
  spark_env_vars          = {
    "PYSPARK_PYTHON" : "/databricks/python3/bin/python3"
  }
  spark_conf = {
    "spark.databricks.cluster.profile" : "serverless",
    "spark.databricks.repl.allowedLanguages": "sql,python,r"
  }
  custom_tags = {
    "ResourceClass" = "Serverless"
  }
}

GitHub Actions workflow with Databricks CLI implementation

deploydatabricksartifacts:
    needs: [terraform]
    name: 'Databricks Artifacts Deployment'
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2.3.4
   
    - name: Set up Python 3.0
      uses: actions/setup-python@v2
      with:
        python-version: 3.0

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip

    - name: Download Databricks CLI
      id: databricks_cli
      shell: pwsh
      run: |
        pip install databricks-cli
        pip install databricks-cli --upgrade

    - name: Azure Login
      uses: azure/login@v1
      with:
        creds: ${{ secrets.AZURE_CREDENTIALS }}
   
    - name: Databricks management
      id: api_call_databricks_manage
      shell: bash
      run: |
        # Set DataBricks AAD token env
        export DATABRICKS_AAD_TOKEN=$(curl -X GET -d "grant_type=client_credentials&client_id=${{ env.ARM_CLIENT_ID }}&resource=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d&client_secret=${{ env.ARM_CLIENT_SECRET }}" https://login.microsoftonline.com/${{ env.ARM_TENANT_ID }}/oauth2/token | jq -r ".access_token")

        # Log into Databricks with SPN
        databricks_workspace_url="https://${{ steps.get_databricks_url.outputs.DATABRICKS_URL }}/?o=${{ steps.get_databricks_url.outputs.DATABRICKS_ID }}"
        databricks configure --aad-token --host $databricks_workspace_url

        # Check if workspace notebook already exists
        export DB_WKSP=$(databricks workspace ls /${{ env.TF_VAR_databricks_notebook_name }})
        if [[ "$DB_WKSP" != *"RESOURCE_DOES_NOT_EXIST"* ]];
        then
          databricks workspace delete /${{ env.TF_VAR_databricks_notebook_name }} -r
        fi

        # Import DBC archive to Databricks Workspace
        databricks workspace import Databricks/${{ env.databricks_dbc_name }} /${{ env.TF_VAR_databricks_notebook_name }} -f DBC -l PYTHON

While the above example shows how to leverage Databricks CLI to do automation operations within Databricks, Terraform also provides richer capabilities with Databricks providers. Here is an example of how to add ‘service principal’ to Databricks ‘admins’ group in workspace using Terraform. This is essential for Databricks API to work when connecting as a service principal.

Databricks Creating Cluster
Databricks cluster deployed via Terraform
Jobs Deployed via Terraform
No Jobs have been deployed via Terraform
Databricks CLI
 Job deployed using Databricks CLI in GitHub Actions workflow
Deployment with Databricks
Job triggered via Databricks CLI in GitHub Actions workflow

Not just Terraform and Databricks CLI, but also Databricks API provides similar options to access Databricks artifacts and manage them. For example, to access the clusters in the Databricks:

  • To access clusters, first, authenticate if you are a workspace user via automation or using service principal.
  • If your service principal is already part of the workspaces admins group, use this API to get the clusters list.
  • If the service principal (SPN) is not part of the workspace, use this API that uses access and management tokens.
  • If you would rather add the service principal to Databricks admins workspace group, use this API (same as Terraform option above to add the SPN).

The secret scope in Databricks can be created using Terraform or using Databricks CLI or using Databricks API!

Databricks with other Azure resources have pretty good documentation, and for automating deployments, these options are essential: learn and use the best option that suits the needs!

Here is the link to my GitHub repo for complete code on using Terraform, Databricks CLI in GitHub Actions! In addition, you can find a bonus learning how to deploy synapse, ADLS, etc., as part of modern data warehouse deployment, which I will cover in my next blog post.

Until then, happy automating!

AIS is working with a large organization that wants to discover relationships between data and the business by iteratively integrating data from many sources into Azure Data Lake. The data will be analyzed by different groups within the organization to discover new factors that might affect the business. Data will then be published to the appropriate consumers using PowerBI.

In the initial phase, data lake ingests data from some of the Operational Systems. Eventually, data will be captured not only from all the organization’s systems but also from streaming data from IoT devices.  

Azure Data Lake 

Azure Data Lake allows us to store a vast amount of data of various types and structures. Data can be analyzed and transformed by Data Scientists and Data Engineers. 

The challenge with any data lake system is preventing it from becoming a data swamp. To establish an inventory of what is in a data lake, we capture the metadata such as origin, size, and content type during ingestion. We also have the Interface Control Document (ICD) from the Operational Systems that describe the data definition of the source data. 

Logical Zones

The data in the data lake is segregated into logical zones to allow logical and physical separation to keep the environment secure, organized and agile. As the data progress through the zones various transformation is performed. 

  • Landing Zone is a place where the original data files are stored untouched. No data is deleted from this zone, and access to this zone is limited.  
  • Raw Zone is a place where data quality validation is applied based on the rules defined in source ICDAny data filed validation moves to Error Zone. 
  • Curated Zone is a place where we store the cleansed and transformed data and ready for consumption. The transformation is done for different audiences, and within the Zone, folders will be created for each specialized change.  
  • Error Zone is a place where we store data that filed validation. A notification is sent to the registered data curators upon arriving new data.  
  • Metadata Zone is a place where we keep track of metadata of the source and the transformed data.Metadata Zone Organization

The source systems have security requirements that prevent access to sensitive data. When the folders are created, permissions are given to security groups in Azure Active Directory. The same security rules are applied to the subsequent folders.

Now that the data is in the data lake, we allow each consuming group to create their own transformation rules. The transformed data is then moved to the curated zone ready to be loaded to the Azure Data Warehouse.

Azure Data Factory

Azure Data Factory orchestrated the movement and transformation of data, as shown in the diagram below. When a file is dropped in the Landing Zone, the Azure Data Factory pipeline that consists of activities to Unzip, Validate, Transform, and Load the data into Data Warehouse.

The unzipping is performed by a custom code Azure Function activity rather than the copy activity’s decompress functionality. The out of box functionality of Azure Data Factory can be used to uncompressed only GZip, Deflate and BZip2 files but not Tar, Rar, 7Zip, Lzip.

The basic validation rules, such as data range, valid values, and reference data, are described in the ICD. A custom Azure Function activity was created to validate the incoming data.

Data is transformed using Spark activity in Azure Data Factory for each consuming user. Each consumer has a folder under the Curated Zone.

Data Processing Example

Tables in the Azure Data Warehouse were created based on the Curated zone by executing the Generate Azure Function activity to create data definition language (DDL). The script modifies the destination table if there is a new field added.

Finally, the data is copied to the destination tables to be used by end-users and warehouse designers.

In each step, we captured business, operational, and technical metadata to help us descript the data in the lake. The metadata information can be uploaded to a metadata management system in the future.