Recently, I was involved in a project with the following requirements: a report needed to be built based on data from several sources, including Azure Cosmos DB and Azure SQL; the report had to be stored in Azure; and the data was subject to a retention policy that prevented it from being deleted. For this use case, we decided to use Azure Data Factory to read, transform, and load the data into a CSV file within a blob container in an Azure Storage Account. Then, to prevent deletion of the data, we would create a retention policy on the blob container to enable immutable storage, giving us write-once, read-many capabilities.

Azure Cosmos Data Flow

We found that we needed a Data Flow within Azure Data Factory to perform logic such as joining across our data sources. Within the data flow, we created an output sink to a Delimited Text output backed by an Azure Blob Storage Linked Service.

Immutable storage policies, divided into retention policies and legal holds, can be enabled on a storage account to enforce write-once, read-many behavior: new documents can still be added to the storage account, but existing documents cannot be modified or deleted. With a retention policy, individual blobs cannot be modified or deleted until the retention period has passed for a given blob. With a legal hold, no blobs can be modified or deleted until the legal hold is manually removed from the blob container. Legal holds can be used to fulfill various legal requirements, including keeping data under active investigation from being modified. You can read more in the official Microsoft documentation, Immutable storage for blobs.

Our project had legal requirements for retaining these created documents, so we found immutable storage policies to be a good fit. When we enabled Immutable Storage, either through a retention policy or a legal hold on the Blob Container, we encountered one of two errors:

Job failed due to reason: Spark job failed in one of the cluster nodes while writing data in one partition to sink, with following error message: This operation is not permitted as the blob is immutable due to a policy.

OR

Job failed due to reason: Spark job failed in one of the cluster nodes while writing data in one partition to sink, with the following error message: This operation is not permitted as the blob is immutable due to one or more legal holds.

Here, we will examine why these errors occur and how to resolve them.

Setup

To simplify our example, I will create a simple Azure Data Factory pipeline that uses a data flow to read from an Azure SQL server, perform a Select operation on that data, and then output a CSV file into an Azure Storage blob container.

Data Flow pipeline

I created an Azure SQL server to hold our source data. This server has a database named testdb, and a table named MOCK_DATA with a few columns of varying data types. I then inserted some mock data into the table.
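
For reference, a table like this can be created and seeded with a few lines of Python; the schema and values below are illustrative stand-ins for the real mock data, and the server name and credentials are placeholders:

```python
import pyodbc

# Hypothetical connection details; substitute your own server, database, and credentials.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<your-sql-server>.database.windows.net,1433;"
    "Database=testdb;Uid=<user>;Pwd=<password>;Encrypt=yes;"
)
cursor = conn.cursor()

# MOCK_DATA with a few columns of varying data types (illustrative schema only).
cursor.execute("""
    CREATE TABLE MOCK_DATA (
        id INT PRIMARY KEY,
        first_name NVARCHAR(50),
        email NVARCHAR(100),
        created_at DATETIME2
    )
""")
cursor.execute(
    "INSERT INTO MOCK_DATA (id, first_name, email, created_at) VALUES (?, ?, ?, SYSUTCDATETIME())",
    1, "Ada", "ada@example.com",
)
conn.commit()
```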

I then created an Azure Storage Account to hold our final data, with a blob container named “output.” This is where I will store the final CSV files output by the Azure Data Factory.

Azure Storage Account

To make these blobs immutable, I added a legal hold to the output container:
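
The same legal hold can also be applied programmatically. A minimal sketch using the azure-mgmt-storage SDK, with hypothetical subscription, resource group, and account names:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import LegalHold

# Hypothetical names; replace with your subscription, resource group, and storage account.
storage_client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Add a legal hold tag to the "output" container, making its blobs immutable
# until the hold is cleared.
storage_client.blob_containers.set_legal_hold(
    resource_group_name="<resource-group>",
    account_name="<storage-account>",
    container_name="output",
    legal_hold=LegalHold(tags=["adf-report-hold"]),
)
```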

Make blobs immutable

I then created a new Azure Data Factory:

Create and Open Azure Data Factory

Opening Azure Data Factory Studio, I created the following resources. First, a new Linked Service associated with my Azure SQL Server:

Create New Linked Service associated with Azure SQL Server

A second Linked Service, associated with Azure Blob Storage:

Associate Linked Service with Azure Blob Storage
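
For teams that script their factories, roughly equivalent Linked Services can be created with the azure-mgmt-datafactory SDK. This is a minimal sketch, using hypothetical subscription, resource group, factory, and connection-string placeholders rather than the exact values from the Studio screenshots:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobStorageLinkedService,
    AzureSqlDatabaseLinkedService,
    LinkedServiceResource,
    SecureString,
)

# Hypothetical names; replace with your own subscription, resource group, and factory.
adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "<resource-group>", "<data-factory-name>"

# Linked Service for the Azure SQL source.
sql_ls = LinkedServiceResource(
    properties=AzureSqlDatabaseLinkedService(
        connection_string=SecureString(value="<azure-sql-connection-string>")
    )
)
adf.linked_services.create_or_update(rg, factory, "AzureSqlLinkedService", sql_ls)

# Linked Service for the Azure Blob Storage sink.
blob_ls = LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(
        connection_string="<blob-storage-connection-string>"
    )
)
adf.linked_services.create_or_update(rg, factory, "AzureBlobStorageLinkedService", blob_ls)
```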

An Azure SQL Database Dataset, with the linked service associated to my SQL Database and the table source associated with my mock data table:

Azure SQL Dataset

An Azure Blob Storage Delimited Text Dataset, associated with my Blob Storage Linked Service, and with two parameters, Container and Blob, that are associated with the File path:

Azure Blob Storage Delimited Text Dataset
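
A sketch of the same parameterized Delimited Text dataset defined through that SDK, reusing the hypothetical adf client, rg, and factory names from the previous sketch; the dataset and linked service names are also assumptions:

```python
from azure.mgmt.datafactory.models import (
    AzureBlobStorageLocation,
    DatasetResource,
    DelimitedTextDataset,
    LinkedServiceReference,
    ParameterSpecification,
)

# The Container and Blob parameters are wired into the file path as expressions.
delimited_ds = DatasetResource(
    properties=DelimitedTextDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="AzureBlobStorageLinkedService"
        ),
        parameters={
            "Container": ParameterSpecification(type="String"),
            "Blob": ParameterSpecification(type="String"),
        },
        location=AzureBlobStorageLocation(
            container={"value": "@dataset().Container", "type": "Expression"},
            file_name={"value": "@dataset().Blob", "type": "Expression"},
        ),
        column_delimiter=",",
        first_row_as_header=True,
    )
)
adf.datasets.create_or_update(rg, factory, "DelimitedTextOutput", delimited_ds)
```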

A Dataflow, which has the following configuration:

  • A parameter named filename
  • A Source associated with the mock data table dataset, with all default options.
  • A Select activity to simulate processing the data to select specific columns
  • A Sink activity with the following settings:
    • Dataset: The delimited text dataset
    • Settings tab:
      • File name option: Output to a single file
      • File name: $filename
      • Optimize: Single partition

Data Flow Activity and Configuration

A Pipeline with a Data Flow activity, associated with the Data Flow that was just built, with the following parameters:

  • Sink parameters:
    • Container: output
    • Blob: A dynamic content value set to an empty string, ''
  • Other parameters:
    • filename: Data flow expression, “file1.csv”

Main Data Flow Parameters

When I run this pipeline, I get the following error:

Job failed due to reason: Spark job failed in one of the cluster nodes while writing data in one partition to sink, with the following error message: This operation is not permitted as the blob is immutable due to one or more legal holds.

Solution

Why do we get an error when running this pipeline? When the Spark cluster associated with the Data Flow activity runs, it creates several temporary files within the blob container. It then attempts to append to those files and to delete them after processing.

Error when running pipeline

This violates the write-once, read-many restrictions on the blob container, so the Data Factory job fails.
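
You can reproduce the restriction outside of Data Factory. Here is a small sketch using the azure-storage-blob SDK (the connection string and blob name are placeholders): creating a new blob succeeds, but the delete call fails with the same immutability error once the legal hold is in place:

```python
from azure.core.exceptions import HttpResponseError
from azure.storage.blob import BlobServiceClient

# Placeholder connection string for the storage account with the immutable container.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="output", blob="scratch.csv")

blob.upload_blob(b"id,name\n1,test\n")  # Creating a new blob is allowed under WORM.

try:
    blob.delete_blob()  # Deleting (or overwriting) an existing blob is not.
except HttpResponseError as err:
    print(err.message)  # "This operation is not permitted as the blob is immutable..."
```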

To resolve this issue, I will need to create a temporary location where these files can be built, then copy the finished data from that temporary container into the immutable one. First, I will create a new blob storage container with no retention restrictions, which I will call “temp”:

Temporary Blob Storage Container

Then, to clean up the temporary data, I enabled lifecycle management on the storage account and created a rule to delete old files with the following settings (a scripted sketch of the same rule follows the list):

  • Rule scope: Limit blobs with filters
  • Blob type: Block blobs
  • Blob subtype: Base blobs
  • Base blobs rules:
    • Add a rule: If base blobs haven’t been modified in 1 day, then delete the blobs
  • Filter set:
    • Blob prefix: temp/*.csv
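
Here is a sketch of that lifecycle rule defined through the azure-mgmt-storage SDK, reusing the hypothetical storage_client from the legal-hold sketch above. In the SDK, the prefix filter is a literal prefix, so temp/ scopes the rule to blobs in the temporary container:

```python
from azure.mgmt.storage.models import (
    DateAfterModification,
    ManagementPolicy,
    ManagementPolicyAction,
    ManagementPolicyBaseBlob,
    ManagementPolicyDefinition,
    ManagementPolicyFilter,
    ManagementPolicyRule,
    ManagementPolicySchema,
)

# Delete base block blobs in the temp container once they are a day old.
rule = ManagementPolicyRule(
    name="delete-temp-csv",
    type="Lifecycle",
    definition=ManagementPolicyDefinition(
        actions=ManagementPolicyAction(
            base_blob=ManagementPolicyBaseBlob(
                delete=DateAfterModification(days_after_modification_greater_than=1)
            )
        ),
        filters=ManagementPolicyFilter(blob_types=["blockBlob"], prefix_match=["temp/"]),
    ),
)

storage_client.management_policies.create_or_update(
    resource_group_name="<resource-group>",
    account_name="<storage-account>",
    management_policy_name="default",
    properties=ManagementPolicy(policy=ManagementPolicySchema(rules=[rule])),
)
```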

Base Blobs Subtype

Create the Filter Set

Next, back in Azure Data Factory, in my pipeline, I changed the dataflow sink parameters to use the temp blob container rather than the output blob container:

Change parameters in dataflow pipeline

The Copy Data activity does not work the same way as a Data Flow sink; it does not create temporary files as part of its processing. I created a new Copy Data activity in the pipeline and connected the data flow activity’s success output to it:

Create New Data Activity in Pipeline

For the Source of the Copy Data activity, I selected the Delimited Text dataset and provided the following settings:

  • Container: temp
  • Blob: file1.csv

For the Sink of the Copy Data activity, I selected the same Delimited Text dataset and provided the following settings (a sketch of the full activity follows the list):

  • Container: output
  • Blob: file1.csv
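
For reference, here is a rough sketch of how such a Copy Data activity could be defined with the same SDK. It reuses the hypothetical client, dataset, and parameter names from the earlier sketches and, for brevity, omits the Data Flow activity and the dependency between the two activities:

```python
from azure.mgmt.datafactory.models import (
    CopyActivity,
    DatasetReference,
    DelimitedTextSink,
    DelimitedTextSource,
    DelimitedTextWriteSettings,
    PipelineResource,
)

# Copy file1.csv from the mutable temp container to the immutable output container.
copy_activity = CopyActivity(
    name="CopyTempToOutput",
    inputs=[DatasetReference(
        type="DatasetReference",
        reference_name="DelimitedTextOutput",
        parameters={"Container": "temp", "Blob": "file1.csv"},
    )],
    outputs=[DatasetReference(
        type="DatasetReference",
        reference_name="DelimitedTextOutput",
        parameters={"Container": "output", "Blob": "file1.csv"},
    )],
    source=DelimitedTextSource(),
    sink=DelimitedTextSink(format_settings=DelimitedTextWriteSettings(file_extension=".csv")),
)

adf.pipelines.create_or_update(
    rg, factory, "CopyReportPipeline", PipelineResource(activities=[copy_activity])
)
```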

Rerun the pipeline. The pipeline should run successfully, and your output file will exist in both the temp and output blob containers. Once the lifecycle management policy finds that the temporary blob is over a day old, the temporary blob will be deleted, while the output blob will remain. The immutable storage policy will prevent the output blob from being deleted.

Run Pipeline with output and temp blob containers

Summary

Because the Data Flow activity in Azure Data Factory uses temporary files to build delimited text outputs, a delimited text Data Sink cannot write directly to an Azure Storage blob container with an immutable storage policy. To work around this restriction, the data flow must write to a temporary storage destination, which a Copy Data activity then copies to the final, immutable destination. The Copy Data activity is not processed the same way as the Data Flow sink and does not create temporary files, so it is suitable for copying data from the temporary container to the output container. The temporary container is then cleaned up regularly by a lifecycle management policy on the storage account.

Bridging the gap between serverless OLTP and Analytics

Cosmos DB is unabashedly a document-based Online Transactional Processing (OLTP) system. It was engineered at its core to provide low-latency, high-throughput transactions, with service level agreements (SLAs), consistency models, and guarantees to back it. It does this exceptionally well, but there are trade-offs for these architectural decisions. Operations that other databases handle natively, such as complex queries with joins and aggregates or set-based operations, either have no native solution in Cosmos DB or are too resource-prohibitive. Capabilities in Cosmos DB continue to evolve, but at a certain point these workloads require another solution that is better suited for the challenge. One of these gaps has now been bridged with Azure Synapse Link for Cosmos DB.

Azure Synapse is Microsoft’s consolidated data analytics platform that brings together data ingestion, transformation, machine learning (ML) training, testing, management, security, monitoring, and visualization all in one place. With Synapse Link, Cosmos DB data can participate in this eco-system.

Under the hood, Cosmos DB data is replicated from its row-based transactional store to a column-based analytical store that sits on top of Azure Data Lake. These files are stored in a read-optimized format, and the process is fully managed, enabled by a checkbox and a few options. Because the data is replicated, there is no impact on transactional workloads against Cosmos DB, but there is a delay: replication can currently take up to five minutes, though in practice it is much faster.
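
The analytical store is enabled per container (Synapse Link must also be enabled on the Cosmos DB account first). A minimal sketch with the azure-cosmos SDK, using hypothetical account, database, and container names; an analytical_storage_ttl of -1 keeps the analytical copy indefinitely:

```python
from azure.cosmos import CosmosClient, PartitionKey

# Hypothetical endpoint, key, and names; Synapse Link must already be enabled on the account.
cosmos = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = cosmos.create_database_if_not_exists("retail")

# analytical_storage_ttl=-1 enables the analytical store and retains data indefinitely.
container = database.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/customerId"),
    analytical_storage_ttl=-1,
)
```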

CosmosDb OLTP

The analytical storage is decoupled from the analytical compute systems, so as other compute options become available, the data doesn’t need to be replicated again. This also allows for multiple usage scenarios, like Apache Spark Structured Streaming or traditional data warehousing. Azure Synapse also provides a serverless SQL compute engine that can read the replicated data.
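
Inside a Synapse Spark notebook, reading the replicated data looks roughly like this; the linked service and container names are assumptions:

```python
# Runs inside an Azure Synapse Spark notebook, where `spark` is already provided.
# "CosmosDbLinkedService" and "orders" are hypothetical names.
orders_df = (
    spark.read.format("cosmos.olap")
    .option("spark.synapse.linkedService", "CosmosDbLinkedService")
    .option("spark.cosmos.container", "orders")
    .load()
)

# Analytical queries run against the column store, not the transactional request units.
orders_df.groupBy("customerId").count().show()
```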

Previously, these features were possible but required the use of Cosmos DB change feed or direct queries to move the data to another store. With Azure Synapse Link for Azure Cosmos DB, analytics has gone serverless and cloud-native!

I just returned from Microsoft BUILD 2019 where I presented a session on Azure Kubernetes Services (AKS) and Cosmos. Thanks to everyone who attended. We had excellent attendance – the room was full! I like to think that the audience was there for the speaker 😊 but I’m sure the audience interest is a clear reflection of how popular AKS and Cosmos DB are becoming.

For those looking for a 2-minute overview, here it is:

In a nutshell, the focus was to discuss combining a cloud-native service (like AKS) with a managed database (like Cosmos DB).

Microsoft Build Session Architecting Cloud-Native Apps with AKS and Cosmos DB Slide Deck

We started with a discussion of cloud-native apps, along with a quick introduction to AKS and Cosmos DB. We quickly transitioned into stateful app considerations and talked about new stateful capabilities in Kubernetes, including PV, PVC, StatefulSets, CSI, and Operators. While these capabilities represent significant progress, they still don’t match up with fully managed external services like Cosmos DB.

Microsoft Build Session Architecting Cloud-Native Apps with AKS and Cosmos DB Slide Deck

Microsoft Build Session Architecting Cloud-Native Apps with AKS and Cosmos DB Slide Deck Cloud Native Tooling

One option is to use the Open Service Broker. It allows Kubernetes-hosted services to talk to external services using cloud-native tooling like svcat (Service Catalog).

Microsoft Build Session Architecting Cloud-Native Apps with AKS and Cosmos DB Slide Deck svcat

Microsoft Build Session Architecting Cloud-Native Apps with AKS and Cosmos DB Slide Deck SRE

External services like Cosmos DB can go beyond cluster-level SRE and offer, in essence, “turn-key” SRE: specifically, geo-replication, API-based scaling, and even multi-master writes (eliminating the need to fail over).

Microsoft Build Session Architecting Cloud-Native Apps with AKS and Cosmos DB Slide Deck Multi-Master Support

Microsoft Build Session Architecting Cloud-Native Apps with AKS and Cosmos DB Slide Deck Configure Regions

Microsoft Build Session Architecting Cloud-Native Apps with AKS and Cosmos DB Slide Deck Portability

Since the Open Service Broker is an open specification, your app remains mostly portable even when you move from one cloud provider to another. Open Service Broker does not deal with syntactic differences, such as connection string prefix differences between cloud providers. One way to handle these differences is to use Helm.

Learn more about my BUILD session:

Here you can find the complete recording of the session and slide deck: https://mybuild.techcommunity.microsoft.com/sessions/77138?source=sessions#top-anchor

Additionally, you can find the code for the sample I used here: https://github.com/vlele/build2019 

Recently I had an opportunity to sit down with Steve Michelotti, Program Manager on the AzureGov team, and talk about a Machine Learning (ML) application we built for a federal agency. This application is a great example of how AIS leverages the latest innovations on the AzureGov platform to build applications that align with agencies’ missions – and go beyond IT support to directly assist in meeting the mission objectives.

Specifically, this application was designed to help analysts get personalized recommendations (based on their own preference settings and ratings provided by their co-workers) for stories they need to analyze as part of their daily work.

Brent Wodicka from AIS described this application in an earlier blog post. Read More…

We took Rimma Nehme’s excellent demo from BUILD 2017 and recreated it for AzureGov.

In a nutshell, we took the Marvel Universe Social Database and loaded it in Azure Cosmos DB as a graph database. Then we built a simple web page that invoked Gremlin queries against Cosmos DB.
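
The Gremlin calls behind such a page look roughly like the sketch below, using the gremlin_python driver; the account, database, graph, and traversal details are stand-ins rather than the actual demo code:

```python
from gremlin_python.driver import client, serializer

# Hypothetical Cosmos DB Gremlin endpoint, database, graph, and key.
gremlin = client.Client(
    "wss://<account>.gremlin.cosmos.azure.com:443/",
    "g",
    username="/dbs/marvel/colls/characters",
    password="<primary-key>",
    message_serializer=serializer.GraphSONSerializersV2d0(),
)

# Find characters appearing in the same comics as a given hero (illustrative query).
results = gremlin.submit(
    "g.V().has('character', 'name', name).out('appearedIn').in('appearedIn').values('name')",
    {"name": "Spider-Man"},
).all().result()
print(results[:10])
```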

The key theme of this demo is the ease with which you can create a globally distributed database that can support low latency queries against the Marvel Universe graph database. In the context of AzureGov (as shown below), we can seamlessly replicate the data across the three AzureGov regions by clicking on these regions within the Azure portal.

Here’s a quick look at the demo: