Recently I was working on a project for a DoD client looking to move multiple siloed, on-premises data workloads to the cloud as part of an Azure capabilities proof of concept. The client needed to upload large amounts of data, create Parquet files, perform ad-hoc analysis using Jupyter notebooks, and make the data available to query with SQL for a reporting dashboard. A single dataset was approximately a terabyte, with all the data together measured in petabytes.

After completing the analysis, we ended up with many files staged in a data lake. Given the considerable amount of data we were expecting in the future, we didn't want to pay to store this data twice across multiple databases. We opted to take advantage of Azure Synapse and PolyBase to query the Parquet files in the data lake directly using external tables[i]. We ended up with the following data processing flow:

[Diagram: data processing flow into Azure Synapse]
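External tables in a dedicated SQL pool need an external data source and a file format before they can read Parquet from the lake. The sketch below uses the object names referenced later in this article (DataLakeStaged and StagedParquet); the storage URL is a placeholder, and depending on how your pool authenticates to storage a database-scoped credential may also be required, so treat this as an illustration rather than our exact setup:

-- External data source pointing at the staged area of the data lake.
-- The abfss URL is a placeholder for your storage account and container.
CREATE EXTERNAL DATA SOURCE DataLakeStaged
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://staged@yourdatalake.dfs.core.windows.net'
);

-- Parquet file format used by the generated external tables.
CREATE EXTERNAL FILE FORMAT StagedParquet
WITH (
    FORMAT_TYPE = PARQUET
);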

When setting up the Parquet files to be queried as external tables, some of the files had many fields (200+), which led to numerous errors and made writing the table definitions by hand very tedious. In addition, due to the nature of the project, new tables were created frequently. To avoid creating the tables manually, we looked for an existing solution to create the external tables automatically but could not find one. So it was up to me to create a simple solution that would work in our current environment without adding additional resources.

The solution was to leverage Data Factory's Get Metadata activity[ii] while moving the data into the staging directories. The activity gives us the schema of the Parquet files as a JSON string. I then take this JSON schema and pass it to a stored procedure on the Synapse dedicated SQL pool that parses the JSON and inserts the fields into a table I can use later:

-- Parse the JSON schema from the Get Metadata activity into one row per field.
WITH Fields (fieldOrder, fieldName, fieldType) AS (
  SELECT
    [key] AS fieldOrder,
    JSON_VALUE([value], 'strict $.name') AS fieldName,
    JSON_VALUE([value], 'strict $.type') AS fieldType
  FROM
    OPENJSON(@schema)
)
INSERT INTO
  tables_to_create(
    tableName,
    fieldOrder,
    fieldName,
    fieldType,
    translatedType,
    executeTime
  )
SELECT
  @tableName,
  fieldOrder,
  fieldName,
  fieldType,
  -- Map the .NET-style type names reported by Data Factory to SQL types.
  translatedType = CASE
    WHEN fieldType = 'Single' THEN 'real'
    WHEN fieldType = 'Boolean' THEN 'bit'
    WHEN fieldType = 'Double' THEN 'float'
    WHEN fieldType = 'Int64' THEN 'bigint'
    ELSE NULL
  END,
  @ExecTime
FROM
  Fields
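For context, the schema string the Get Metadata activity hands to the procedure is a JSON array of name/type pairs using .NET-style type names. Below is a minimal sketch of what a call could look like; the field names, table name, and the procedure name usp_StageExternalTable are made up for illustration:

-- Hypothetical example of the @schema JSON produced by the Get Metadata activity.
DECLARE @schema NVARCHAR(MAX) = N'[
  { "name": "sensorId", "type": "Int64"   },
  { "name": "reading",  "type": "Double"  },
  { "name": "isValid",  "type": "Boolean" }
]';

-- Hypothetical call passing the schema, target table name, and execution time.
EXEC dbo.usp_StageExternalTable
  @schema = @schema,
  @tableName = 'SensorReadings',
  @ExecTime = '2021-06-01T12:00:00';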

Then we build a SQL command that checks for the existence of the table and creates it if it doesn't exist:

SET
  @sqlCommand = 'IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N''[dbo].[' + @tableName + ']'') AND type in (N''U''))
  CREATE EXTERNAL TABLE [dbo].[' + @tableName + '] (' 

-- Walk the staged fields in order, appending one column definition per pass.
DECLARE @key int
DECLARE @fieldName VARCHAR(50)
DECLARE @translatedType VARCHAR(50)

WHILE((SELECT COUNT(*) FROM tables_to_create WHERE executeTime = @ExecTime) > 0)
BEGIN
  -- Next field to append, in the order it appeared in the schema.
  SELECT
    @key = MIN(fieldOrder)
  FROM
    tables_to_create
  WHERE
    executeTime = @ExecTime

  SELECT
    @fieldName = fieldName,
    @translatedType = translatedType
  FROM
    tables_to_create
  WHERE
    fieldOrder = @key
    AND executeTime = @ExecTime

  SET
    @sqlCommand = @sqlCommand + '
          [' + @fieldName + '] [' + @translatedType + '] NULL'

  -- Remove the processed field so the loop terminates.
  DELETE FROM
    tables_to_create
  WHERE
    fieldOrder = @key
    AND executeTime = @ExecTime

  -- Add a separator unless this was the last field.
  IF((SELECT COUNT(*) FROM tables_to_create WHERE executeTime = @ExecTime) > 0)
    SET
      @sqlCommand = @sqlCommand + ', '
END

SET
  @sqlCommand = @sqlCommand + '
  )
  WITH
  (
    LOCATION = ''/' + @folderPath + ''',
    DATA_SOURCE = DataLakeStaged,
    FILE_FORMAT = StagedParquet
  )'
  
EXEC(@sqlCommand)
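For the hypothetical three-field schema shown earlier, the generated command would look roughly like this (the table name and folder path are again made up):

IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[SensorReadings]') AND type in (N'U'))
CREATE EXTERNAL TABLE [dbo].[SensorReadings] (
      [sensorId] [bigint] NULL,
      [reading] [float] NULL,
      [isValid] [bit] NULL
)
WITH
(
  LOCATION = '/staged/sensor-readings',
  DATA_SOURCE = DataLakeStaged,
  FILE_FORMAT = StagedParquet
)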

This frees the analysts from manually creating the external tables and from knowing the Data Factory mapping that points to the correct location in the data lake. The analysts only need to make sure that the naming and path conventions we set up for syncing don't land different schemas in the same folder.

Resources

[i] https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=sql-pool#external-tables-in-dedicated-sql-pool-and-serverless-sql-pool

[ii] https://docs.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity

Bridging the gap between serverless OLTP and Analytics

Cosmos DB is unabashedly a document-based Online Transactional Processing (OLTP) system. It was engineered at its core to provide low-latency, high-throughput transactions, with service level agreements (SLAs), consistency models, and guarantees to back it. It does this exceptionally well, but there are trade-offs for these architectural decisions. Things other databases handle well, such as complex queries with joins, aggregates, or set-based operations, either have no native solution in Cosmos DB or are too resource prohibitive by design. Capabilities in Cosmos DB continue to evolve, but at a certain point these workloads require another solution that is better suited to the challenge. One of these gaps has now been bridged with Azure Synapse Link for Cosmos DB.

Azure Synapse is Microsoft's consolidated data analytics platform that brings together data ingestion, transformation, machine learning (ML) training, testing, management, security, monitoring, and visualization in one place. With Synapse Link, Cosmos DB data can participate in this ecosystem.

Under the hood, Cosmos DB data is replicated from its row-based transactional store to a column-based analytical store that sits on top of Azure Data Lake. The files are stored in a read-optimized format, and the process is fully managed and enabled by a checkbox and a few options. Because the data is replicated, there is no impact on transactional workloads against Cosmos DB, but there is a delay: replication is documented to take up to five minutes, though in practice it is usually much faster.

[Diagram: Cosmos DB OLTP store replicated to the analytical store]

The analytical storage is decoupled from the analytical compute systems, so as other compute options become available, the data doesn't need to be replicated again. This also allows for multi-use scenarios like Apache Spark Structured Streaming or traditional data warehousing. Azure Synapse also provides a serverless SQL compute engine that can read the replicated data.
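As a sketch of what that looks like, a serverless SQL pool can read the analytical store with OPENROWSET; the account, database, key, and container names below are placeholders:

-- Query the Cosmos DB analytical store from a Synapse serverless SQL pool.
-- Account, database, key, and container names are placeholders.
SELECT TOP 10 *
FROM OPENROWSET(
    'CosmosDB',
    'Account=yourCosmosAccount;Database=yourDatabase;Key=<account-key>',
    YourContainer
) AS documents;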

Previously, these scenarios were possible but required using the Cosmos DB change feed or direct queries to move the data to another store. With Azure Synapse Link for Azure Cosmos DB, analytics has gone serverless and cloud-native!