Recently I was working on a project for a DoD client looking to move multiple siloed, on-premises data workloads to the cloud as part of an Azure capabilities proof of concept. The client needed to upload large amounts of data, create Parquet files, perform ad hoc analysis in Jupyter notebooks, and make the data available to query with SQL for a reporting dashboard. A single dataset was roughly a terabyte, with the full collection measured in petabytes.

After completing the analysis, we ended up with many files staged in a data lake. Given the considerable amount of data we expected in the future, we didn't want to pay to store it twice across multiple databases, so we opted to use Azure Synapse and PolyBase to query the Parquet files in the data lake directly through external tables[i]. We ended up with the following data processing flow:

[Diagram: data processing flow into Azure Synapse]
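Once a staged file is exposed as an external table, the reporting layer can query it like any other table in the pool. A trivial example against a hypothetical table name (not one from the project):

-- Hypothetical external table; the query reads the parquet files in the lake directly.
SELECT TOP 100 *
FROM [dbo].[staged_sensor_readings]
WHERE [is_valid] = 1;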

When setting up the Parquet files to be queried as external tables, some of them had many fields (200+), which led to numerous errors and quickly became tedious. In addition, due to the nature of the project, new tables were created frequently. To avoid building them by hand, we looked for a way to create the external tables automatically but couldn't find an existing solution, so it was up to me to build a simple one that would work in our current environment without adding more resources.

The solution was to leverage Data Factory's Get Metadata activity[ii] while moving the data into the staging directories. The activity returns the schema of a Parquet file as a JSON string. I pass this JSON schema to a stored procedure on the Synapse dedicated SQL pool, which parses the JSON and inserts the fields into a table I can use later:

-- @schema, @tableName and @ExecTime are parameters of the stored procedure,
-- passed in from the Data Factory pipeline.
WITH Fields (fieldOrder, fieldName, fieldType) AS (
  SELECT
    [key] AS fieldOrder,
    JSON_VALUE([value], 'strict $.name') AS fieldName,
    JSON_VALUE([value], 'strict $.type') AS fieldType
  FROM
    OPENJSON(@schema)
)
INSERT INTO
  tables_to_create(
    tableName,
    fieldOrder,
    fieldName,
    fieldType,
    translatedType,
    executeTime
  )
SELECT
  @tableName,
  fieldOrder,
  fieldName,
  fieldType,
  -- Map the Data Factory interim types to Synapse SQL types
  CASE
    WHEN fieldType = 'Single' THEN 'real'
    WHEN fieldType = 'Boolean' THEN 'bit'
    WHEN fieldType = 'Double' THEN 'float'
    WHEN fieldType = 'Int64' THEN 'bigint'
    ELSE NULL
  END AS translatedType,
  @ExecTime
FROM
  Fields
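To make the shape of the input concrete, here is a minimal sketch of the pieces around that INSERT. The tables_to_create definition and the sample schema string are illustrative assumptions on my part; the Get Metadata structure output is a JSON array of name/type pairs, and only the column names are taken from the snippet above.

-- Illustrative only: a staging table shaped to match the columns used above.
CREATE TABLE tables_to_create (
  tableName      VARCHAR(255),
  fieldOrder     INT,
  fieldName      VARCHAR(255),
  fieldType      VARCHAR(50),
  translatedType VARCHAR(50),
  executeTime    DATETIME2
);

-- A hypothetical schema string in the shape Get Metadata produces
-- (an array of name/type pairs), parsed the same way the CTE above does.
DECLARE @schema NVARCHAR(MAX) = N'[
  {"name": "sensor_id", "type": "Int64"},
  {"name": "reading",   "type": "Double"},
  {"name": "is_valid",  "type": "Boolean"}
]';

SELECT
  [key] AS fieldOrder,
  JSON_VALUE([value], 'strict $.name') AS fieldName,
  JSON_VALUE([value], 'strict $.type') AS fieldType
FROM
  OPENJSON(@schema);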

Then we build a SQL command that checks for the table's existence and creates it if it doesn't exist:

-- Start the dynamic statement: create the external table only if it doesn't already exist.
-- (@sqlCommand and @folderPath are declared elsewhere in the procedure.)
SET
  @sqlCommand = 'IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N''[dbo].[' + @tableName + ']'') AND type in (N''U''))
  CREATE EXTERNAL TABLE [dbo].[' + @tableName + '] (' 

-- Walk the staged fields in order, appending one column definition per pass
WHILE((SELECT COUNT(*) FROM tables_to_create WHERE executeTime = @ExecTime) > 0)
BEGIN
  -- Grab the next field for this execution, in the order it appears in the parquet schema
  DECLARE @key int
  SELECT
    @key = MIN(fieldOrder)
  FROM
    tables_to_create
  WHERE
    executeTime = @ExecTime

  DECLARE @fieldName VARCHAR(50)
  DECLARE @translatedType VARCHAR(50)

  SELECT
    @fieldName = fieldName,
    @translatedType = translatedType
  FROM
    tables_to_create
  WHERE
    fieldOrder = @key
    AND executeTime = @ExecTime

  -- Append the column definition to the CREATE EXTERNAL TABLE statement
  SET
    @sqlCommand = @sqlCommand + '
          [' + @fieldName + '] [' + @translatedType + '] NULL'

  -- Remove the processed field so the loop terminates
  DELETE FROM
    tables_to_create
  WHERE
    fieldOrder = @key
    AND executeTime = @ExecTime

  -- Add a separator if there are more columns to append
  IF((SELECT COUNT(*) FROM tables_to_create WHERE executeTime = @ExecTime) > 0)
    SET
      @sqlCommand = @sqlCommand + ', '
END

-- Close the column list and point the external table at the staged folder in the data lake
SET
  @sqlCommand = @sqlCommand + '
  )
  WITH
  (
    LOCATION = ''/' + @folderPath + ''',
    DATA_SOURCE = DataLakeStaged,
    FILE_FORMAT = StagedParquet
  )'

EXEC(@sqlCommand)
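The DataLakeStaged data source and StagedParquet file format referenced in the WITH clause have to exist before the dynamic statement runs. A rough sketch of how they might be defined on a dedicated SQL pool follows; the storage account, container, and credential names are placeholders, not values from the project:

-- Illustrative prerequisites; names other than DataLakeStaged and StagedParquet are placeholders.
CREATE DATABASE SCOPED CREDENTIAL StagedStorageCredential
WITH IDENTITY = 'Managed Service Identity';

CREATE EXTERNAL DATA SOURCE DataLakeStaged
WITH (
  TYPE = HADOOP,
  LOCATION = 'abfss://staged@<storageaccount>.dfs.core.windows.net',
  CREDENTIAL = StagedStorageCredential
);

CREATE EXTERNAL FILE FORMAT StagedParquet
WITH (
  FORMAT_TYPE = PARQUET,
  DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);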

This frees analysts from having to create the external tables manually or know how the Data Factory mapping points to the correct location in the data lake. The only thing analysts need to watch is that the naming and path conventions we set up for syncing don't land different schemas in the same folder.
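Putting it together, the Stored Procedure activity in the pipeline ends up making a call along these lines. The procedure name and sample values below are placeholders, but the parameters mirror the variables used in the snippets above:

-- Hypothetical procedure name; @schema comes from the Get Metadata activity's
-- structure output, while @tableName and @folderPath come from the dataset
-- being staged. @ExecTime groups the staged rows for a single run.
EXEC dbo.create_external_table_from_schema
  @tableName  = 'sensor_readings',
  @folderPath = 'staged/sensor_readings',
  @schema     = N'[{"name":"sensor_id","type":"Int64"},{"name":"reading","type":"Double"}]',
  @ExecTime   = '2021-01-01T00:00:00';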

Resources

[i] https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=sql-pool#external-tables-in-dedicated-sql-pool-and-serverless-sql-pool

[ii] https://docs.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
