Recently I was working on a project for a DoD client looking to move multiple, siloed on-premises data workloads to the cloud as part of an Azure capabilities proof of concept. The client needed to upload large amounts of data, create Parquet files, perform ad hoc analysis in Jupyter notebooks, and make the data available to query with SQL for a reporting dashboard. A single dataset was roughly a terabyte, and the full collection was measured in petabytes.

After completing the analysis, we had many files staged in a data lake. Given the considerable amount of data we expected in the future, we didn't want to pay to store it twice, spread across multiple databases. Instead, we opted to use Azure Synapse and PolyBase to query the Parquet files in the data lake directly through external tables[i]. We ended up with the following data processing flow:

[Diagram: Azure Synapse data processing flow]
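Once an external table is defined over the staged Parquet files, analysts and the reporting dashboard can query it with ordinary T-SQL. A quick illustration (the table and column names here are purely hypothetical):

-- Query staged Parquet data through an external table just like a regular table
SELECT TOP (100)
  sensorId,
  reading
FROM dbo.SensorReadings
WHERE isValid = 1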

When setting up the Parquet files to be queried as external tables, some of the files had many fields (200+), which made hand-writing the table definitions error-prone and tedious. In addition, the nature of the project meant new tables were created frequently. We looked for an existing way to create the external tables automatically but couldn't find one, so it was up to me to build a simple solution that would work in our current environment without adding additional resources.

The solution was to leverage Data Factory's Get Metadata activity[ii] while moving the data into the staging directories. The activity returns the schema of the Parquet files as a JSON string. I then pass this JSON schema to a stored procedure on the Synapse dedicated SQL pool, which parses the JSON and inserts the fields into a table used later on:

-- Parse the JSON schema from the Get Metadata activity and stage one row per field
WITH Fields (fieldOrder, fieldName, fieldType) AS (
  SELECT
    [key] AS fieldOrder,
    JSON_VALUE([value], 'strict $.name') AS fieldName,
    JSON_VALUE([value], 'strict $.type') AS fieldType
  FROM
    OPENJSON(@schema)
)
INSERT INTO
  tables_to_create(
    tableName,
    fieldOrder,
    fieldName,
    fieldType,
    translatedType,
    executeTime
  )
SELECT
  @tableName,
  fieldOrder,
  fieldName,
  fieldType,
  -- Map the type names reported by Get Metadata to Synapse SQL types
  translatedType = CASE
    WHEN fieldType = 'Single' THEN 'real'
    WHEN fieldType = 'Boolean' THEN 'bit'
    WHEN fieldType = 'Double' THEN 'float'
    WHEN fieldType = 'Int64' THEN 'bigint'
    ELSE NULL
  END,
  @ExecTime
FROM
  Fields
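To make the inputs concrete, here is a sketch of what the Get Metadata activity's structure output looks like and how it might be handed to the stored procedure. The procedure name, table name, folder path, and columns are all illustrative, not from the project:

-- Illustrative "structure" JSON from Get Metadata: an array of
-- { "name": ..., "type": ... } objects, one per Parquet column
DECLARE @exampleSchema NVARCHAR(MAX) = N'[
  { "name": "sensorId", "type": "Int64" },
  { "name": "reading",  "type": "Double" },
  { "name": "isValid",  "type": "Boolean" }
]'

-- Hypothetical wrapper procedure containing the snippets shown in this post
EXEC dbo.CreateStagedExternalTable
  @tableName  = 'SensorReadings',
  @folderPath = 'staged/sensor_readings',
  @schema     = @exampleSchema,
  @ExecTime   = '2021-06-01 12:00:00'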

The procedure then builds a SQL command that checks whether the table exists and creates it if it doesn't:

SET
  @sqlCommand = 'IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N''[dbo].[' + @tableName + ']'') AND type in (N''U''))
  CREATE EXTERNAL TABLE [dbo].[' + @tableName + '] (' 

-- Append one column definition per staged field, in the recorded order
WHILE((SELECT COUNT(*) FROM tables_to_create WHERE executeTime = @ExecTime) > 0)
BEGIN
  DECLARE @key int
  SELECT
    @key = MIN(fieldOrder)
  FROM
    tables_to_create
  WHERE
    executeTime = @ExecTime
    
  DECLARE @fieldName VARCHAR(50)  
  DECLARE @translatedType VARCHAR(50)

  SELECT
    @fieldName = fieldName,
    @translatedType = translatedType
  FROM
    tables_to_create
  WHERE
    fieldOrder = @key
    AND executeTime = @ExecTime

  SET
    @sqlCommand = @sqlCommand + '
          [' + @fieldName + '] [' + @translatedType + '] NULL'

  -- Remove the processed field so the loop advances to the next one
  DELETE FROM
    tables_to_create
  WHERE
    fieldOrder = @key
    AND executeTime = @ExecTime

  IF((SELECT COUNT(*) FROM tables_to_create WHERE executeTime = @ExecTime) > 0)
    SET
      @sqlCommand = @sqlCommand + ', '
END

-- Close the column list and point the external table at the staged folder
SET
  @sqlCommand = @sqlCommand + '
  )
  WITH
  (
    LOCATION = ''/' + @folderPath + ''',
    DATA_SOURCE = DataLakeStaged,
    FILE_FORMAT = StagedParquet
  )'
  
EXEC(@sqlCommand)
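For the illustrative three-column schema shown earlier, the dynamic SQL the procedure builds and executes would come out roughly like this:

IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[SensorReadings]') AND type in (N'U'))
CREATE EXTERNAL TABLE [dbo].[SensorReadings] (
      [sensorId] [bigint] NULL,
      [reading] [float] NULL,
      [isValid] [bit] NULL
  )
  WITH
  (
    LOCATION = '/staged/sensor_readings',
    DATA_SOURCE = DataLakeStaged,
    FILE_FORMAT = StagedParquet
  )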

This frees the analysts from manually creating the external tables and from knowing the Data Factory mapping that points to the correct location in the data lake. The only thing they need to watch is that the name and path conventions we set up for syncing don't land different schemas in the same folder.
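The generated statement does assume a few objects already exist on the Synapse dedicated pool: the external data source and file format it references, and the staging table the stored procedure writes to. A minimal sketch of that one-time setup, with the storage account, credential, and column sizes as placeholder assumptions:

-- External data source pointing at the data lake (account, container, and credential are placeholders)
CREATE EXTERNAL DATA SOURCE DataLakeStaged
WITH (
  TYPE = HADOOP,
  LOCATION = 'abfss://staged@mydatalake.dfs.core.windows.net',
  CREDENTIAL = DataLakeCredential
);

-- Parquet file format referenced by the generated tables
CREATE EXTERNAL FILE FORMAT StagedParquet
WITH (
  FORMAT_TYPE = PARQUET,
  DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

-- Staging table the stored procedure writes field metadata into
CREATE TABLE tables_to_create (
  tableName      VARCHAR(255) NOT NULL,
  fieldOrder     INT          NOT NULL,
  fieldName      VARCHAR(50)  NOT NULL,
  fieldType      VARCHAR(50)  NULL,
  translatedType VARCHAR(50)  NULL,
  executeTime    DATETIME2    NOT NULL
);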

Resources

[i] https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=sql-pool#external-tables-in-dedicated-sql-pool-and-serverless-sql-pool

[ii] https://docs.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
