Databricks provides a robust notebook environment that is excellent for ad-hoc and interactive access to data. However, it lacks the software development tooling of a full IDE. Databricks Connect and Visual Studio (VS) Code can help bridge that gap. Once configured, you can use VS Code tooling such as source control, linting, and your other favorite extensions while still harnessing the power of your Databricks Spark clusters.

Configure Databricks Cluster

Your Databricks cluster must be configured to allow connections.

  1. In the Databricks UI, edit your cluster and add these lines to the Spark config:
    spark.databricks.service.server.enabled true
    spark.databricks.service.port 8787
  2. Restart the cluster.

Configure Local Development Environment

The following instructions are for Windows, but the tooling is cross-platform and will work wherever Java, Python, and VSCode will run.

  • Install Java JDK 8 (enable option to set JAVA_HOME) – https://adoptopenjdk.net/
  • Install Miniconda (3.7, default options) – https://docs.conda.io/en/latest/miniconda.html
  • From the Miniconda prompt, run the following commands and follow the prompts. Note: the Python and databricks-connect versions must match the cluster, so replace {version-number} accordingly in each command (e.g. Python 3.7 and databricks-connect 7.3.* for Databricks Runtime 7.3).
    ``` cmd
    conda create --name dbconnect python={version-number}
    conda activate dbconnect
    pip install -U databricks-connect=={version-number}
    databricks-connect configure
    ```
  • Download http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe to C:\Hadoop
    From a command prompt, run:

    ``` cmd
    setx HADOOP_HOME "C:\Hadoop\" /M
    ```
  • Test Databricks Connect. In the Miniconda prompt, run:
    ``` cmd
    databricks-connect test
    ```

You should see "* All tests passed." if everything is configured correctly.

  • Install VS Code and the Python extension
    • https://code.visualstudio.com/docs/python/python-tutorial
    • Open a Python file and select the "dbconnect" interpreter in the lower toolbar of VS Code
  • Activate the conda environment in the VS Code cmd terminal
    From the VS Code command prompt, run the following once (replace {username} with your username):
    ``` cmd
    C:\Users\{username}\Miniconda3\Scripts\conda init cmd.exe
    ```
    Then open a new cmd terminal and run:
    ``` cmd
    conda activate dbconnect
    ```
    Optional: run `databricks-connect test` again to ensure the Databricks Connect library is configured and working within VS Code.

Run Spark commands on Databricks cluster

You now have VS Code configured with Databricks Connect running in a Python conda environment. You can use the code below to get a Spark session; any dependent code will be submitted to the Spark cluster for execution.

``` python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
```
Once a session is established, you can interactively send commands to the cluster by selecting them and choosing "Run Selection/Line in Python Interactive Window" from the right-click menu, or by pressing Shift+Enter.
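For example, a small snippet like the one below (a minimal sketch; the data is generated in place, so nothing else is assumed) can be selected and run, with the actual work happening on the remote cluster:

``` python
# Build a million-row DataFrame and run an aggregate. The job is submitted
# through Databricks Connect and executes on the Databricks cluster.
df = spark.range(1_000_000).selectExpr("id", "id % 10 AS bucket")
df.groupBy("bucket").count().show()
```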

The results of commands executed on the cluster are displayed in the Visual Studio Code terminal. Commands can also be executed from the command line window.


Summary

To recap, we set up a Python virtual environment with Miniconda and installed the dependencies required to run Databricks Connect. We configured Databricks Connect to talk to our hosted Azure Databricks cluster and set up Visual Studio Code to use the conda command prompt to execute code remotely. Now that you can develop locally in VS Code, all of its developer tooling can be used to build a more robust and developer-centric solution.

Bridging the gap between serverless OLTP and Analytics

Cosmos DB is unabashedly a document-based Online Transactional Processing (OLTP) system. It was engineered at its core to provide low-latency, high-throughput transactions, with service level agreements (SLAs), consistency models, and guarantees to back it. It does this exceptionally well, but there are trade-offs for these architectural decisions. Workloads that other databases handle well, such as complex queries with joins and aggregates or set-based operations, either have no native solution in Cosmos DB or are too resource prohibitive by design. Capabilities in Cosmos DB continue to evolve, but at a certain point you need another solution that is better suited to these challenges. One of these gaps has now been bridged with Azure Synapse Link for Cosmos DB.

Azure Synapse is Microsoft's consolidated data analytics platform that brings together data ingestion, transformation, machine learning (ML) training, testing, management, security, monitoring, and visualization in one place. With Synapse Link, Cosmos DB data can participate in this ecosystem.

Under the hood, Cosmos DB data is replicated from its row-based transactional store to a column-based analytical store that sits on top of Azure Data Lake. These files are stored in a read-optimized format, and the process is fully managed, enabled by a checkbox and a few options. Because the data is replicated, there is no impact on transactional workloads against Cosmos DB, but there is a delay: replication is documented to take up to 5 minutes, though in practice it is usually much lower.


The analytical storage is decoupled from the analytical compute engines, so as other compute options become available, the data does not need to be replicated again. This also allows for multi-use scenarios like Apache Spark Structured Streaming or traditional data warehousing. Azure Synapse also provides a serverless SQL compute engine that can read the replicated data.
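As a rough sketch of what this looks like from a Synapse Spark notebook, the analytical store can be read with the `cosmos.olap` format. The linked service name (`CosmosDbLinkedService`) and container name (`customers`) below are placeholders for whatever is configured in your Synapse workspace:

``` python
# Read the Cosmos DB analytical store (column store) from a Synapse Spark pool.
# The linked service and container names are placeholders.
df = (spark.read
      .format("cosmos.olap")
      .option("spark.synapse.linkedService", "CosmosDbLinkedService")
      .option("spark.cosmos.container", "customers")
      .load())

df.printSchema()
print(df.count())  # analytical query with no impact on the transactional workload
```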

Previously, these features were possible but required the use of Cosmos DB change feed or direct queries to move the data to another store. With Azure Synapse Link for Azure Cosmos DB, analytics has gone serverless and cloud-native!

Relational database source control, versioning, and deployments have been notoriously challenging. Each instance of the database (Dev, Test, Production) can contain different data, may be upgraded at different times, and is generally not in a consistent state. This is known as database drift.

Traditional Approach and Challenges

Traditionally, to move changes between instances, a one-off "state-based" comparison is done, either directly between databases or against a common state like a SQL Server Data Tools project. This yields a script that has no direct context for the changes being deployed and requires a tremendous review effort to ensure that only the intent of the changes being promoted or rolled back is included. This challenge sometimes leads to practices like backing up a "known good copy" (i.e., production) and restoring it to lower tiers. For all but the smallest applications and teams, this raises even more challenges, like data governance and the logistics around test data. These patterns can be automated, but they generally do not embrace the spirit of continuous integration and DevOps.

State Based

For example, the three changes above could be adding a column, adding data to a temporary table, and then populating the new column with the data from the temporary table. In this scenario it isn't only important that a new column was added, but also how the data was added. The context of the change is lost, and trying to derive it from the final state of the database is too late in the process.

DevOps Approach

Architecturally, application persistence (a database) is an aspect or detail of an application, so we should treat it as part of the application. We use continuous integration builds to compile source code into artifacts and promote them through environments. Object-Relational Mapping (ORM) frameworks like Entity Framework and Ruby on Rails paved the way, out of necessity, with a "migrations" change-based approach. The same concept can be used for just the schema with projects like FluentMigrator. At development time, the schema changes to upgrade and roll back are expressed in the framework or in scripted DDL and captured in source control. They are compiled and included in the deployment artifact. When the application is deployed against a target database, it identifies the current version and applies any changes up or down sequentially to provide deterministic version compatibility. The application is in control of the persistence layer, not the other way around. It also forces developers to work through the logistics (operations) of applying the change. This is the true essence of DevOps.
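To make the mechanics concrete, here is a minimal sketch of a migration runner in Python (using SQLite purely for illustration; the table names and scripts are hypothetical): a version table records what has been applied, and pending changes are applied in order.

``` python
import sqlite3

# Ordered list of (version, upgrade DDL). In a real project these scripts live
# in source control and ship inside the deployment artifact.
MIGRATIONS = [
    (1, "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE customers ADD COLUMN email_address TEXT"),
]

def migrate(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version, ddl in MIGRATIONS:
        if version > current:  # apply only the pending changes, in order
            conn.execute(ddl)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
    conn.commit()

if __name__ == "__main__":
    migrate(sqlite3.connect("app.db"))
```

Frameworks like FluentMigrator and Entity Framework migrations follow this same pattern, adding support for rollback scripts and transactional application of each step.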

Migration Scripts

In the same example above, the three changes would be applied to each database in the same sequence, and the intent of the changes would be captured along with them.

Summary

In summary, a migrations-based approach lends itself to a DevOps culture. It may take some additional effort up front to work through and capture how changes should be applied, but it allows all aspects of the database deployment process to be tested throughout a project lifecycle. This promotes repeatability and ultimately the confidence needed to perform frequent releases.

Data lakes have become a mainstay in data analytics architectures. By storing data in its native format, they allow organizations to defer the effort of structuring and organizing data upfront. This promotes data collection and serves as a rich platform for data analytics. Most data lakes are also backed by a distributed file system that enables massively parallel processing (MPP) and scales with even the largest data sets. However, the increase in data privacy regulations and demands on governance requires a new strategy. Simple tasks such as finding, updating, or deleting a record in a data lake can be difficult. They require an understanding of the data and typically involve an inefficient process that includes re-writing the entire data set. This can lead to resource contention and interruptions in critical analytics workloads.

Apache Spark has become one of the most widely adopted data analytics platforms. Earlier this year, its largest contributor, Databricks, open-sourced a library called Delta Lake. Delta Lake solves the problem of resource contention and interruption by creating an optimized, ACID-compliant storage layer that is fully compatible with the Spark API and sits on top of your existing data lake. Files are stored in Parquet format, which makes them portable to other analytics workloads. Optimizations like partitioning, caching, and data skipping are built in, so additional performance gains can be realized over native formats.

DeltaLake is not intended to replace a traditional domain-modeled data warehouse. Rather, it is intended as an intermediate step to loosely structure and collect data. The schema can remain the same as the source system, and personally identifiable data like email addresses, phone numbers, or customer IDs can easily be found and modified. Another important DeltaLake capability is Spark Structured Streaming support for both ingest and data changes. This creates a unified ETL for both stream and batch while helping promote data quality.
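For example, existing Parquet data can be brought under DeltaLake management with a one-time rewrite; the paths and table name below are placeholders:

``` python
# One-time conversion: read existing Parquet files from the data lake and
# rewrite them as a Delta table. Paths are placeholders for your environment.
raw = spark.read.parquet("/mnt/datalake/raw/customers/")

(raw.write
 .format("delta")
 .mode("overwrite")
 .save("/mnt/datalake/delta/customers/"))

# Register the table so it can be queried and maintained with SQL (MERGE, VACUUM, ...).
spark.sql("CREATE TABLE customers USING DELTA LOCATION '/mnt/datalake/delta/customers/'")
```

Delta Lake also provides a `CONVERT TO DELTA` command that converts a Parquet directory in place, which avoids rewriting the files.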

Data Lake Lifecycle

  1. Ingest data directly from the source or from a temporary storage location (e.g., Azure Blob Storage with Lifecycle Management).
  2. Use Spark Structured Streaming or scheduled jobs to load data into DeltaLake table(s) (a minimal streaming sketch follows this list).
  3. Maintain data in DeltaLake tables to keep the data lake in compliance with data regulations.
  4. Perform analytics on files stored in the data lake, using DeltaLake tables in Spark or the underlying Parquet files after they have been put into a consistent state using the `VACUUM` command.
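A minimal sketch of step 2, assuming JSON files land in a staging path and using placeholder paths and schema:

``` python
# Continuously ingest newly landed JSON files into a Delta table.
# Paths and the schema are placeholders for your environment.
raw = (spark.readStream
       .format("json")
       .schema("customerId STRING, email_address STRING, eventTime TIMESTAMP")
       .load("/mnt/ingest/customers/"))

(raw.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/datalake/delta/customers/_checkpoints/ingest")
 .outputMode("append")
 .start("/mnt/datalake/delta/customers/"))
```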

Data Ingestion and Retention

The concept behind data retention is to establish policies that ensure data that cannot be retained is automatically removed as part of the process.
By default, DeltaLake keeps a change history of all data modifications. Two settings control how long history and removed files are kept: `delta.logRetentionDuration` (default: interval 30 days) and `delta.deletedFileRetentionDuration` (default: interval 1 week). For example:

``` sql
%sql
ALTER TABLE table_name SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 240 hours', 'delta.deletedFileRetentionDuration' = 'interval 1 hours');
SHOW TBLPROPERTIES table_name
```
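Old data files are only physically removed when `VACUUM` is run; until then they continue to support time travel within the retention window. A minimal example, run from a Python cell via `spark.sql` and using the same placeholder table name:

``` python
# Physically remove data files that are no longer referenced by the table and
# are older than the retention threshold (here 168 hours = 7 days).
spark.sql("VACUUM table_name RETAIN 168 HOURS")
```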

Load Data in DeltaLake

The key to DeltaLake is a SQL-style `MERGE` statement that is optimized to modify only the affected files. This eliminates the need to reprocess and re-write the entire data set.

``` sql
%sql
MERGE INTO customers
USING updates
ON customers.customerId = updates.customerId
WHEN MATCHED THEN
      UPDATE SET email_address = updates.email_address
WHEN NOT MATCHED THEN
      INSERT (customerId, email_address) VALUES (updates.customerId, updates.email_address)
```
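The same upsert can also be expressed with the DeltaLake Python API. A rough sketch, assuming the updates are available as a DataFrame named `updates_df`:

``` python
from delta.tables import DeltaTable

# Upsert updates_df into the customers Delta table; only the affected files are rewritten.
customers = DeltaTable.forName(spark, "customers")

(customers.alias("customers")
 .merge(updates_df.alias("updates"), "customers.customerId = updates.customerId")
 .whenMatchedUpdate(set={"email_address": "updates.email_address"})
 .whenNotMatchedInsert(values={
     "customerId": "updates.customerId",
     "email_address": "updates.email_address",
 })
 .execute())
```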

Maintain Data in DeltaLake

Just as data can be updated or inserted, it can be deleted as well. For example, if a table of opted-out customers (`opted_out_customers`) is maintained, data in related tables can be purged.

``` sql
%sql
MERGE INTO customers
USING opted_out_customers
ON opted_out_customers.customerId = customers.customerId
WHEN MATCHED THEN DELETE
```

Summary

In summary, Databricks DeltaLake enables organizations to continue to store data in data lakes even when it is subject to privacy and data regulations. With DeltaLake's performance optimizations and open Parquet storage format, data can be easily modified and accessed using familiar code and tooling. For more information, including DeltaLake Python syntax references and examples, see the documentation: https://docs.databricks.com/delta/index.html