Blog

16 April 2024

Building a Robust Data Lakehouse with Medallion Architecture

TL;DR:

This article explores building a robust data lakehouse using the medallion architecture, which organizes data into three layers—Bronze for raw data ingestion, Silver for data transformation, and Gold for optimized data aggregation. Best practices for each layer are outlined, including creating a separate staging area for raw data in the Bronze layer, enforcing data quality checks in the Silver layer, and optimizing query performance in the Gold layer. By following these best practices, organizations can effectively manage data across the data lakehouse, ensuring structured, governed, and optimized data operations.
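
As a rough illustration of the three layers, here is a minimal pure-Python sketch that uses in-memory lists in place of Delta tables. The record fields (`order_id`, `amount`, `country`) and the function names `to_silver` and `to_gold` are hypothetical, and a real lakehouse would run the equivalent steps with Spark on Delta Lake.

```python
from collections import defaultdict

# Hypothetical raw records as they might land in the Bronze layer (stored as-is).
bronze = [
    {"order_id": "1", "amount": "10.5", "country": "NL"},
    {"order_id": "2", "amount": "oops", "country": "NL"},  # invalid amount
    {"order_id": "3", "amount": "4.5", "country": "DE"},
    {"order_id": "1", "amount": "10.5", "country": "NL"},  # duplicate of order 1
]


def to_silver(rows):
    """Silver: enforce types, drop invalid rows, de-duplicate by order_id."""
    seen, clean = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # row fails the data quality check
        if row["order_id"] in seen:
            continue  # keep only the first occurrence of each order
        seen.add(row["order_id"])
        clean.append(
            {
                "order_id": int(row["order_id"]),
                "amount": amount,
                "country": row["country"],
            }
        )
    return clean


def to_gold(rows):
    """Gold: aggregate into a query-ready shape (total amount per country)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Bronze is never modified here; Silver and Gold are derived from it, so each layer can be rebuilt from the one before it.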

Data-Engineering

#Azure Databricks

#Medallion Architecture

#Data Lakehouse

#Multi-hop Architecture

#Delta Lake

#Apache Spark

29 January 2023

Routing Databricks traffic through Azure Firewall

TLDR;

To route all outbound traffic from Databricks clusters through Azure Firewall, create a UDR (User Defined Route) and add it to the subnet where the Databricks clusters are created. Then add a Firewall Policy that allows the trusted traffic. Trusted traffic is mainly the mandatory traffic that Databricks needs to function, plus the traffic to the storage and package repositories that you deem trusted.
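
The default route in such a UDR can be sketched as a small Python helper that builds the route definition in the shape used by Azure Resource Manager route tables. The helper name `build_default_route`, the route name, and the firewall IP `10.0.1.4` are assumptions for illustration; the sketch only builds the payload and does not call Azure.

```python
import ipaddress


def build_default_route(firewall_private_ip: str, name: str = "to-azure-firewall") -> dict:
    """Build a route that sends all outbound traffic to the firewall's private IP."""
    # Raises ValueError if the string is not a valid IP address.
    ipaddress.ip_address(firewall_private_ip)
    return {
        "name": name,  # hypothetical route name
        "properties": {
            "addressPrefix": "0.0.0.0/0",  # match all outbound destinations
            "nextHopType": "VirtualAppliance",  # next hop is the firewall
            "nextHopIpAddress": firewall_private_ip,
        },
    }


# Hypothetical private IP of the Azure Firewall.
route = build_default_route("10.0.1.4")
```

The route table that contains this route then has to be associated with the Databricks subnet for it to take effect.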

Data-Platform

#Azure Databricks

#Azure Firewall

22 January 2023

Databricks IP Access list

TLDR;

An IP access list is one of the ways to network-isolate Azure Databricks. It is a list of IP addresses that are allowed to access Azure Databricks. You can use this list to control access to Azure Databricks from specific IP addresses or ranges of IP addresses.

It is only accessible through the REST API.
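
A minimal sketch of creating an allow list through the REST API, using only the Python standard library. The workspace URL, token placeholder, label, and CIDR range are hypothetical, and the helper `build_ip_access_list_request` is illustrative; the request is built but not sent.

```python
import json
import urllib.request


def build_ip_access_list_request(workspace_url, token, label, cidrs, list_type="ALLOW"):
    """Build (but do not send) a POST request that creates an IP access list."""
    body = {"label": label, "list_type": list_type, "ip_addresses": list(cidrs)}
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.0/ip-access-lists",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# All values below are placeholders, not real workspace details.
req = build_ip_access_list_request(
    "https://adb-0000000000000000.0.azuredatabricks.net",
    "<personal-access-token>",
    "office",
    ["203.0.113.0/24"],
)
# Passing req to urllib.request.urlopen would submit it to the workspace.
```

Note that IP access lists also need to be enabled for the workspace before they take effect.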

Data-Platform

#Azure Databricks

08 January 2023

Data Factory & Synapse with managed VNET cannot connect to Keyvault

Credit

Thanks to Olivier Martin for his valuable insights and contributions to this post. He is a Microsoft Cloud Solution Architect for data analytics & AI.

TLDR;

When you create a linked service to a Key Vault that uses a private endpoint, from a Data Factory or Synapse workspace that uses a managed virtual network, the UI has no way to test the connection or to list the key vault's secrets and their versions.

That’s a known limitation when using a managed VNET. The solution is simple: enter the secret information manually (using Edit rather than the dropdown) and save the linked service. It will work when used in a pipeline or data flow.
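
Entering the secret manually amounts to writing a Key Vault secret reference by hand. The sketch below builds that reference in the JSON shape that Data Factory and Synapse linked services use; the helper name `key_vault_secret_reference`, the linked service name `AzureKeyVaultLS`, and the secret name `sql-connection-string` are hypothetical.

```python
import json


def key_vault_secret_reference(linked_service_name, secret_name, secret_version=None):
    """Build a secret reference that points at a Key Vault linked service."""
    reference = {
        "type": "AzureKeyVaultSecret",
        "store": {
            "referenceName": linked_service_name,  # the Key Vault linked service
            "type": "LinkedServiceReference",
        },
        "secretName": secret_name,
    }
    if secret_version:
        # Optional: omit to use the latest version of the secret.
        reference["secretVersion"] = secret_version
    return reference


# Hypothetical names, typed in by hand because the dropdown cannot list them.
secret_ref = key_vault_secret_reference("AzureKeyVaultLS", "sql-connection-string")
secret_json = json.dumps(secret_ref, indent=2)
```

Because nothing is validated in the UI, a typo in the secret name only surfaces when a pipeline or data flow runs, so it is worth double-checking the name against the key vault.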

Data-Platform

#Azure Synapse

#Data Factory

#Key Vault

06 September 2022

How to read data using synapsesql connector from Synapse spark with minimum permissions

TLDR;

The documented minimum permissions for using the synapsesql connector for Spark to read or write data in Synapse SQL pools grant high privileges to Spark users, even when the only operation required is a read. In this article, I’ll provide a workaround.

Data-Platform

#Azure Synapse

#Synapse Notebooks

#Synapse Spark

06 March 2022

Azure Synapse Analytics and Notebooks output

Notebooks in Synapse

Azure Synapse Analytics’ most appealing feature at first glance is Synapse Studio: one unified UX across data stores, notebooks, and pipelines. The notebook experience is most appreciated by people who load data that takes minutes or hours to read and then operate on it, whether for data engineering, feature engineering, or ML training. The ability to divide your code into smaller chunks, and to control which chunk runs when, is a powerful productivity tool.

An added value is that the notebook stores not only the code but also its results. In effect, the notebook now has data storage capacity, which worries some organizations that are highly regulated or that handle highly confidential information.

Data-Platform

#Azure Synapse

#Synapse Notebooks

#Synapse Spark
