Blog

Latest news & articles

Building a Robust Data Lakehouse with Medallion Architecture
16 April 2024

TL;DR:

This article explores building a robust data lakehouse using the medallion architecture, which organizes data into three layers—Bronze for raw data ingestion, Silver for data transformation, and Gold for optimized data aggregation. Best practices for each layer are outlined, including creating a separate staging area for raw data in the Bronze layer, enforcing data quality checks in the Silver layer, and optimizing query performance in the Gold layer. By following these best practices, organizations can effectively manage data across the data lakehouse, ensuring structured, governed, and optimized data operations.
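As a rough illustration of how the three layers relate, here is a minimal pure-Python sketch. The sample records, field names, and the `to_silver` / `to_gold` helpers are hypothetical and not taken from the article; a real lakehouse would typically use Spark and Delta tables rather than in-memory lists.

```python
from collections import defaultdict

# Hypothetical raw events as they might land in the Bronze layer (stored as-is).
bronze = [
    {"order_id": "1", "country": "NL", "amount": "10.50"},
    {"order_id": "2", "country": "nl", "amount": "4.50"},
    {"order_id": "2", "country": "nl", "amount": "4.50"},  # duplicate record
    {"order_id": "3", "country": "DE", "amount": "oops"},  # fails the type check
    {"order_id": "4", "country": "DE", "amount": "20"},
]


def to_silver(rows):
    """Silver: enforce quality checks, cast types, normalize, deduplicate."""
    seen, clean, rejected = set(), [], []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (KeyError, ValueError):
            rejected.append(row)  # keep rejects for inspection instead of dropping silently
            continue
        if row["order_id"] in seen:
            continue  # skip duplicates of an order already accepted
        seen.add(row["order_id"])
        clean.append(
            {
                "order_id": int(row["order_id"]),
                "country": row["country"].upper(),
                "amount": amount,
            }
        )
    return clean, rejected


def to_gold(rows):
    """Gold: aggregate into a query-friendly shape (total amount per country)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver, rejected = to_silver(bronze)
gold = to_gold(silver)
```

Keeping the rejected rows separate, rather than discarding them, is one possible way to make Silver-layer quality checks auditable.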

Routing Databricks traffic through Azure Firewall
29 January 2023

TLDR;

To route all outbound traffic from Databricks clusters through Azure Firewall, create a UDR (User Defined Route) and add it to the subnet where the Databricks clusters are created. Then add a Firewall Policy that allows the trusted traffic. Trusted traffic is mainly the mandatory traffic that Databricks needs to function, plus the traffic to storage and package repositories that you deem trusted.
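The default route described above could look roughly like the following sketch, which only builds the route definition as a Python dictionary. The property names follow the Azure Resource Manager route schema as I understand it; the helper name, route name, and firewall IP address are placeholders, not values from the post.

```python
import ipaddress


def default_route_via_firewall(firewall_private_ip, route_name="to-azure-firewall"):
    """Build a route that sends all outbound traffic to the firewall's private IP."""
    # Raises ValueError early if the address is malformed.
    ipaddress.ip_address(firewall_private_ip)
    return {
        "name": route_name,  # hypothetical route name
        "properties": {
            "addressPrefix": "0.0.0.0/0",       # match all outbound destinations
            "nextHopType": "VirtualAppliance",  # hand traffic to the firewall
            "nextHopIpAddress": firewall_private_ip,
        },
    }


# Placeholder private IP for the Azure Firewall.
route = default_route_via_firewall("10.0.1.4")
```

The resulting route would be added to a route table that is then associated with the Databricks subnets.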

Databricks IP Access list
22 January 2023

TLDR;

An IP access list is one of the ways to network-isolate Azure Databricks. It is a list of IP addresses that are allowed to access Azure Databricks. You can use this list to control access to Azure Databricks from specific IP addresses or ranges of IP addresses.

It’s only accessible through the REST API.
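A minimal sketch of preparing such a REST call is shown below. It assumes the `/api/2.0/ip-access-lists` endpoint and the `label`, `list_type`, and `ip_addresses` body fields of the Databricks REST API; the workspace URL, token, and label are placeholders, and the request is only built, not sent.

```python
import ipaddress
import json
import urllib.request


def build_ip_access_list_request(workspace_url, token, label, cidrs, list_type="ALLOW"):
    """Prepare (but do not send) a request that creates an IP access list."""
    for cidr in cidrs:
        # Validates single addresses and CIDR ranges; raises ValueError if malformed.
        ipaddress.ip_network(cidr, strict=False)
    body = {"label": label, "list_type": list_type, "ip_addresses": list(cidrs)}
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.0/ip-access-lists",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder workspace URL and token; the addresses are documentation ranges.
request = build_ip_access_list_request(
    "https://adb-0000000000000000.0.azuredatabricks.net",
    "<personal-access-token>",
    "office",
    ["203.0.113.0/24", "198.51.100.7"],
)
# Sending would be: urllib.request.urlopen(request)
```

Validating the addresses locally before sending is a design choice here, so that a typo fails fast instead of producing an API error.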

Data Factory & Synapse with managed VNET cannot connect to Keyvault
08 January 2023

Credit

Thank you to Olivier Martin, a Microsoft Cloud Solution Architect for data analytics & AI, for his valuable insights and contributions to this post.

TLDR;

When you create a linked service to a Key Vault that uses a private endpoint, in a Data Factory or Synapse workspace that uses a managed virtual network, the UI has no way to test the connection or to list the Key Vault’s secrets and their versions.

That’s a known limitation when using a managed VNET. The solution is simple: add the secret info manually (using edit, not the dropdown) and save the linked service. It will work when used in a pipeline or data flow.
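For reference, the secret information entered manually ends up as a Key Vault secret reference in the linked service JSON. The sketch below builds that fragment in Python, assuming the `AzureKeyVaultSecret` reference shape used by Data Factory; the helper name, linked service name, and secret name are hypothetical.

```python
def key_vault_secret_reference(linked_service_name, secret_name, secret_version=None):
    """Build the JSON fragment that points a linked service property at a Key Vault secret."""
    reference = {
        "type": "AzureKeyVaultSecret",
        "store": {
            "referenceName": linked_service_name,  # the Key Vault linked service
            "type": "LinkedServiceReference",
        },
        "secretName": secret_name,
    }
    if secret_version:
        # Omitting the version means the latest version of the secret is used.
        reference["secretVersion"] = secret_version
    return reference


# Hypothetical names, typed by hand because the UI cannot list them.
password_ref = key_vault_secret_reference("ls_keyvault", "sql-password")
```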

How to read data using synapsesql connector from Synapse spark with minimum permissions
06 September 2022

TLDR;

The documented minimum permissions for using the synapsesql connector for Spark to read or write data from Synapse SQL pools give Spark users high privileges, even when the required operation is only a read. In this article, I’ll provide a workaround.

Azure Synapse Analytics and Notebooks output
06 March 2022

Notebooks in Synapse

At first glance, the most appealing feature of Azure Synapse Analytics is Synapse Studio: one unified UX across data stores, notebooks and pipelines. The notebook experience is appreciated most by people who read data that takes minutes or hours to load and then operate on it, whether in data engineering, feature engineering or ML training. The ability to divide your code into smaller chunks, and to control which to execute and when, is a powerful productivity tool.

An added value is that the notebook stores not only code but also the results of your code. In effect, it now has data storage capacity, which worries some organizations that are highly regulated or handle highly confidential information.
