TL;DR
If you want to route all the outbound traffic from Databricks clusters to Azure Firewall, you need to create a UDR (User-Defined Route) and attach it to the subnets where the Databricks clusters are created, then add a Firewall Policy that allows the trusted traffic. The trusted traffic is mainly the mandatory traffic that Databricks needs to function, plus the traffic to the storage and package repositories that you deem trusted.
Azure Databricks networking depends on either creating a separate VNET or using an existing VNET (aka VNET injection). With VNET injection, you can prevent data exfiltration by routing all the traffic to Azure Firewall, but then you need to allow the trusted traffic through the firewall: the mandatory traffic that Databricks needs to function, plus the traffic to the storage and package repositories that you deem trusted.
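To make that concrete, here is a minimal sketch using the Python azure-mgmt-network SDK. It assumes a firewall private IP of 10.1.0.4 and placeholder resource names (databricks-rg, databricks-vnet, and the two Databricks subnets); adjust all of these to your environment:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import RouteTable

SUBSCRIPTION_ID = "<subscription-id>"          # placeholder
RG, VNET = "databricks-rg", "databricks-vnet"  # placeholders
FIREWALL_PRIVATE_IP = "10.1.0.4"               # placeholder: your firewall's private IP

client = NetworkManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Route table with a default route that sends all outbound traffic to the firewall.
route_table = client.route_tables.begin_create_or_update(
    RG, "databricks-udr",
    {
        "location": "canadacentral",
        "routes": [{
            "name": "default-to-firewall",
            "address_prefix": "0.0.0.0/0",
            "next_hop_type": "VirtualAppliance",
            "next_hop_ip_address": FIREWALL_PRIVATE_IP,
        }],
    },
).result()

# Attach the route table to both Databricks subnets (host and container).
for subnet_name in ("databricks-host-subnet", "databricks-container-subnet"):
    subnet = client.subnets.get(RG, VNET, subnet_name)
    subnet.route_table = RouteTable(id=route_table.id)
    client.subnets.begin_create_or_update(RG, VNET, subnet_name, subnet).result()
```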
Do you need exfiltration prevention?
This is the important question to ask yourself. If the users of Databricks are all your employees, what is the risk of the data being exfiltrated? Remember, unless employees are physically operating from a secure location with no cell phones or cameras, screenshots can still be taken. Databricks also has the ability to export notebook results (albeit it can be disabled, as in the sketch below). So if you take exfiltration prevention really seriously and you have closed all the other avenues, then you can consider this option.
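For completeness, here is a minimal sketch for turning off notebook export through the workspace-conf REST API, using plain Python requests. The endpoint is real, but treat the key enableExportNotebook as an assumption on my part; these keys are not all formally documented, so verify against the current Databricks docs:

```python
import requests

# Placeholders: your workspace URL and a personal access token with admin rights.
WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"

# Assumption: enableExportNotebook is the workspace-conf key for notebook export.
resp = requests.patch(
    f"{WORKSPACE_URL}/api/2.0/workspace-conf",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"enableExportNotebook": "false"},
)
resp.raise_for_status()
```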
Routing outbound traffic to (Azure) Firewall
Azure uses routing to implement the VNET concept; at the end of the day, a VNET is just a set of routing rules (system routes). You can also add your own routes, known as User-Defined Routes (UDRs).
You can choose to route everything to the firewall or only selected traffic. With the Databricks SCC (Secure Cluster Connectivity) model you don’t have to make any exceptions, because there are no public IPs and therefore no asymmetric-routing problem. I usually advise clients based on how agile their internal processes are for approving new IPs or URLs on their firewall. Either way, you can verify what a cluster VM actually gets by checking the effective routes on its NIC, as in the sketch below.
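A quick sanity-check sketch with the same SDK: once the UDR is attached, the effective route table on a cluster VM’s NIC should show 0.0.0.0/0 with next hop VirtualAppliance. The resource group and NIC names are placeholders; pick any NIC that Databricks created for a cluster node:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Placeholders: the resource group and a NIC belonging to a cluster VM.
poller = client.network_interfaces.begin_get_effective_route_table(
    "databricks-rg", "cluster-worker-nic-0"
)
for route in poller.result().value:
    print(route.address_prefix, route.next_hop_type, route.next_hop_ip_address)
```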
Firewall policies
Now the tricky part: many of the needed FQDNs and IP addresses are documented, but during the initial setup you will find some exceptions that you have to discover and add yourself. In my testing while preparing this post, I used Azure Firewall, and after every failed attempt to create a cluster, I checked the firewall logs and added the missing FQDNs and IP addresses. The video at the end of this post shows how to do that, and a sketch for querying the logs follows. Here is the list I discovered, to shorten your provisioning time.
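If you prefer scripting the log check over clicking through the portal, here is a minimal sketch using the azure-monitor-query package against a Log Analytics workspace that receives the firewall’s diagnostic logs. The workspace ID is a placeholder, and the query assumes the legacy AzureDiagnostics table; adjust it if you use resource-specific tables:

```python
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

WORKSPACE_ID = "<log-analytics-workspace-id>"  # placeholder

# Denied application-rule traffic from the last hour; the blocked FQDN
# appears inside the msg_s column.
query = """
AzureDiagnostics
| where Category == "AzureFirewallApplicationRule"
| where msg_s contains "Deny"
| project TimeGenerated, msg_s
"""

response = client.query_workspace(WORKSPACE_ID, query, timespan=timedelta(hours=1))
for table in response.tables:
    for row in table.rows:
        print(row)
```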
Mandatory FQDNs and IP addresses to successfully create a cluster
Note: my environment was in the Canada Central region, so check the equivalents for your region
Databricks-specific FQDNs
The Databricks workspace’s file system (DBFS) storage account: this is specific to each individual workspace, but the name will always start with dbstorage (in my case dbstoragessbf); each workspace will have its own unique storage account
dblogprodcacentral.blob.core.windows.net ==> logs for Canada Central
dbartifactsprodcacentral.blob.core.windows.net & dbartifactsprodcaeast.blob.core.windows.net ==> Databricks artifacts for Canada Central and its Canada East secondary
Azure-specific FQDNs
- All of the FQDNs below are undocumented, and your cluster will fail to create without them. When I asked around, I was told these are mainly for Azure monitoring: umsaj10d5kfqfqbgh2fl.blob.core.windows.net, umsavjqq5ch2tpwf3flt.blob.core.windows.net, umsamkd3kdzk11c3thds.blob.core.windows.net, md-hdd-2mz4jbjp05lv.z34.blob.storage.azure.net, md-hdd-dvf4xqsbbn1l.z45.blob.storage.azure.net, md-hdd-l0v4vlfmkzrj.z22.blob.storage.azure.net, md-hdd-jdrbkjj4wgbr.z29.blob.storage.azure.net, md-hdd-w1jjc5wcbtlc.z25.blob.storage.azure.net, md-hdd-f3zh34qqsr44.z35.blob.storage.azure.net, md-hdd-m2thlltw2hht.z46.blob.storage.azure.net, md-hdd-xx3g1wlr1ctc.z47.blob.storage.azure.net
- *.monitor.azure.com ==> Azure Monitor
Other
- ubuntu.com ==> for Ubuntu package updates
- cdnjs.cloudflare.com ==> CDN for Databricks UI
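To turn these lists into firewall policy, here is a sketch that creates an application rule collection on an Azure Firewall Policy using the azure-mgmt-network model classes. The policy name, collection names, priorities, and source address space are placeholders, and the FQDN list is abbreviated; extend it with everything above:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import (
    ApplicationRule,
    FirewallPolicyFilterRuleCollection,
    FirewallPolicyFilterRuleCollectionAction,
    FirewallPolicyRuleApplicationProtocol,
    FirewallPolicyRuleCollectionGroup,
)

client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

# A few of the FQDNs from the lists above; extend with the rest.
trusted_fqdns = [
    "dblogprodcacentral.blob.core.windows.net",
    "dbartifactsprodcacentral.blob.core.windows.net",
    "*.monitor.azure.com",
    "cdnjs.cloudflare.com",
]

https = FirewallPolicyRuleApplicationProtocol(protocol_type="Https", port=443)

allow_databricks = FirewallPolicyFilterRuleCollection(
    name="allow-databricks-fqdns",
    priority=200,
    action=FirewallPolicyFilterRuleCollectionAction(type="Allow"),
    rules=[ApplicationRule(
        name="databricks-mandatory",
        source_addresses=["10.1.2.0/24"],  # placeholder: Databricks subnets
        protocols=[https],
        target_fqdns=trusted_fqdns,
    )],
)

client.firewall_policy_rule_collection_groups.begin_create_or_update(
    "databricks-rg", "databricks-fw-policy", "databricks-rules",
    FirewallPolicyRuleCollectionGroup(priority=200, rule_collections=[allow_databricks]),
).result()
```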
Network rules (IP based)
- NTP for time sync (UDP port 123), shown in the sketch below
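The same rule collection group can also carry network rules. A minimal sketch of the NTP rule, built with the same model classes; append it to the rule_collections list in the previous call:

```python
from azure.mgmt.network.models import (
    FirewallPolicyFilterRuleCollection,
    FirewallPolicyFilterRuleCollectionAction,
    NetworkRule,
)

# UDP/123 to anywhere; tighten destination_addresses to known NTP servers if you can.
allow_ntp = FirewallPolicyFilterRuleCollection(
    name="allow-ntp",
    priority=300,
    action=FirewallPolicyFilterRuleCollectionAction(type="Allow"),
    rules=[NetworkRule(
        name="ntp-time-sync",
        ip_protocols=["UDP"],
        source_addresses=["10.1.2.0/24"],  # placeholder: Databricks subnets
        destination_addresses=["*"],
        destination_ports=["123"],
    )],
)
```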
If you are using Azure Firewall and you are OK with the risk of exfiltration to any SQL database or Event Hub, it is easy to use the service tags. If you use another firewall and you are still OK with SQL & Event Hubs, then you have to call the Azure REST APIs periodically (see the sketch after this list) to get the list of IPs for:
- Event Hubs in your region
- SQL in your region
But if you are preventing exfiltration to SQL & Event Hubs, then you have to find the specific instances used by your Databricks workspace and allow only those.
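The service tag discovery API that backs this is exposed in the SDK as well. Here is a sketch that pulls the regional Sql and EventHub prefixes (subscription and region are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

# One call returns every service tag with its current address prefixes.
tags = client.service_tags.list("canadacentral")
wanted = {"Sql.CanadaCentral", "EventHub.CanadaCentral"}

for tag in tags.values:
    if tag.name in wanted:
        print(tag.name)
        for prefix in tag.properties.address_prefixes:
            print("  ", prefix)
```

Run this on a schedule and diff the output against your firewall rules to catch new prefixes.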
Databricks Deep Dive Video Series
In this video series, I walk through the setup above, including how to check the Azure Firewall logs to discover the missing FQDNs and IP addresses.