Unity Catalog set-up in Azure Databricks
Here I cover the steps to set up Databricks Unity Catalog with ADLS Gen2.
What is Unity Catalog?
Unity Catalog is a data catalog and governance solution introduced by Databricks. It enables users to organize, manage, and govern their data assets within the Databricks Lakehouse platform, facilitating data discovery, lineage tracking, and metadata management.
Requirements
To set up Unity Catalog, you need to be a Global Administrator in Azure Active Directory and must have access to the Databricks account management console.
1. Access the account console
To get started, you'll need to access the account management console at https://accounts.azuredatabricks.net. After logging in, you will land on the page shown below.
2. Configure permissions
To ensure the Databricks workspace can access the metastore and data storage, create an Access Connector for Azure Databricks (a managed identity) and grant it the Storage Blob Data Contributor role (or higher) on the storage account. This storage account is used to store the metadata and data for managed tables.
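The access connector itself is easiest to create in the Azure portal (search for "Access Connector for Azure Databricks"), but the role assignment can be scripted. Below is a minimal sketch using the azure-identity and azure-mgmt-authorization packages; the resource names, the connector's managed identity principal ID, and the built-in role GUID are values you should verify for your own environment.

```python
# Minimal sketch: grant the access connector's managed identity the
# "Storage Blob Data Contributor" role on the ADLS Gen2 storage account.
# All names/IDs below are placeholders for your own environment.
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
storage_account = "<storage-account>"
connector_principal_id = "<access-connector-managed-identity-principal-id>"

# Scope = the storage account that will hold Unity Catalog metadata and managed tables
scope = (
    f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
    f"/providers/Microsoft.Storage/storageAccounts/{storage_account}"
)
# Built-in role definition ID for "Storage Blob Data Contributor"
role_definition_id = (
    f"/subscriptions/{subscription_id}/providers/Microsoft.Authorization"
    f"/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"
)

client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)
client.role_assignments.create(
    scope,
    str(uuid.uuid4()),  # role assignment name must be a GUID
    RoleAssignmentCreateParameters(
        role_definition_id=role_definition_id,
        principal_id=connector_principal_id,
        principal_type="ServicePrincipal",  # managed identities act as service principals
    ),
)
```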
3. Create metastore
When creating the metastore, you'll need to provide a name and select a region. Each region needs its own Unity Catalog metastore; if you have multiple Databricks workspaces in one region, they share the single metastore created for that region. You'll also need to provide an ADLS Gen2 path where managed table data and metadata will be stored; it's recommended to create a dedicated container for Unity Catalog.
Additionally, you will need to provide the Databricks access connector resource ID in the Access Connector ID field.
Note: You can have only one Unity Catalog metastore per region per Databricks account, and for a workspace to use Unity Catalog, a metastore must be attached to it.
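The metastore is usually created from the Data page of the account console, but the same operation can be scripted against the account-level REST API. The sketch below is an assumption-heavy illustration: the endpoint path, payload fields, and token handling reflect my reading of the Account API and should be verified against the current reference. The access connector ID from step 2 is still supplied through the console/credential configuration; the sketch only covers the basic metastore fields.

```python
# Minimal sketch (assumed endpoint and fields): create a Unity Catalog
# metastore via the Databricks Account API, pointing its storage root at a
# dedicated ADLS Gen2 container. Replace every placeholder with your own values.
import requests

account_id = "<databricks-account-id>"             # shown in the account console
aad_token = "<azure-ad-token-of-an-account-admin>"

resp = requests.post(
    f"https://accounts.azuredatabricks.net/api/2.0/accounts/{account_id}/metastores",
    headers={"Authorization": f"Bearer {aad_token}"},
    json={
        "name": "primary-metastore",
        "region": "westeurope",  # one metastore per region
        "storage_root": "abfss://unitycatalog@<storage-account>.dfs.core.windows.net/",
    },
)
resp.raise_for_status()
print(resp.json())  # the returned metastore_id is needed for the workspace assignment
```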
4. Assign the workspace to the metastore
Next, you need to assign the metastore to your Databricks workspace. Make sure the workspace is in the same region and is a Premium-tier workspace. This step is what allows Unity Catalog to work seamlessly with your workspace. Review any additional dialogs that may appear during the setup process.
Assign workspace(s) to the metastore
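The assignment can also be scripted through the Account API. As with the previous sketch, the HTTP method, path, and payload here are assumptions to verify against the Account API reference; the workspace ID is the numeric ID visible in the workspace URL.

```python
# Minimal sketch (assumed endpoint and fields): attach the metastore created
# above to a Premium workspace in the same region.
import requests

account_id = "<databricks-account-id>"
workspace_id = "<numeric-workspace-id>"
metastore_id = "<metastore-id-from-the-previous-step>"
aad_token = "<azure-ad-token-of-an-account-admin>"

resp = requests.put(
    f"https://accounts.azuredatabricks.net/api/2.0/accounts/{account_id}"
    f"/workspaces/{workspace_id}/metastores/{metastore_id}",
    headers={"Authorization": f"Bearer {aad_token}"},
    json={"metastore_assignment": {"default_catalog_name": "main"}},
)
resp.raise_for_status()
```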
5. Run workflows and explore the Unity Catalog
Once the setup is complete, you can start working with Unity Catalog. In the Data Explorer section, you'll see the catalog you created and the option to explore it.
You can then open a workspace linked to the Unity Catalog metastore to view the tables, examine lineage, add metadata and comments, manage permissions, and access insights on data usage. Unity Catalog provides a comprehensive view of your data assets and helps with data governance and discovery.
A running cluster is not needed to view the artifacts (schemas/databases/tables) created in Unity Catalog.
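The same artifacts can also be listed from a notebook, which, unlike the Data Explorer, does require a running cluster. A small sketch, assuming a catalog named main with a default schema:

```python
# List what the Unity Catalog metastore exposes from a notebook.
# "main" and "default" are placeholder catalog/schema names; use your own.
display(spark.sql("SHOW CATALOGS"))
display(spark.sql("SHOW SCHEMAS IN main"))
display(spark.sql("SHOW TABLES IN main.default"))
```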
Unity Catalog table details
Databricks still supports the legacy Hive metastore. To view artifacts created under the Hive metastore, you need a running Databricks cluster.
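In a Unity Catalog-enabled workspace, the legacy metastore surfaces as the special hive_metastore catalog, so with a cluster attached you can browse it the same way:

```python
# Browse legacy Hive metastore objects through the hive_metastore catalog
# (requires a running cluster).
display(spark.sql("SHOW SCHEMAS IN hive_metastore"))
display(spark.sql("SHOW TABLES IN hive_metastore.default"))
```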
6. Explore Unity Catalog features
Once you open a table created under Unity Catalog, you can see the schema of the data, sample data (a running cluster is required), permissions, and lineage.
Use the Permissions tab to list, grant, and revoke access.
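The same permissions can be managed in SQL from a notebook or the SQL editor. A sketch, assuming a hypothetical account-level group data_engineers and the placeholder names main.default.my_table:

```python
# Grant, inspect, and revoke Unity Catalog privileges with SQL.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data_engineers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.default TO `data_engineers`")
spark.sql("GRANT SELECT ON TABLE main.default.my_table TO `data_engineers`")

# List current grants on the table
display(spark.sql("SHOW GRANTS ON TABLE main.default.my_table"))

# Revoke the SELECT privilege again
spark.sql("REVOKE SELECT ON TABLE main.default.my_table FROM `data_engineers`")
```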
Use the Lineage tab to see the provenance of a data asset: the upstream and downstream data assets, along with the notebooks used to create them. The Lineage tab also shows a lineage graph like the one below.
Lineage graph
Errors
[UC_NOT_ENABLED] Unity Catalog is not enabled on this cluster
Check that a Databricks Runtime (DBR) version greater than 11 is selected.
Check that credential passthrough is not enabled on the cluster.
You should see a Unity Catalog tag on the cluster.
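Two quick checks you can run from a notebook attached to the failing cluster (current_metastore() is a built-in SQL function; the cluster-tags config key below is an assumption about the standard Databricks cluster tags):

```python
# current_metastore() only resolves when the cluster is attached to a
# Unity Catalog metastore; if it errors, the cluster is not UC-enabled.
print(spark.sql("SELECT current_metastore()").collect())

# Confirm the Databricks Runtime version is 11.x or later.
print(spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion"))
```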
Conclusion
Data catalogs and data governance are essential for managing and deriving value from data lakes and data meshes. With Unity Catalog in Databricks, you can easily set up and use a catalog and governance system for your lakehouse. By following the steps outlined in this blog post, you can enable Unity Catalog with ADLS Gen2 storage in your Azure Databricks environment. So, take control of your data assets and unlock the full potential of your data.
This blog has been written based on personal experience and to the best of my knowledge. If you come across any errors in this text or code, or if you have any suggestions for improvements, please feel free to leave a comment.