How to cleanup azure databricks workspace - azure-databricks

I am migrating a Databricks workspace from one account to another. As part of this process I need to refresh the workspace with updated data (notebooks, users, groups, clusters, databases, and tables).
Is there any process to clean up the Databricks workspace?

You can update the Azure Databricks workspace by using the Microsoft-provided management REST API. The following request updates the specified workspace:
PATCH https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Databricks/workspaces/{workspaceName}?api-version=2018-04-01
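For illustration, here is a minimal sketch of calling that endpoint with curl, assuming you already have a valid Azure AD bearer token and only want to update the workspace tags (the subscription, resource group, workspace name, and tag values are placeholders):
SUBSCRIPTION_ID="<subscription-id>"
RESOURCE_GROUP="<resource-group>"
WORKSPACE="<workspace-name>"
TOKEN=$(az account get-access-token --query accessToken -o tsv)   # ARM token for management.azure.com

curl -X PATCH \
  "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Databricks/workspaces/$WORKSPACE?api-version=2018-04-01" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"tags": {"environment": "migrated"}}'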
To clean up the resources, you can terminate the cluster. To do so, in the Azure Databricks workspace, select Clusters from the left pane. For the cluster you want to terminate, move the cursor over the ellipsis in the Actions column and select the Terminate icon. This stops the cluster.
If you do not terminate the cluster manually, it will stop automatically, provided you selected the Terminate after __ minutes of inactivity checkbox when creating the cluster. In that case the cluster stops automatically once it has been inactive for the specified time.
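If you prefer to script the cleanup, the same termination can be triggered from the Databricks CLI; a minimal sketch (the cluster ID below is a placeholder):
databricks clusters list                                        # find the ID of the cluster to stop
databricks clusters delete --cluster-id 0422-112415-fifes919    # terminates (stops) the cluster; it can be restarted later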

Related

can databricks cluster be shared across workspace?

My ultimate goal is to differentiate/manage cost on Databricks (Azure) based on different teams/projects.
I was wondering whether I could use workspaces to achieve this.
I read the passage below; it sounds like a workspace can access a cluster, but it does not say whether multiple workspaces can access the same cluster.
A Databricks workspace is an environment for accessing all of your Databricks assets. The workspace organizes objects (notebooks, libraries, and experiments) into folders, and provides access to data and computational resources such as clusters and jobs.
In other words, can I create a cluster and somehow ensure it can only be accessed by a certain project, team, or workspace?
To manage who can access a particular cluster, you can make use of cluster access control. With cluster access control, you can determine what users can do on the cluster, for example attach to the cluster, restart it, or fully manage it. You can do this on a user level but also on a user group level. Note that you have to be on the Azure Databricks Premium Plan to make use of cluster access control.
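As an illustration of doing this outside the UI, cluster permissions can also be set through the Databricks Permissions REST API; a minimal sketch, assuming a personal access token and using placeholder values for the workspace URL, cluster ID, and group name:
curl -X PATCH \
  "https://<databricks-instance>/api/2.0/permissions/clusters/<cluster-id>" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"access_control_list": [{"group_name": "project-a-team", "permission_level": "CAN_ATTACH_TO"}]}'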
You also mentioned that your ultimate goal is to differentiate/manage costs on Azure Databricks. For this you can make use of tags. You can tag workspaces, clusters, and pools; these tags are then propagated to cost analysis reports in the Azure portal (see here).
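For example, per-team tags can be added to a cluster's JSON definition via the custom_tags field (a sketch with made-up tag names; the same keys then show up in the cost analysis reports):
...
"custom_tags": {
  "team": "data-science",
  "project": "churn-model"
},
...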

How to use Azure Spot instances on Databricks

Spot instances bring the possibility of using spare capacity in the cloud at a lower price; however, if cloud demand increases, your resources will be deallocated. This is very useful for non-critical workloads, whenever you can afford to lose some of the work done.
Databricks has the possibility to run spot instances on AWS, but there is no documentation about how to do it on Azure.
Is it possible to run Databricks clusters on Azure Spot instances?
Yes, it is possible, but not through the Databricks UI. To use Azure spot instances on Databricks you need to use the databricks CLI.
Note
With the CLI tool it is possible to administer (create, edit, delete) clusters and instance pools. However, to simplify the process, I'll focus on editing an existing cluster.
You can install the databricks CLI using pip install databricks-cli and configure your credentials with databricks configure --token. For more information, visit the Databricks documentation.
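A minimal sketch of that setup (you will be prompted for your workspace URL and a personal access token generated under User Settings):
pip install databricks-cli
databricks configure --token
# Databricks Host: https://adb-1234567890123456.7.azuredatabricks.net
# Token: <personal-access-token>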
Run the command databricks clusters list to find the ID of the cluster you want to modify:
$ databricks clusters list
0422-112415-fifes919 Big Spark3 TERMINATED
0612-341234-jails230 Normal Spark3 TERMINATED
0212-623261-mopes727 Small 7.6 TERMINATED
In my case, I have 3 clusters. The first column is the cluster ID, the second one is the cluster name, and the last column is its state.
The command databricks clusters get outputs the cluster config in JSON format. Let's dump it to a file so we can modify it:
databricks clusters get --cluster-id 0422-112415-fifes919 > /tmp/my_cluster.json
This file contains all the configuration related to the cluster, like name, instance type, owner... In our case we are looking for the azure_attributes section. You will see something similar to:
...
"azure_attributes": {
"first_on_demand": 1,
"availability": "ON_DEMAND_AZURE",
"spot_bid_max_price": -1.0
},
...
We need to change availability to SPOT_WITH_FALLBACK_AZURE and set spot_bid_max_price to our bid price. Edit the file with your favorite tool. The result should look something like:
...
"azure_attributes": {
"first_on_demand": 1,
"availability": "SPOT_WITH_FALLBACK_AZURE",
"spot_bid_max_price": 0.4566
},
...
Once modified, just update the cluster with the new configuration file using databricks clusters edit:
databricks clusters edit --json-file /tmp/my_cluster.json
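If you prefer to script the edit instead of opening an editor, something like the following should also work, assuming jq is installed (the bid price is just an example value):
databricks clusters get --cluster-id 0422-112415-fifes919 \
  | jq '.azure_attributes.availability = "SPOT_WITH_FALLBACK_AZURE"
        | .azure_attributes.spot_bid_max_price = 0.4566' \
  > /tmp/my_cluster_spot.json
databricks clusters edit --json-file /tmp/my_cluster_spot.json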
Now, every time you start the cluster, the workers will be spot instances. To confirm this, you can go to the Configuration tab of a worker VM in the resource group managed by Databricks. You will see that Azure Spot is active, with the price you configured.
Databricks on AWS has more configuration options, like SPOT, for the availability field. However, until the documentation is released, we'll have to wait or configure things with a trial-and-error approach.

Pipeline disappeared from Azure Data Factory?

I have a streaming pipeline in Azure Data Factory that suddenly vanished from the Monitoring tab on the Azure Data Factory page. It has been running on Databricks for quite a long time (maybe more than 45 days without interruption).
One day the pipeline was no longer shown under ADF Pipeline runs in the Monitoring tab, and the previous runs disappeared as well, but none of the alerts set on the Databricks side fired. It turns out the job is still running on the Databricks side, but the corresponding ADF pipeline details disappeared along with the old runs for the same streaming pipeline.
How is this possible? Any reason for this to happen?
If the pipeline is still running, its runs should still show up in the monitor.
Maybe you can check the filter settings on the pipeline Monitor page, for example the time range filter, which only shows recent runs by default.

How to make Visual Studio use Ambari local user instead of default 'admin' while submitting hive query to HDInsight cluster

This question is in the context of an HDInsight Hadoop (Linux) cluster. Once we log in to the Azure account through Visual Studio, based on whichever Azure subscriptions are added to our account, we can see the list of HDInsight clusters in those subscriptions in Server Explorer, and then we can simply write/submit Hive queries to those clusters directly. This doesn't involve any authentication, as it happens seamlessly behind the scenes. But when we do that, the submitted Hive queries use the default 'admin' user. What if I have created a local Ambari user and would like Visual Studio to use that instead of the default 'admin'? Is there any way to force Visual Studio to use a local Ambari user?
If you have created a local Ambari user, you can link to or edit the cluster in Visual Studio and supply that user's credentials.
To link an HDInsight cluster:
1. Right-click HDInsight, and then select Link a HDInsight Cluster to display the Link a HDInsight Cluster dialog box.
2. Enter a Connection Url in the form https://CLUSTERNAME.azurehdinsight.net. The Cluster Name automatically fills in with the cluster name portion of your URL when you go to another field. Then enter a Username and Password, and select Next.
3. Select Finish. If the cluster linking is successful, the cluster is then listed under the HDInsight node.
To update a linked cluster, right-click the cluster and select Edit. You can then update the cluster information.
Reference: Azure HDInsight - Link to or edit a cluster

Elasticsearch on GKE - Unable to configure snapshot and restore using gcs plugin

We are setting up an ES cluster on GKE in the following format:
Master nodes as kubernetes deployments
Client nodes as kubernetes deployments with HPA
Data nodes as stateful sets with PVs
We are able to set up the cluster well, but we are struggling to configure the snapshot backup mechanism. Essentially, we are following this guide. We are able to follow it up to the step of getting the secret JSON key. Afterwards, we are not sure how to add this to the Elasticsearch keystore and proceed further. It would be great if someone could help us out with this.
In IAM & admin > Service Accounts, click Create Service Account, enter the name you wish and click Create; then under Role select "Storage Admin", for example, click Continue, and then click Done.
After that, in the Service Accounts section, click the 3 dots of the service account you created and select Create key > JSON > Create.
This will generate and download the JSON key file.
Then you need to store this key in the Elasticsearch keystore:
https://www.elastic.co/guide/en/elasticsearch/reference/7.4/secure-settings.html
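For reference, a minimal sketch of the remaining steps, assuming the repository-gcs plugin is installed and using placeholder file, repository, and bucket names; on GKE you would typically run the keystore command inside each Elasticsearch pod (or bake it into the image/an init container) and then reload secure settings or restart the nodes:
# Add the service-account key to the keystore on every node
bin/elasticsearch-keystore add-file gcs.client.default.credentials_file /path/to/service-account-key.json

# Reload secure settings so the new credential is picked up (or restart the nodes)
curl -X POST "http://localhost:9200/_nodes/reload_secure_settings"

# Register the GCS snapshot repository
curl -X PUT "http://localhost:9200/_snapshot/my_gcs_repository" \
  -H "Content-Type: application/json" \
  -d '{"type": "gcs", "settings": {"bucket": "my-es-snapshots", "client": "default"}}'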
