My ultimate goal is to differentiate and manage costs on Azure Databricks per team/project.
I was wondering whether I could use workspaces to achieve this.
I read the following; it sounds like a workspace can access a cluster, but it does not say whether multiple workspaces can access the same cluster or not:
A Databricks workspace is an environment for accessing all of your Databricks assets. The workspace organizes objects (notebooks, libraries, and experiments) into folders, and provides access to data and computational resources such as clusters and jobs.
In other words, can I create a cluster and somehow ensure it can only be accessed by a certain project, team, or workspace?
To manage who can access a particular cluster, you can use cluster access control. With cluster access control, you determine what users can do on the cluster, e.g. attach to it, restart it, or fully manage it. You can do this at the individual user level as well as at the group level. Note that you have to be on the Azure Databricks Premium plan to use cluster access control.
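For illustration, cluster permissions can also be set programmatically via the Permissions REST API. This is only a sketch: the workspace URL, token, cluster ID, and group name below are placeholders you would replace with your own, so check the current API docs for the exact shape.
curl -X PATCH https://<databricks-instance>/api/2.0/permissions/clusters/<cluster-id> \
  -H "Authorization: Bearer <personal-access-token>" \
  -d '{
        "access_control_list": [
          { "group_name": "team-a", "permission_level": "CAN_ATTACH_TO" }
        ]
      }'
This would let the hypothetical group team-a attach notebooks to the cluster without being able to restart or reconfigure it.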
You also mentioned that your ultimate goal is to differentiate/manage costs on Azure Databricks. For this you can use tags. You can tag workspaces, clusters, and pools; the tags are then propagated to the cost analysis reports in the Azure portal (see here).
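For example, cluster-level tags can be set in the cluster's JSON spec via the custom_tags field (the same JSON you would pass to databricks clusters edit); the tag keys and values below are just placeholders:
"custom_tags": {
  "team": "data-science",
  "project": "churn-model"
}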
By default, the ML HTTP server uses the Module DB inside ML.
(It seems all ML training materials refer to that type of configuration.)
Any changes to the XQuery programs need to be uploaded into the Module DB first. That can be accomplished with the mlLoadModules or mlReloadModules ml-gradle tasks.
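For reference, assuming a standard ml-gradle project with the Gradle wrapper, those tasks are run like this:
./gradlew mlLoadModules     # load new or modified modules into the Module DB
./gradlew mlReloadModules   # clear the Module DB and reload all modules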
CI/CD does not access the ML cluster directly. Everything goes through ml-gradle from a machine dedicated to code deployment to the different ML environments (dev/uat/prod, etc.).
However, it is also possible to configure the ML app server to load the XQuery programs from a physical disk location, as in the screenshot below.
With that configuration, there is no need to reload the programs into the ML Module DB.
The program changes then have to live on the ML server itself, and CI/CD will need access to the ML cluster directly. One advantage of this approach is that developers can easily see whether their changes have actually been deployed, since everything sits as readable text files on disk.
Questions:
Which way is better? Why?
Is there any ML query performance difference between these two approaches?
For the physical file approach, does it mean that CI/CD will need to deploy the program changes to every ML host in the cluster? (I guess this is not a concern if the HTTP server reads the XQuery programs from the Module DB inside ML, since the ML cluster will automatically sync the code across hosts.)
In general, it's recommended to deploy modules to a database rather than the filesystem.
This makes deployment simpler: you only have to load a module once into the modules database, whereas with the filesystem you need to put those files on every single host in the cluster.
With a modules database, if you were to add nodes to the cluster, you don't have to also deploy the modules. You can then also take advantage of High Availability, backup and restore, and all the other features of a database.
Once a module is read, it is loaded into caches, so the performance impact should be negligible.
If you plan to use REST extensions, then you would need a modules database so that the configurations can be installed in that database.
Some might look to use the filesystem for simple development on a single node, where changes saved to the filesystem are made available without re-deploying. However, you could use something like the ml-gradle mlWatch task to auto-deploy modules as they are modified on the filesystem and achieve effectively the same thing with a modules database (see the sketch below).
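As a sketch, assuming the standard ml-gradle setup, the watch task is simply:
./gradlew mlWatch   # polls the filesystem and auto-loads modified modules into the modules database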
Spot instances make it possible to use spare capacity in the cloud at a lower price; however, if cloud demand increases, your resources may be deallocated. This is very useful for non-critical workloads, where you can afford to lose some of the work done. More info: 2 3
Databricks can run clusters on spot instances on AWS, but there is no documentation about how to do it on Azure.
Is it possible to run Databricks clusters on Azure Spot instances?
Yes, it is possible, but not through the Databricks UI. To use Azure spot instances on Databricks you need to use the databricks CLI.
Note
With the CLI tool it is possible to administer (create, edit, delete) clusters and instance pools. However, to simplify the process, I'll focus on editing an existing cluster.
You can install the databricks CLI with pip install databricks-cli and configure your credentials with databricks configure --token. For more information, visit the Databricks documentation.
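For reference, those two commands are:
pip install databricks-cli
databricks configure --token   # prompts for the workspace URL and a personal access token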
Run the command databricks clusters list to find the ID of the cluster you want to modify:
$ databricks clusters list
0422-112415-fifes919 Big Spark3 TERMINATED
0612-341234-jails230 Normal Spark3 TERMINATED
0212-623261-mopes727 Small 7.6 TERMINATED
In my case, I have 3 clusters. The first column is the cluster ID, the second is the name of the cluster, and the last is the state.
The command databricks clusters get returns the cluster config in JSON format. Let's write it to a file so we can modify it:
databricks clusters get --cluster-id 0422-112415-fifes919 > /tmp/my_cluster.json
This file contains all the configuration related to the cluster, such as name, instance type, owner, etc. In our case we are looking for the azure_attributes section. You will see something similar to:
...
"azure_attributes": {
"first_on_demand": 1,
"availability": "ON_DEMAND_AZURE",
"spot_bid_max_price": -1.0
},
...
We need to change availability to SPOT_WITH_FALLBACK_AZURE and set spot_bid_max_price to our bid price. Edit the file with your favorite tool. The result should look something like:
...
"azure_attributes": {
"first_on_demand": 1,
"availability": "SPOT_WITH_FALLBACK_AZURE",
"spot_bid_max_price": 0.4566
},
...
Once modified, just update the cluster with the new configuration file using databricks clusters edit:
databricks clusters edit --json-file /tmp/my_cluster.json
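If you want to double-check before starting the cluster, you can fetch the config again and confirm that the azure_attributes section now shows the spot settings (grep usage here is just one convenient way to inspect the JSON):
databricks clusters get --cluster-id 0422-112415-fifes919 | grep -A 3 azure_attributes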
Now, every time you start the cluster, the workers will be spot instances. To confirm this, you can go to the Configuration tab of a worker VM allocated in the resource group managed by Databricks. You will see that Azure Spot is active, with the price you configured.
Databricks on AWS has more configuration options, such as SPOT for the availability field. However, until the documentation is released for Azure, we will have to wait or proceed by trial and error.
I have created two empty groups on two different nodes of my cluster, one on each node. My Ranger service uses Unix user synchronization. When I restart the Ranger service, I can't see the groups I added on the cluster nodes in the Ranger UI. I use HDP 2.5. How do I sync Ranger with Unix users?
Since you are trying to sync users, you already seem to understand that there are users for the OS and users for the Hadoop platform.
Typically, OS users are admin/ops people who need to manage the environment, while most platform users are engineers, analysts, and others who want to do something on the platform. This larger group of users is what you typically want to sync.
As already indicated by @cricket, you can integrate with LDAP/AD as explained here:
https://community.hortonworks.com/articles/105620/configuring-ranger-usersync-with-adldap-for-a-comm.html
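As a rough sketch, switching usersync from Unix to LDAP/AD boils down to Ranger usersync properties along these lines; all hostnames and DNs below are placeholders, and the exact property names and Ambari locations are covered in the article above:
ranger.usersync.source.impl.class = org.apache.ranger.ldapusersync.process.LdapUserGroupBuilder
ranger.usersync.ldap.url = ldap://ad.example.com:389
ranger.usersync.ldap.binddn = cn=ranger-sync,ou=service-accounts,dc=example,dc=com
ranger.usersync.ldap.user.searchbase = ou=users,dc=example,dc=com
ranger.usersync.group.searchbase = ou=groups,dc=example,dc=com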
Can you help me with installing cosmos-gui? I think you are one of the developers behind Cosmos, am I right?
We have already installed Cosmos, and now we want to install cosmos-gui.
In the link below, I found the install guide:
https://github.com/telefonicaid/fiware-cosmos/blob/develop/cosmos-gui/README.md#prerequisites
Under the subchapter “Prerequisites” it says:
A couple of sudoer users, one within the storage cluster and another one within the computing clusters, are required. Through these users, the cosmos-gui will remotely run certain administration commands such as new user creation, HDFS userspace provisioning, etc. The access through these sudoer users will be authenticated by means of private keys.
What is meant by the above? Must I create a sudoer user for the computing and storage clusters? And for that, do I need to install a MySQL DB?
And under the subchapter “Installing the GUI”:
Before continuing, remember to add the RSA key fingerprints of the Namenodes accessed by the GUI. These fingerprints are automatically added to /home/cosmos-gui/.ssh/known_hosts if you try an ssh access to the Namenodes for the first time.
I can't make any sense of the above. Can you give a step-by-step plan?
I hope you can help me.
JH
First of all, a reminder about the Cosmos architecture:
There is a storage cluster based on HDFS.
There is a computing cluster based on shared Hadoop or based on Sahara; that's up to the administrator.
There is a services node for the storage cluster, a special node not storing data but exposing storage-related services such as HttpFS for data I/O. It is the entry point to the storage cluster.
There is a services node for the computing cluster, a special node not involved in the computations but exposing computing-related services such as Hive or Oozie. It is the entry point to the computing cluster.
There is another machine hosting the GUI, not belonging to any cluster.
That being said, the paragraphs you mention try to explain the following:
Since the GUI needs to perform certain sudo operations on the storage and computing clusters for user account creation, a sudoer user must be created on both services nodes. These sudoer users will be used by the GUI to remotely perform the required operations over ssh.
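As a minimal sketch (the user name cosmos-gui, the key file name, and the unrestricted sudo rule are just examples to adapt), creating such a sudoer user on a services node could look like:
useradd -m cosmos-gui
echo "cosmos-gui ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers.d/cosmos-gui
mkdir -p /home/cosmos-gui/.ssh
cat cosmos_gui_id_rsa.pub >> /home/cosmos-gui/.ssh/authorized_keys   # public key of the GUI machine
chown -R cosmos-gui:cosmos-gui /home/cosmos-gui/.ssh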
Regarding the RSA fingerprints: since the operations the GUI performs on the services nodes are executed over ssh, the fingerprints the servers send back when you ssh into them must be included in the .ssh/known_hosts file. You can do this manually, or simply by ssh'ing into the services nodes for the first time (you will be prompted whether to add the fingerprints to the file).
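For example, assuming placeholder Namenode hostnames, you can pre-populate the file with ssh-keyscan instead of relying on that first interactive ssh:
ssh-keyscan -t rsa storage-namenode.example.com computing-namenode.example.com >> /home/cosmos-gui/.ssh/known_hosts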
MySQL appears in the requirements because that section lists all the prerequisites in general; they are not necessarily related to each other. In this particular case, MySQL is needed to store the account information.
We are always improving the documentation; we'll try to explain this better in the next release.
Is it possible to connect my Hadoop cluster to multiple Google Cloud projects at once?
I can easily use any Google Storage bucket in a single Google Cloud project via the Google Cloud Storage Connector, as explained in this thread: Migrating 50TB data from local Hadoop cluster to Google Cloud Storage. But I can't find any documentation or example of how to connect to two or more Google Cloud projects from a single map-reduce job. Do you have any suggestions/tricks?
Thanks a lot.
Indeed, it is possible to connect your cluster to buckets from multiple different projects at once. Ultimately, if you're following the instructions for using a service-account keyfile, the GCS requests are performed on behalf of that service account, which can be treated more or less like any other user. You can either add the service account email your-service-account-email@developer.gserviceaccount.com to all the different cloud projects owning the buckets you want to process (using the permissions section of cloud.google.com/console and simply adding that email address like any other member), or you can set GCS-level ACLs to grant that service account access like any other user.
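As a sketch, the relevant core-site.xml settings for the service-account keyfile flow look roughly like the following; property names vary between GCS connector versions, and the project ID, email, keyfile path, and bucket names are placeholders:
fs.gs.project.id = your-system-project-id
fs.gs.auth.service.account.enable = true
fs.gs.auth.service.account.email = your-service-account-email@developer.gserviceaccount.com
fs.gs.auth.service.account.keyfile = /path/to/your-service-account.p12
Once that service account has been granted access in the other projects (or on their buckets), the same job can read gs://bucket-in-project-a/ and write gs://bucket-in-project-b/ without any per-bucket configuration.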