How to use Azure Spot instances on Databricks

How to use Azure Spot instances on Databricks - azure-databricks

Spot instances brings the posibility to use free resources in the cloud paying a lower price, however if the cloud demand is increased your resources will be dealocated. This is very usefull for non critical workloads whenever you can aford to loose some of the work done. More info 2 3
Databricks has the posibility to run spot instances on AWS but there is no documentation about how to do it on Azure.
Is it possible to run Databricks clusters on Azure Spot instances?

Yes, it is possible but not using Databricks UI. To use Azure spot instances on Databricks you need to use databricks cli.
Note
With the cli tool is it possible to administrate -create, edit, delete- clusters and instances-pools. However, to simplify the process, I'll focus on editing an existing cluster.
You can install databricks cli using pip install databricks-cli and configure your credentials with databricks configure --token. For more information, visit databricks documentation.
Run the command datbricks clusters list to know the ID of the cluster you want to modify:
$ datbricks clusters list
0422-112415-fifes919 Big Spark3 TERMINATED
0612-341234-jails230 Normal Spark3 TERMINATED
0212-623261-mopes727 Small 7.6 TERMINATED
In my case, I have 3 clusters. First column is the cluster ID, second one is the name of the cluster. Last column is the state.
The command databricks cluster get generates the cluster config in json format. Let's generate the json file to modify it:
databricks clusters get --cluster-id 0422-112415-fifes919 > /tmp/my_cluster.json
This file contains all the configuration related to the cluster like name, instance type, owner... In our case we are looking for the azure_attributes section. You will see something similar to:
...
"azure_attributes": {
"first_on_demand": 1,
"availability": "ON_DEMAND_AZURE",
"spot_bid_max_price": -1.0
},
...
We need to change the availability to SPOT_WITH_FALLBACK_AZURE and spot_bid_max_price with our bid price. Edit the file with your favorite tool. The result should be something like:
...
"azure_attributes": {
"first_on_demand": 1,
"availability": "SPOT_WITH_FALLBACK_AZURE",
"spot_bid_max_price": 0.4566
},
...
Once modified, just update the cluster with the new configuration file using databricks clusters edit:
databricks clusters edit --json-file /tmp/my_cluster.json
Now, everytime you start the cluster, the workers will be spot instances.To confirm this, you can go to the configuration tab inside the worker VM that is allocated in the resource group managed by databricks. You will see the Azure spot is active and with the price configured.
Databricks on AWS has more configuration options like SPOT for the availability field. However, until the documentation is released we'll need to wait or configure with try-error approach.

Related

can databricks cluster be shared across workspace?

My ultimate goal is to differentiate/manage the cost on databricks (azure) based on different teams/project.
And I was thinking whether I could utilize workspace to achieve this.
I read below , it sounds like workspace can access a cluster, but does not say whether multiple workspace can access the same cluster or not.
A Databricks workspace is an environment for accessing all of your Databricks assets. The workspace organizes objects (notebooks, libraries, and experiments) into folders, and provides access to data and computational resources such as clusters and jobs.
In other words, can I creat a cluster and somehow ensure can be only accessed by certain project or team or workspace?

To manage whom can access a particular cluster, you can make use of cluster access control. With cluster access control, you can determine what users can do on the cluster. E.g. attach to the cluster, the ability to restart it or to fully manage it. You can do this on a user level but also on a user group level. Note that you have to be on Azure Databricks Premium Plan to make use of cluster access control.
You also mentioned that your ultimate goal is to differentiate/manage costs on Azure Databricks. For this you can make use of tags. You can tag workspaces, clusters and pools which are then propagated to cost analysis reports in the Azure portal (see here).

hazelcast-jet deployment and data ingestion

I have a distributed system running on AWS EC2 instances. My cluster has around 2000 nodes. I want to introduce a stream processing model which can process metadata being periodically published by each node (cpu usage, memory usage, IO and etc..). My system only cares about the latest data. It is also OK with missing a couple of data points when the processing model is down. Thus, I picked hazelcast-jet which is an in-memory processing model with great performance. Here I have a couple of questions regarding the model:
What is the best way to deploy hazelcast-jet to multiple ec2 instances?
How to ingest data from thousands of sources? The sources push data instead of being pulled.
How to config client so that it knows where to submit the tasks?
It would be super useful if there is a comprehensive example where I can learn from.

What is the best way to deploy hazelcast-jet to multiple ec2 instances?
Download and unzip the Hazelcast Jet distribution on each machine:
$ wget https://download.hazelcast.com/jet/hazelcast-jet-3.1.zip
$ unzip hazelcast-jet-3.1.zip
$ cd hazelcast-jet-3.1
Go to the lib directory of the unzipped distribution and download the hazelcast-aws module:
$ cd lib
$ wget https://repo1.maven.org/maven2/com/hazelcast/hazelcast-aws/2.4/hazelcast-aws-2.4.jar
Edit bin/common.sh to add the module to the classpath. Towards the end of the file is a line
CLASSPATH="$JET_HOME/lib/hazelcast-jet-3.1.jar:$CLASSPATH"
You can duplicate this line and replace -jet-3.1 with -aws-2.4.
Edit config/hazelcast.xml to enable the AWS cluster discovery. The details are here. In this step you'll have to deal with IAM roles, EC2 security groups, regions, etc. There's also a best practices guide for AWS deployment.
Start the cluster with jet-start.sh.
How to config client so that it knows where to submit the tasks?
A straightforward approach is to specify the public IPs of the machines where Jet is running, for example:
ClientConfig clientConfig = new ClientConfig();
clientConfig.getGroupConfig().setName("jet");
clientConfig.addAddress("54.224.63.209", "34.239.139.244");
However, depending on your AWS setup, these may not be stable, so you can configure to discover them as well. This is explained here.
How to ingest data from thousands of sources? The sources push data instead of being pulled.
I think your best option for this is to put the data into a Hazelcast Map, and use a mapJournal source to get the update events from it.

What's the right way to provide Hadoop/Spark IAM role based access for S3?

We have Hadoop cluster running on EC2 and EC2 instances attached to a role which has access to S3 bucket for example: "stackoverflow-example".
Several users are placing Spark jobs in the cluster, we used keys in the past but do not want to continue and want to migrate to role, so any jobs placed on the Hadoop cluster will use role associated with ec2 instances. Did a lot of search and found 10+ tickets, some of them are still open, some of them are fixed and some of them do not have any comments.
Want to know whether it's still possible to use IAM role for jobs(Spark, Hive, HDFS, Oozie, etc) placing on Hadoop cluster. Most of the tutorials are discussing passing key (fs.s3a.access.key, fs.s3a.secret.key) which is not good enough and not secured as well. We also faced issues with credential provider with Ambari.
Some references:
https://issues.apache.org/jira/browse/HADOOP-13277
https://issues.apache.org/jira/browse/HADOOP-9384
https://issues.apache.org/jira/browse/SPARK-16363

That first one you link to HADOOP-13277 says "can we have IAM?" to which the JIRA was closed "you have this in s3a". The second, HADOOP-9384, was "add IAM to S3n", closed as "switch to s3a". And SPARK-16363? incomplete bugrep.
If you use S3a, and do not set any secrets, then the s3a client will fall back to looking at the special EC2 instance metadata HTTP server, and try to get the secrets from there.
That it: it should just work.

How can I change the default bucket of a google-cloud-based hadoop-enable cluster after I created it?

After I create a google-cloud-based hadoop-enable cluster, I want to change the default bucket to a different one, how can I do that? I can't find the answer in google cloud doscumentation. Thanks!

Did you create a cluster by hand, using bdutil, using Cloud Dataproc or through some other means?
bdutil
If you used bdutil, see the choose a default file system section in the setup documentation.
Cloud Dataproc
If you used Cloud Dataproc, you can access any bucket to which your project has permission by using the gs:// uri. If you want to connect your cluster to a new bucket for logs, you will have to create a new cluster, unfortunately.
Other method
If you used a different method, like the "click to deploy" launcher, I recommend you give Dataproc or bdutil a try.

Hadoop on Amazon Cloud

I'm trying to get set up on the Amazon Cloud to run some hadoop MapReduce jobs but I'm struggling to successfully create a cluster. I have downloaded the ec2 files, have my certificates and keypair file, but I believe it's the AMIs that are causing me trouble. If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the code "hadoop-ec2 launch-cluster name n" in the terminal. The master node is successful, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)" and I'm not entirely sure how to progress.
Also, some of my jobs will require an alteration in hadoops parameter settings (specifically the mapred-site.xml config file), is it possible to alter this file, and if so, how do I gain access to it? Is hadoop already installed on amazon machines, with this file accessible and alterable?
Thanks

Have you tried Amazon Elastic MapReduce? This is a simple API that brings up Hadoop clusters of a specified size on demand.
That's easier then to create own cluster manually.
But once the jobflow is finished by default it shuts the cluster down, leaving you with outputs on S3. If what you need is simply to do some crunching, this may be the way to go.
In case you need HDFS contents stored permanently (e.g. if you are running HBase on top of Hadoop) you may actually need own cluster on EC2. In this case you may find Cloudera's distribution of Hadoop for Amazon EC2 useful.
Altering Hadoop configuration on nodes it will start is possible using EC2 Bootstrap Actions:
Q: How do I configure Hadoop settings for my job flow?
The Elastic MapReduce default Hadoop configuration is appropriate for most workloads. However, based on your job flow’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your job flow on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.
About the way you are starting the cluster, please clarify:
If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the code "hadoop-ec2 launch-cluster name n" in the terminal. The master node is successful, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)" and I'm not entirely sure how to progress.
How exactly you are trying start it? What exactly AMIs are you using?

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio