Getting list of EMR Release labels via Amazon API - hadoop

I need to receive the list of available EMR Release labels in order to run my Java application which starts an EC2 instance and executes a hadoop job. The main problem here that EMR Release labels are specific for each region and I need to get this list dynamically. The sample of the code used in my application:
runJobFlowRequest.setReleaseLabel( "emr-4.8.0" );
Does anyone have an idea where I could obtain this list of Release labels programmatically via Amazon API in order to use it in my application?

I do not think we have any API exposed to get this list programatically. Usually most regions will have same release labels and EMR Console should show the most updated list for a specific region.
You can also see the list on the documentation : https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-relguide-links.html

Related

How to use Azure Spot instances on Databricks

Spot instances brings the posibility to use free resources in the cloud paying a lower price, however if the cloud demand is increased your resources will be dealocated. This is very usefull for non critical workloads whenever you can aford to loose some of the work done. More info 2 3
Databricks has the posibility to run spot instances on AWS but there is no documentation about how to do it on Azure.
Is it possible to run Databricks clusters on Azure Spot instances?
Yes, it is possible but not using Databricks UI. To use Azure spot instances on Databricks you need to use databricks cli.
Note
With the cli tool is it possible to administrate -create, edit, delete- clusters and instances-pools. However, to simplify the process, I'll focus on editing an existing cluster.
You can install databricks cli using pip install databricks-cli and configure your credentials with databricks configure --token. For more information, visit databricks documentation.
Run the command datbricks clusters list to know the ID of the cluster you want to modify:
$ datbricks clusters list
0422-112415-fifes919 Big Spark3 TERMINATED
0612-341234-jails230 Normal Spark3 TERMINATED
0212-623261-mopes727 Small 7.6 TERMINATED
In my case, I have 3 clusters. First column is the cluster ID, second one is the name of the cluster. Last column is the state.
The command databricks cluster get generates the cluster config in json format. Let's generate the json file to modify it:
databricks clusters get --cluster-id 0422-112415-fifes919 > /tmp/my_cluster.json
This file contains all the configuration related to the cluster like name, instance type, owner... In our case we are looking for the azure_attributes section. You will see something similar to:
...
"azure_attributes": {
"first_on_demand": 1,
"availability": "ON_DEMAND_AZURE",
"spot_bid_max_price": -1.0
},
...
We need to change the availability to SPOT_WITH_FALLBACK_AZURE and spot_bid_max_price with our bid price. Edit the file with your favorite tool. The result should be something like:
...
"azure_attributes": {
"first_on_demand": 1,
"availability": "SPOT_WITH_FALLBACK_AZURE",
"spot_bid_max_price": 0.4566
},
...
Once modified, just update the cluster with the new configuration file using databricks clusters edit:
databricks clusters edit --json-file /tmp/my_cluster.json
Now, everytime you start the cluster, the workers will be spot instances.To confirm this, you can go to the configuration tab inside the worker VM that is allocated in the resource group managed by databricks. You will see the Azure spot is active and with the price configured.
Databricks on AWS has more configuration options like SPOT for the availability field. However, until the documentation is released we'll need to wait or configure with try-error approach.

Monitoring EBS volumes for istances with CloudWatch Agent and CDK

I'm trying to set up a way to monitor disk usage for instances belonging to an AutoScaling Group, and add an alarm when the volumes associated to the instances are almost full.
Since it seems there are no metrics normally offered by Amazon to do that, I resorted using the CloudWatch Agent to get what I wanted. So far so good, I can create graphs and alarms for the metrics I want using the CloudWatch console.
My issue is how to automate everything with CDK. How can I automate the creation of the metric for each instance, without knowing the instance id beforehand? Is there a solution for this issue?
You can install and config CloudWatch agent via EC2 user data and the auto scaling group uses launch template to launch EC2 instance. All of those things can be done by AWS CDK.
There is an example from this open source project for your reference.
Another approach you could take is using AWS Systems Manager. Essentially, you install an SSM agent for your instances, and create an SSM Document (think Shell/Python script) that will run your setup script/automation.
You then create a State Manager Association, tying the SSM Document with your instances based on EC2 tags e.g. Application=MyApp or Team=MyTeam. This way, you don't have to provide any resource ids, just the tag key value pair which could extend multiple instances and future instance replacements. You can schedule it to run at specific times (cron) or at a certain frequency (rate) to enforce state.

Working with pre emptive VMs on Google Compute Engine

I'm trying to work with multiple Pre-emptive VM instances on Google Compute Engine for elastic search service and facing some doubts as follows:-
Is 30sec window enough to store data from a preemptive elastic search instance to a stable VM?
How to save the state of one VM that is ending and restore it to other?
Is there an alternative to Google Autoscaler?
You could try running a 'shutdown-script' to create a snapshot with the command:
gcloud compute disks snapshot [disk_name] --zone=[zone] --snapshot-names=[snapshot_name]
Although you should manage to have different snapshot names. Using this you would have a backup of the current VM state, but there is no automatic way to create another VM from this snapshot when is created.
As far as I know there is no predefined alternative that do the same as the autoscaler. Although you could also try using the shutdown script to starts a VM.

How can I change the default bucket of a google-cloud-based hadoop-enable cluster after I created it?

After I create a google-cloud-based hadoop-enable cluster, I want to change the default bucket to a different one, how can I do that? I can't find the answer in google cloud doscumentation. Thanks!
Did you create a cluster by hand, using bdutil, using Cloud Dataproc or through some other means?
bdutil
If you used bdutil, see the choose a default file system section in the setup documentation.
Cloud Dataproc
If you used Cloud Dataproc, you can access any bucket to which your project has permission by using the gs:// uri. If you want to connect your cluster to a new bucket for logs, you will have to create a new cluster, unfortunately.
Other method
If you used a different method, like the "click to deploy" launcher, I recommend you give Dataproc or bdutil a try.

Capacity scheduler in Amazon Elastic MapReduce

I am totally new to Amazon Elastic MapReduce. I have a need that I want to use my custom scheduler, which is implemented based on Hadoop capacity scheduler, to schedule my jobs in Amazon Elastic MapReduce.
According to my current understanding, to achieve this, I can define only one stage in the job flow, and submit my custom jar file via SSH connection to the master node. However, I cannot find how can I edit the xml configuration files, like capacity-scheduler.xml in the master node. Anyone knows how to do that?
Moreover, if I want to add the dynamic sizing property onto it, can I dynamically tune the number of task nodes in the cluster, when the job is currently running? Or in per stage, the size of a cluster should remain the same? Thank you so much.
You should use a bootstrap action to change Hadoop configuration.
The following AWS doc can be referenced for Hadoop configuratio bootstrap action.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html#PredefinedbootstrapActions_ConfigureHadoop
This blog article that I bookmarked also has some info.
http://sujee.net/tech/articles/hadoop/amazon-emr-beyond-basics/
For changing the cluster size dynamically, one option is to use the AWS SDK.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/calling-emr-with-java-sdk.html
Using the following interface you can modify the instance count of the instance group.
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/AmazonElasticMapReduce.html

Resources