How to create an Amazon EMR cluster from the command line in Ubuntu? - hadoop

How do I create an Amazon EMR cluster from the command line in Ubuntu? I have the private key, the access key and the pem file. Can anyone guide me on how to run the word count example from the command line?

You can use the AWS Command Line Interface (CLI) for this: http://docs.aws.amazon.com/cli/latest/userguide/installing.html
Once it is installed, configure the tool with the 'aws configure' command and enter your access key ID and secret access key.
http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
You will also need to enter the region where your EMR cluster (and other resources) will be launched.
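The same configuration can also be done without the interactive prompts. A minimal sketch, assuming the key values and the region below are placeholders to be replaced with your own; the `run` wrapper and the `configure_aws_cli` function are hypothetical helpers, and by default the script only prints the commands:

```shell
#!/usr/bin/env bash
# Sketch: configure the AWS CLI non-interactively with 'aws configure set'.
# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to execute.

# Print the command in dry-run mode (shell quoting is not reproduced),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: stores both keys and the region in the default profile.
configure_aws_cli() {
  local access_key_id="$1" secret_access_key="$2" region="$3"
  run aws configure set aws_access_key_id "$access_key_id"
  run aws configure set aws_secret_access_key "$secret_access_key"
  run aws configure set region "$region"
}

# Placeholder values, not real credentials.
configure_aws_cli "AKIAEXAMPLEKEYID" "exampleSecretAccessKey" "us-east-1"
```

Running 'aws configure' interactively achieves the same result and avoids leaving the secret key in your shell history.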
To create the cluster, use the 'aws emr create-cluster' command.
http://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html
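As a rough illustration, a create-cluster call could look like the sketch below. The cluster name, release label, instance type and instance count are assumptions chosen for the example, and `--use-default-roles` assumes the default EMR roles already exist in your account (they can be created once with 'aws emr create-default-roles'). The `run` wrapper and `create_emr_cluster` are hypothetical helpers; by default nothing is launched, the command is only printed:

```shell
#!/usr/bin/env bash
# Sketch: launch a small EMR cluster with Hadoop installed.
# DRY_RUN=1 (the default) only prints the command; set DRY_RUN=0 to execute.

# Print the command in dry-run mode (shell quoting is not reproduced),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper wrapping 'aws emr create-cluster'.
create_emr_cluster() {
  local name="$1" release="$2" instance_type="$3" instance_count="$4"
  run aws emr create-cluster \
    --name "$name" \
    --release-label "$release" \
    --applications Name=Hadoop \
    --instance-type "$instance_type" \
    --instance-count "$instance_count" \
    --use-default-roles
}

# Example values only; pick a release and instance type that suit your account.
create_emr_cluster "wordcount-demo" "emr-5.36.0" "m5.xlarge" 3
```

When executed for real, the command returns a ClusterId (of the form j-...), which you need later for add-steps.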
You don't need the pem file for these steps; it is only needed if you want to SSH into the cluster nodes.
Once the cluster is launched, you can add the word count demo as a 'step'.
Starting a cluster and running a hadoop job (a script in this case):
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hadoop-script.html
This section has some examples of add-steps for an already running cluster:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
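For the word count demo specifically, a step might be added along these lines. This is a sketch under assumptions: the cluster ID and S3 paths are placeholders, the cluster is assumed to run an EMR release that provides command-runner.jar, and the examples jar is assumed to be at /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar on the master (check the path on your release). The `run` wrapper and `add_wordcount_step` are hypothetical helpers that only print the command by default:

```shell
#!/usr/bin/env bash
# Sketch: submit the Hadoop word count example as a step on a running cluster.
# DRY_RUN=1 (the default) only prints the command; set DRY_RUN=0 to execute.

# Print the command in dry-run mode (shell quoting is not reproduced),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper wrapping 'aws emr add-steps'.
add_wordcount_step() {
  local cluster_id="$1" input="$2" output="$3"
  # Assumed location of the examples jar on the master node.
  local examples_jar="/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar"
  run aws emr add-steps \
    --cluster-id "$cluster_id" \
    --steps "Type=CUSTOM_JAR,Name=WordCount,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[hadoop,jar,${examples_jar},wordcount,${input},${output}]"
}

# Placeholder cluster ID and bucket; the output directory must not exist yet.
add_wordcount_step "j-XXXXXXXXXXXXX" "s3://mybucket/input/" "s3://mybucket/output/"
```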

Related

Why hadoop commands don't work on google cloud shell

After creating a cluster for my project in Google Dataproc, I tried to run several Hadoop commands (like hadoop fs -ls). Unfortunately it appears Cloud Shell doesn't see Hadoop at all!
-bash: hadoop: command not found
Someone on stackoverflow said:
"It doesn't work in Cloud Shell because it doesn't have Hadoop CLI
utilities pre-installed."
But I have no idea how to install or activate it. Maybe it can be done through cluster creation, but I had an issue creating the cluster through the Dataproc API, so I created it through Cloud Shell instead.
What should I do to use Hadoop commands in cloud shell properly?
Apparently Hadoop commands work only on the cluster's VM instances, not in the general project directory of Cloud Shell. So make sure you connect to the cluster over SSH via Compute Engine -> VM instances -> [your node] in the INSTANCES tab.
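The same connection can be made from Cloud Shell with gcloud instead of the console. A minimal sketch, assuming a standard (non-HA) cluster whose master VM is named "<cluster-name>-m"; the cluster name, zone and path are placeholders, and the `run` wrapper and `dataproc_hadoop_ls` are hypothetical helpers that only print the command by default:

```shell
#!/usr/bin/env bash
# Sketch: run a Hadoop command on the Dataproc master node over SSH.
# DRY_RUN=1 (the default) only prints the command; set DRY_RUN=0 to execute.

# Print the command in dry-run mode (shell quoting is not reproduced),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: lists an HDFS path from the master VM.
dataproc_hadoop_ls() {
  local cluster="$1" zone="$2" path="$3"
  # The master of a standard Dataproc cluster is the VM "<cluster-name>-m".
  run gcloud compute ssh "${cluster}-m" \
    --zone="$zone" \
    --command="hadoop fs -ls ${path}"
}

# Placeholder cluster name and zone.
dataproc_hadoop_ls "my-cluster" "us-central1-a" "/"
```

Leaving out `--command` opens an interactive shell on the node, where hadoop is on the PATH.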

Run custom shell script on all slave nodes in EMR

The AWS step documentation says steps only execute on the master. Does that mean that even if I am logged in to one of the slave nodes and execute the add-steps command there, the step is still added to the master node only? How can I then execute a custom shell script on all the slave nodes? Bootstrapping is not an option, since the shell script requires emrf-site.xml to already exist, which does not happen until the EMR cluster is completely up and running.
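You can use the "Custom JAR" step type to run "script-runner.jar", which downloads a bash script from S3 and executes it as a step. Since steps execute on the master, a script that has to touch every node needs to reach the other nodes from there (for example over SSH):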
aws emr create-cluster --name ... --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/my_script.sh"]
More info here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html
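For a cluster that is already running, the same script-runner step can be submitted with add-steps instead of create-cluster. A minimal sketch, assuming the cluster ID, region and script URI are placeholders; the `run` wrapper and `add_script_step` are hypothetical helpers that only print the command by default:

```shell
#!/usr/bin/env bash
# Sketch: submit a bash script from S3 as a script-runner step.
# DRY_RUN=1 (the default) only prints the command; set DRY_RUN=0 to execute.

# Print the command in dry-run mode (shell quoting is not reproduced),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper wrapping 'aws emr add-steps'.
add_script_step() {
  local cluster_id="$1" region="$2" script_uri="$3"
  # script-runner.jar lives in a per-region bucket, e.g. us-east-1.elasticmapreduce.
  local runner="s3://${region}.elasticmapreduce/libs/script-runner/script-runner.jar"
  run aws emr add-steps \
    --cluster-id "$cluster_id" \
    --steps "Type=CUSTOM_JAR,Name=RunScript,ActionOnFailure=CONTINUE,Jar=${runner},Args=[${script_uri}]"
}

# Placeholder cluster ID, region and script location.
add_script_step "j-XXXXXXXXXXXXX" "us-east-1" "s3://mybucket/script-path/my_script.sh"
```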

How to Start and Stop Cloudera Cluster CD5 Using Command Line or Shell Script

I have installed a Cloudera cluster on AWS EC2 instances.
I can easily start or stop it using Cloudera Manager.
But now I want to make a shell script that can start or stop it.
What is the command line to start and stop the cluster and all its services?

Adding new Spark workers on AWS EC2 - access error

I have an existing, operating Spark cluster that was launched with the spark-ec2 script. I'm trying to add a new slave by following these instructions:
Stop the cluster
On AWS console "launch more like this" on one of the slaves
Start the cluster
Although the new instance is added to the same security group and I can successfully SSH to it with the same private key, spark-ec2 ... start call can't access this machine for some reason:
Running setup-slave on all cluster nodes to mount filesystems, etc...
[1] 00:59:59 [FAILURE] xxx.compute.amazonaws.com
Exited with error code 255 Stderr: Permission denied (publickey).
This is, obviously, followed by tons of other errors while trying to deploy Spark on this instance.
The reason is that the Spark master machine doesn't have rsync access to this new slave, even though port 22 is open...
The issue was that the SSH key generated on the Spark master was not transferred to the new slave. The spark-ec2 script's start command omits this step. The solution is to use the launch command with the --resume option. Then the SSH key is transferred to the new slave and everything goes smoothly.
Yet another solution is to add the master's public key (~/.ssh/id_rsa.pub) to the newly added slave's ~/.ssh/authorized_keys. (I got this advice on the Spark mailing list.)
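The manual variant can be sketched as a small helper to run on the new slave once the master's id_rsa.pub has been copied there (for example with scp). The function name `append_key_once` and the /tmp path in the usage comment are hypothetical, and the sketch assumes the public key file holds a single line:

```shell
#!/usr/bin/env bash
# Sketch: append a public key to an authorized_keys file unless it is
# already present, so repeated runs do not create duplicate entries.

append_key_once() {
  local pub_key_file="$1" auth_file="$2"
  local key
  key="$(cat "$pub_key_file")"

  # Make sure the target file exists with the permissions sshd expects.
  mkdir -p "$(dirname "$auth_file")"
  touch "$auth_file"
  chmod 600 "$auth_file"

  # -x matches whole lines, -F treats the key as a literal string.
  if ! grep -qxF -- "$key" "$auth_file"; then
    printf '%s\n' "$key" >> "$auth_file"
  fi
}

# Example, on the new slave, after copying the master's key to /tmp:
#   append_key_once /tmp/master_id_rsa.pub ~/.ssh/authorized_keys
```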

hadoop cluster clarification

I am a newbie in Hadoop and I am trying to run a Hadoop jar on Amazon EC2. I started my Amazon EC2 instance through the console, uploaded my files to the DFS, and was then able to successfully run the job jar and generate output on the instance.
But I am still confused about one part. I am not sure whether the job ran on a single machine in Amazon EC2 or on a cluster. How do I find the number of worker nodes involved in my jar run?
In some reference links I see that we have to use the launch-cluster command, for example "bin/hadoop-ec2 launch-cluster test-cluster 2". What is the difference between starting the instance from the console and using a command like launch-cluster?
