Assigning the host name to a hadoop job using weka in AWS - hadoop

I have been using the wekaDistributedHadoop1.0.4 and wekaDistributedBase1.0.2 packages on my local machine to run some of the basic jobs. There is a field "HDFS host" which must be filled in order to run the jobs. I have been using "localhost" since I have been testing on my local machine and this works fine. I blindly tried using "localhost" when running on AWS EMR but the job failed. What I would like to know is what host name should I be entering into the field so that weka will call on the correct master? Is it the public DNS name which is provided when starting the cluster or is there a method in the API which gets that address for me?

If you want to manually do it.
Create a cluster and keep it alive, you can find info in amazon ec2 instances manage console, in security group elastic mapreduce master/slave. Find it out, login master node and edit conf file and fill with right name.
If you need automatically do it.
Write a shell executed in bootstrap. You can refer to https://serverfault.com/questions/279297/what-is-the-easiest-way-to-get-a-ec2-public-dns-inside-a-running-instance

Related

Run deck Automation script

I need a help, I am trying to write bash/shell script which will be placed in Rundeck tool. As my org has more than 10,000 severs Ec2. This is what I am expecting.
script to login into Ec2.
show output of df -h,lsblk & Java version.
Please anyone help me with the script.
You need to configure the EC2 nodes on Rundeck using a model source. To avoid configuring each EC2 node you can use the EC2 Model Source Plugin, take a look at this (check this to learn how to install plugins on Rundeck), after that you can create a job with an inline script step using a node filter pointing to your ec2 nodes.

how users should work with ambari cluster

My question is pretty trivial but didnt find anyone actually asking it.
We have a ambari cluster with spark storm hbase and hdfs(among other things).
I dont understand how a user that want to use that cluster use it.
for example, a user want to copy a file to hdfs, run a spark-shell or create new table in hbase shell.
should he get a local account on the server that run the cooresponded service? shouldn't he use a 3rd party machine(his own laptop for example)?
If so ,how one should use hadoop fs, there is no way to specify the server ip like spark-shell has.
what is the normal/right/expected way to run all these tasks from a user prespective.
Thanks.
The expected way to run the described tasks from the command line is as follows.
First, gain access to the command line of a server that has the required clients installed for the services you want to use, e.g. HDFS, Spark, HBase et cetera.
During the process of provisioning a cluster via Ambari, it is possible to define one or more servers where the clients will be installed.
Here you can see an example of an Ambari provisioning process step. I decided to install the clients on all servers.
Afterwards, one way to figure out which servers have the required clients installed is to check your hosts views in Ambari. Here you can find an example of an Ambari hosts view: check the green rectangle to see the installed clients.
Once you have installed the clients on one or more servers, these servers will be able to utilize the services of your cluster via the command line.
Just to be clear, the utilization of a service by a client is location-independent from the server where the service is actually running.
Second, make sure that you are compliant with the security mechanisms of your cluster. In relation to HDFS, this could influence which users you are allowed to use and which directories you can access by using them. If you do not use security mechanisms like e.g. Kerberos, Ranger and so on, you should be able to directly run your stated tasks from the command line.
Third, execute your tasks via command line.
Here is a short example of how to access HDFS without considering security mechanisms:
ssh user#hostxyz # Connect to the server that has the required HDFS client installed
hdfs dfs -ls /tmp # Command to list the contents of the HDFS tmp directory
Take a look on Ambari views, especially on Files view that allows browsing HDFS

Spark Cluster EC2 - SSH Error

When setting up a basic cluster using the supplied spark script for ec2, the cluster is created (1 mater, 1 slave) but I continuously get SSH hostname resolution errors.
This is because the host names are empty.
As understand it, the point of the script is that it creates the instances and so it should know the names and be able to resolve these as part of the setup.
So the question is - am I supposed to configure ec2-script before trying to launch the cluster?

Hadoop on Amazon Cloud

I'm trying to get set up on the Amazon Cloud to run some hadoop MapReduce jobs but I'm struggling to successfully create a cluster. I have downloaded the ec2 files, have my certificates and keypair file, but I believe it's the AMIs that are causing me trouble. If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the code "hadoop-ec2 launch-cluster name n" in the terminal. The master node is successful, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)" and I'm not entirely sure how to progress.
Also, some of my jobs will require an alteration in hadoops parameter settings (specifically the mapred-site.xml config file), is it possible to alter this file, and if so, how do I gain access to it? Is hadoop already installed on amazon machines, with this file accessible and alterable?
Thanks
Have you tried Amazon Elastic MapReduce? This is a simple API that brings up Hadoop clusters of a specified size on demand.
That's easier then to create own cluster manually.
But once the jobflow is finished by default it shuts the cluster down, leaving you with outputs on S3. If what you need is simply to do some crunching, this may be the way to go.
In case you need HDFS contents stored permanently (e.g. if you are running HBase on top of Hadoop) you may actually need own cluster on EC2. In this case you may find Cloudera's distribution of Hadoop for Amazon EC2 useful.
Altering Hadoop configuration on nodes it will start is possible using EC2 Bootstrap Actions:
Q: How do I configure Hadoop settings for my job flow?
The Elastic MapReduce default Hadoop configuration is appropriate for most workloads. However, based on your job flow’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your job flow on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.
About the way you are starting the cluster, please clarify:
If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the code "hadoop-ec2 launch-cluster name n" in the terminal. The master node is successful, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)" and I'm not entirely sure how to progress.
How exactly you are trying start it? What exactly AMIs are you using?

Monitoring instances in cloud

I usually use Munin as monitoring software, but this (as others software I presume) needs an IP to make the ICMP or whatever pings to collect data.
In Amazon EC2 instances are created on the fly, with IP's you don't know.
How can they be monitored ?
I was thinking about using amazon console commands to read the IP's of the instances up, and change the monit configuration file on the fly also , but it can be too complicated ... or not?
Any other solution / suggestion ?
Thank you
I use revealcloud to monitor my amazon instances. You can install it once and create an ami from that systen, or bootstrap the install command if that's your method. Since the install is just one command, it's easy enough to put into the rc.local (or similar). You can then see all the instances in the dashboard or topiew as soon as they boot up.
Our instances are bootstrapped using chef recipes, so it's easier for me to provide IPs/hosts as they (= all members of my cluster) get entered into /etc/hosts on start-up. Generally, it doesn't hurt to use elastic IPs for a master server and allow all connections (in /etc/munin/munin.conf by default).
I'd solve the security 'question' on the security groups level. E.g. allow only instances with a certain security group to connect to the munin-node process (on port 4949). The question which remains is.
E.g., using ec2-authorize you can achieve
ec2-authorize mygroup -o monitorgroup -u <AWS-USER-ID>
This means that all instances with group monitorgroup can access resources on instances with mygroup.
Let me know if this helps!
If your Munin master and nodes are all hosted on EC2 than it's better to use internal hosts like domU-00-00-00-00-00-00.compute-1.internal. because this way you don't have to deal with IP addresses and security groups.
You also have to set this in /etc/munin/munin-node.conf:
allow ^.*$
You can read more about it in Monitoring AWS Ubuntu Instances using Munin
But if your Munin master is not on EC2 your best bet is to attach Elastic IP to your EC2 instance.

Resources