How do I configure and reboot an HDInsight cluster running on Azure? - hadoop

Specifically, I want to change the maximum number of mappers and the maximum number of reducers for each node in an HDInsight cluster running on Microsoft Azure.
Using remote desktop, I logged in to the head node. I edited the mapred-site.xml file on the head node and changed the mapred.tasktracker.map.tasks.maximum and the mapred.tasktracker.reduce.tasks.maximum values. I tried rebooting the head node, but I was not able to reboot. I used the start-onebox.cmd and stop-onebox.cmd scripts to try and start/stop HDInsight.
I then ran a streaming mapreduce passing the desired number of reducers to the hadoop-streaming.jar, but the number of reducers was still limited by the previous value of mapred.tasktracker.reduce.tasks.maximum. Most of my reducers were pending execution.
Do I need to change the mapred-site.xml file on every node? Is there an easy way to change this, or do I need to remote desktop into every node? How do I reboot or restart the cluster so that my new values are used?
Thanks

I know it has been a while since the question was posted, but I would like to post for other users who may find useful.
There are 2 ways you can change Hadoop configuration files (such as mapred-site.xml, hive-site.xml etc) on HDinsight
Option #1:
This is the easiest - you can supply the hadoop configuration values per job, as shown in this blog
Option #2:
You can customize HDinsight cluster with hadoop configuration values during provisioning or installing a cluster, as shown in this blog
Manually modifying a config file is not supported and the change will be lost when the Azure VM gets re-imaged.

Related

How to delete datanode from hadoop clusters without losing data

I want to delete datanode from my hadoop cluster, but don't want to lose my data. Is there any technique so that data which are there on the node which I am going to delete may get replicated to the reaming datanodes?
What is the replication factor of your hadoop cluster?
If it is default which is generally 3, you can delete the datanode directly since the data automatically gets replicated. this process is generally controlled by name node.
If you changed the replication factor of the cluster to 1, then if you delete the node, the data in it will be lost. You cannot replicate it further.
Check all the current data nodes are healthy, for these you can go to the Hadoop master admin console under the Data nodes tab, the address is normally something link http://server-hadoop-master:50070
Add the server you want to delete to the files /opt/hadoop/etc/hadoop/dfs.exclude using the full domain name in the Hadoop master and all the current datanodes (your config directory installation can be different, please double check this)
Refresh the cluster nodes configuration running the command hdfs dfsadmin -refreshNodes from the Hadoop name node master
Check the Hadoop master admin home page to check the state of the server to remove at the "Decommissioning" section, this may take from couple of minutes to several hours and even days depending of the volume of data you have.
Once the server is shown as decommissioned complete, you may delete the server.
NOTE: if you have other services like Yarn running on the same server, the process is relative similar but with the file /opt/hadoop/etc/hadoop/yarn.exclude and then running yarn rmadmin -refreshNodes from the Yarn master node

where to find the M/R configuration file and update it

our Hadoop cluster shows the job tracker process eat up memory gradually that we have to restart the cluster every week. I searched around for the possible solution for this. one of the post mentioned to decrease 'mapred.jobtracker.completeuserjobs.maximum' to 5, so i checked mapred-site.xml under /hadoop-install/conf directory on name node and found there are two entries for that parameter, one set it to 30, the other set it to 5, when i goto any of the data node and check mapred-site.xml, i don't find the setting for that parameter at all. however when I checked running job on M/R administration page and checked their job file, it showed that parameters set to 100. I'm really confused where does this parameter is set. and if i updated it, do i need to restart the cluster? we are running apache Hadoop 1.2.1 on google cloud
Hadoop does not automatically copy the configuration files from your driver machine to all of the cluster machines You need to do that via scp and/or rsync or preferably an automated deployment tool like chef, ansible, puppet, etc.
As far as the individual job parameters: you can actually set them on per-job basis by using -D:
<path to jar>/myHadoopJobJar.jar -Dmapred.jobtracker.completeuserjobs.maximum=5

How to submit a job to specifc nodes in Hadoop?

I have a Hadoop cluster with 1 Master and 5 slaves. Is there any way of submitting jobs to specific set of slaves? Basically what i am trying to do is benchmark my application with many possibilities. So after testing with 5 slaves, I would like to run my application with 4 slaves and then 3 slaves and so on.
Currently the only way I know of is decommissioning a slave and removing from the hadoop cluster. But that seems to be a tedious task. I was wondering if there is an easier approach so as to avoid removing a node from the cluster.
Thanks.
In hadoop/conf there is a file called 'slaves' here you can simply add or remove nodes, and then restart your dfs and mapred.
There is a setting that points to a file with a list of excluded hosts you can set in the mapred-site-xml. Though also a bit cumbersome, changing a single configuration value might be preferable physically decommissioning and recommissioning multiple nodes. You could prepare multiple host exclusion files in advance, change the setting and restart the mapreduce service. Restarting the mapreduce service is pretty quick.
In 0.23 this setting is named mapreduce.jobtracker.hosts.exclude.filename. This is a feature introduced in 0.21, though I believe the setting was named mapred.hosts.exclude then. Check what this setting is called for the version of Hadoop you are using.
For those who encounter this problem, comments from Alex and stackoverflow question will help in successfully decommissioning a node from hadoop cluster.
EDIT : Just editing files hdfs-site.xml and mapred-site.xml and executing hadoop dfsadmin -refreshNodes might put your datanode into decommissioning node status for a long time. So it is also necessary to change dfs.replication to an appropriate value.

Cloudera installation Doubts?

I am new to cloudera, I installed cloudera in my system successfully I have two doubts,
Consider a machine with some nodes already using hadoop with some data, Can we install Cloudera to use the existing Hadoop without made any changes or modifaction on data stored existing hadooop.
I installed Cloudera in my machine, I have another three machines to add those as clusters, I want to know, Am i want install cloudera in those three machines before add those machines as clusters ?, or Can we add a node as clusters without installing cloudera on that purticular nodes?.
Thanks in advance can anyone, please give some information about the above questions.
Answer to questions -
1. If you want to migrate to CDH from existing Apache Distribution, you can follow this link
Excerpt:
Overview
The migration process does require a moderate understanding of Linux
system administration. You should make a plan before you start. You
will be restarting some critical services such as the name node and
job tracker, so some downtime is necessary. Given the value of the
data on your cluster, you’ll also want to be careful to take recent
back ups of any mission-critical data sets as well as the name node
meta-data.
Backing up your data is most important if you’re upgrading from a
version of Hadoop based on an Apache Software Foundation release
earlier than 0.20.
2.CDH binary needs be installed and configured in all the nodes to have a CDH based cluster up and running.
From the Cloudera Manual
You can migrate the data from a CDH3 (or any Apache Hadoop) cluster to a CDH4 cluster by
using a tool that copies out data in parallel, such as the DistCp tool
offered in CDH4.
Other sources
Regarding your second question,
Again from the manual page
Important:
Before proceeding, you need to decide:
As a general rule:
The NameNode and JobTracker run on the the same "master" host unless
the cluster is large (more than a few tens of nodes), and the master
host (or hosts) should not
run the Secondary NameNode (if used), DataNode or TaskTracker
services. In a large cluster, it is especially important that the
Secondary NameNode (if used) runs on a separate machine from the
NameNode. Each node in the cluster except the master host(s) should
run the DataNode and TaskTracker services.
Additionally, if you use Cloudera Manager it will automatically do all the setup necessary i.e install the necessary selected components on the nodes in the cluster.
Off-topic: I had a bad habit of not referrring the manual properly. Have a clear look at it, it answers all our questions
Answer to your second question,
you can add directly, with installation few pre requisites like openssh-clients and firewalls and java.
these machines( existing node, new three nodes) should accept same username and password (or) you should set passwordless ssh to these hosts..
you should connect to the internet while adding the nodes.
I hope it will help you:)

Hadoop on Amazon Cloud

I'm trying to get set up on the Amazon Cloud to run some hadoop MapReduce jobs but I'm struggling to successfully create a cluster. I have downloaded the ec2 files, have my certificates and keypair file, but I believe it's the AMIs that are causing me trouble. If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the code "hadoop-ec2 launch-cluster name n" in the terminal. The master node is successful, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)" and I'm not entirely sure how to progress.
Also, some of my jobs will require an alteration in hadoops parameter settings (specifically the mapred-site.xml config file), is it possible to alter this file, and if so, how do I gain access to it? Is hadoop already installed on amazon machines, with this file accessible and alterable?
Thanks
Have you tried Amazon Elastic MapReduce? This is a simple API that brings up Hadoop clusters of a specified size on demand.
That's easier then to create own cluster manually.
But once the jobflow is finished by default it shuts the cluster down, leaving you with outputs on S3. If what you need is simply to do some crunching, this may be the way to go.
In case you need HDFS contents stored permanently (e.g. if you are running HBase on top of Hadoop) you may actually need own cluster on EC2. In this case you may find Cloudera's distribution of Hadoop for Amazon EC2 useful.
Altering Hadoop configuration on nodes it will start is possible using EC2 Bootstrap Actions:
Q: How do I configure Hadoop settings for my job flow?
The Elastic MapReduce default Hadoop configuration is appropriate for most workloads. However, based on your job flow’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your job flow on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.
About the way you are starting the cluster, please clarify:
If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the code "hadoop-ec2 launch-cluster name n" in the terminal. The master node is successful, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)" and I'm not entirely sure how to progress.
How exactly you are trying start it? What exactly AMIs are you using?

Resources