How to submit a job to specifc nodes in Hadoop? - hadoop

I have a Hadoop cluster with 1 Master and 5 slaves. Is there any way of submitting jobs to specific set of slaves? Basically what i am trying to do is benchmark my application with many possibilities. So after testing with 5 slaves, I would like to run my application with 4 slaves and then 3 slaves and so on.
Currently the only way I know of is decommissioning a slave and removing from the hadoop cluster. But that seems to be a tedious task. I was wondering if there is an easier approach so as to avoid removing a node from the cluster.
Thanks.

In hadoop/conf there is a file called 'slaves' here you can simply add or remove nodes, and then restart your dfs and mapred.

There is a setting that points to a file with a list of excluded hosts you can set in the mapred-site-xml. Though also a bit cumbersome, changing a single configuration value might be preferable physically decommissioning and recommissioning multiple nodes. You could prepare multiple host exclusion files in advance, change the setting and restart the mapreduce service. Restarting the mapreduce service is pretty quick.
In 0.23 this setting is named mapreduce.jobtracker.hosts.exclude.filename. This is a feature introduced in 0.21, though I believe the setting was named mapred.hosts.exclude then. Check what this setting is called for the version of Hadoop you are using.

For those who encounter this problem, comments from Alex and stackoverflow question will help in successfully decommissioning a node from hadoop cluster.
EDIT : Just editing files hdfs-site.xml and mapred-site.xml and executing hadoop dfsadmin -refreshNodes might put your datanode into decommissioning node status for a long time. So it is also necessary to change dfs.replication to an appropriate value.

Related

hadoop 50070 Datanode tab won't show both data nodes

I know it is a duplicate question but since the other questions did not have answers with it, I am reposting it. I have recently installed hadoop cluster using 2 VMs on my laptop. I could go and checkout port 50070 and under datanodes tab I can see only one data node, but I have 2 data nodes, one on master node and other on slave node. What could be the reasons?
Sorry, feels like it's been a time. But still I'd like to share my answer: the root cause is from hadoop/etc/hadoop/hdfs-site.xml: the xml file has a property named dfs.datanode.data.dir. If you set all the datanodes with the same name, then hadoop is assuming the cluster has only one datanode. So the proper way of doing it is naming every datanode with a unique name:
Regards, YUN HANXUAN

How do I configure and reboot an HDInsight cluster running on Azure?

Specifically, I want to change the maximum number of mappers and the maximum number of reducers for each node in an HDInsight cluster running on Microsoft Azure.
Using remote desktop, I logged in to the head node. I edited the mapred-site.xml file on the head node and changed the mapred.tasktracker.map.tasks.maximum and the mapred.tasktracker.reduce.tasks.maximum values. I tried rebooting the head node, but I was not able to reboot. I used the start-onebox.cmd and stop-onebox.cmd scripts to try and start/stop HDInsight.
I then ran a streaming mapreduce passing the desired number of reducers to the hadoop-streaming.jar, but the number of reducers was still limited by the previous value of mapred.tasktracker.reduce.tasks.maximum. Most of my reducers were pending execution.
Do I need to change the mapred-site.xml file on every node? Is there an easy way to change this, or do I need to remote desktop into every node? How do I reboot or restart the cluster so that my new values are used?
Thanks
I know it has been a while since the question was posted, but I would like to post for other users who may find useful.
There are 2 ways you can change Hadoop configuration files (such as mapred-site.xml, hive-site.xml etc) on HDinsight
Option #1:
This is the easiest - you can supply the hadoop configuration values per job, as shown in this blog
Option #2:
You can customize HDinsight cluster with hadoop configuration values during provisioning or installing a cluster, as shown in this blog
Manually modifying a config file is not supported and the change will be lost when the Azure VM gets re-imaged.

How to separate Hadoop MapReduce from HDFS?

I'm curious if you could essentially separate the HDFS filesystem from the MapReduce framework. I know that the main point of Hadoop is to run the maps and reduces on the machines with the data in question, but I was wondering if you could just change the *.xml files to change the configuration of what machine the jobtracker, namenode and datanodes are running on.
Currently, my configuration is a 2 VMs setup: one (the master) with Namenode, Datanode, JobTracker, Tasktracker (and the SecondaryNameNode), the other (the slave) with DataNode, Tasktraker. Essentially, what I want to change is have the master with NameNode DataNode(s), JobTracker, and have the slave with only the TaskTracker to perform the computations (and later on, have more slaves with only TaskTrackers on them; one on each). The bottleneck will be the data transfer between the two VMs for the computations of maps and reduces, but since the data at this stage is so small I'm not primarily concerned with it. I would just like to know if this configuration is possible, and how to do it. Any tips?
Thanks!
You don't specify this kind of options in the configuration files.
What you have to do is to take care of what kind of deamons you start on each machine(you call them VMs but I think you mean machines).
I suppose you usually start everything using the start-all.sh script which you can find in the bin directory under the hadoop installation dir.
If you take a look at this script you will see that what it does is to call a number of sub-scripts corresponding to starting the datanodes, tasktrackers and namenode, jobtracker.
In order to achive what you've said, I would do like this:
Modify the masters and slaves files as this:
Master file should contain the name of machine1
Slaves should contain the name of machine2
Run start-mapred.sh
Modify the masters and slaves files as this:
Master file should contain the machine1
Slaves file should contain machine1
Run start-dfs.sh
I have to tell you that I've never tried such a configuration so I'm not sure this is going to work but you can give it a try. Anyway the solution is in this direction!
Essentially, what I want to change is have the master with NameNode DataNode(s), JobTracker, and have the slave with only the TaskTracker to perform the computations (and later on, have more slaves with only TaskTrackers on them; one on each).
First, I am not sure why to separate the computation from the storage. The whole purpose of MR locality is lost, thought you might be able to run the job successfully.
Use the dfs.hosts, dfs.hosts.exclude parameters to control which datanodes can connect to the namenode and the mapreduce.jobtracker.hosts.filename, mapreduce.jobtracker.hosts.exclude.filename parameters to control which tasktrackers can connect to the jobtracker. One disadvantage of this approach is that the datanodes and tasktrackers are started on the nodes which are excluded and aren't part of the Hadoop cluster.
Another approach is to modify the code to have a separate slave file for the tasktracker and the datanode. Currently, this is not supported in Hadoop and would require a code change.

Topologies for hadoop cluster?

In what topologies we can make hadoop cluster? Currently only running with tree structure,single master and multiple slaves. Am looking for more variants like multiple master etc.
Your question is interesting, after seeing the question for any tool or framework which provides multiple masters in Hadoop. But, i didn't find one.
Coming to hadoop, NameNode (Master) - Multiple DataNodes (Slaves), so according to hadoop's design perceptive this fair enough.
But, their is another addition to the NameNode called as secondary namenode, did n't think about it as a another master. refer here http://www.cloudera.com/blog/2009/02/multi-host-secondarynamenode-configuration/

Hadoop dfs.include file

Please explain what's dfs.include file purpose and how to define it.
I've added a new node to the Hadoop cluster but it's not identified by the namenode. In one of the posts I found that dfs.include can resolve this issue.
Thank you in advance,
Vladi
Just including the node name in the dfs.include and mapred.include is not sufficient. The slave file has to be updated on the namenode/jobtracker. The tasktracker and the datanode have to be started on the new node and the refreshNodes command has to be run on the NameNode and the JobTracker to make them aware of the new node.
Here are the instructions on how to do this.
According to the 'Hadoop : The Definitive Guide'
The file (or files) specified by the dfs.hosts and mapred.hosts properties is different from the slaves file. The former is used by the namenode and jobtracker to determine which worker nodes may connect. The slaves file is used by the Hadoop control scripts to perform cluster-wide operations, such as cluster restarts. It is never used by the Hadoop
daemons.

Resources