How to separately specify a set of nodes for HDFS and others for MapReduce jobs? - hadoop

While deploying Hadoop, I want a set of nodes to run the HDFS server but not to run any MapReduce tasks.
For example, there are two nodes A and B that run HDFS.
I want to exclude the node A from running any map/reduce task.
How can I achieve it? Thanks

If you do not want MapReduce tasks to run on a particular node or set of nodes,
stopping the NodeManager daemon is the simplest option if it is already running.
Run this command on the nodes where MR tasks should not be attempted:
yarn-daemon.sh stop nodemanager
Or exclude the hosts using the yarn.resourcemanager.nodes.exclude-path property in yarn-site.xml:
<property>
<name>yarn.resourcemanager.nodes.exclude-path</name>
<value>/path/to/excludes.txt</value>
<description>Path of the file containing the hosts to exclude. Should be readable by YARN user</description>
</property>
After adding this property, refresh the ResourceManager:
yarn rmadmin -refreshNodes
The nodes listed in the file will be excluded from running MapReduce tasks.
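For reference, the excludes file is just a plain-text list of hostnames, one per line. A minimal sketch, assuming node A's hostname is nodeA.hadoop.cluster (a placeholder, not a value from the question):
nodeA.hadoop.cluster
After the refresh, the node should be reported as decommissioned by yarn node -list -all (on versions where the -all flag is available) and no new containers will be scheduled on it.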

I'll answer my own question.
If you use YARN for resource management,
check franklinsijo's answer.
If you use standalone (non-YARN) mode,
make a list of the nodes that should run MR tasks and specify its path with the 'mapred.hosts' property (documented in mapred-default: https://hadoop.apache.org/docs/r1.2.1/mapred-default.html).
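A minimal sketch of what that could look like, assuming classic (JobTracker-based) MapReduce and a placeholder include-file path, set in mapred-site.xml:
<property>
<name>mapred.hosts</name>
<value>/path/to/mr-includes.txt</value>
<description>Names a file that contains the list of nodes that may connect to the JobTracker. If empty, all hosts are permitted.</description>
</property>
The include file lists one hostname per line; leave node A out of it and the JobTracker will not accept a TaskTracker from that host.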

Related

How to submit an MR job to a YARN cluster with ResourceManager HA wrt Hortonworks' HDP?

I am trying to understand how to submit a MR job to Hadoop cluster, YARN based.
Case 1:
For the case in which there is only one ResourceManager (that is, no HA), we can submit the job like this (which I actually used and believe is correct):
hadoop jar word-count.jar com.example.driver.MainDriver -fs hdfs://master.hadoop.cluster:54310 -jt master.hadoop.cluster:8032 /first/dir/IP_from_hdfs.txt /result/dir
As can be seen, the RM is running on port 8032 and the NN on 54310, and I am specifying the hostname because there is only ONE master.
Case 2:
Now, for the case when there is HA for both the NN and the RM, how do I submit the job? I am not able to understand this, because now there are two RMs and two NNs (active/standby), and I understand that ZooKeeper keeps track of failures. So, from the perspective of a client trying to submit a job, do I need to know the exact NN and RM, or is there some logical name to use instead?
Can anyone please help me understand this?
With or without HA, the command to submit the job remains the same.
hadoop jar <jar> <mainClass> <inputpath> <outputpath> [args]
Using -fs and -jt is optional; they are not needed unless you want to specify a NameNode and JobTracker different from the ones in the configuration files.
If the fs.defaultFS property in core-site.xml, and the properties defining the nameservice (dfs.nameservices) and its NameNodes, are configured properly in the client's hdfs-site.xml, the active master will be chosen whenever a client operation is performed.
By default, the DFS client uses the following Java class to determine which NameNode is currently active:
<property>
<name>dfs.client.failover.proxy.provider.<nameserviceID></name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
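For context, a minimal sketch of the client-side HA configuration this relies on; the nameservice ID (mycluster) and the hostnames are placeholders, not values from the question.
core-site.xml:
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
hdfs-site.xml:
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>master1.hadoop.cluster:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>master2.hadoop.cluster:8020</value>
</property>
With this in place the job is submitted against the logical nameservice rather than a specific host, for example:
hadoop jar word-count.jar com.example.driver.MainDriver /first/dir/IP_from_hdfs.txt /result/dir
and the configured failover proxy provider resolves the active NameNode; the YARN client does the equivalent for the ResourceManagers listed under yarn.resourcemanager.ha.rm-ids.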

Does map-reduce work without YARN?

I'm studying Hadoop MapReduce on CentOS 6.5 with Hadoop 2.7.2. I learned that HDFS is just a distributed file system and that YARN manages MapReduce work, so I thought that if I don't start YARN (ResourceManager, NodeManager), MapReduce wouldn't work.
Therefore, I think wordcount should not be able to run the map-reduce process on a system where only HDFS is running, not YARN
(in pseudo-distributed mode).
But when I start HDFS without YARN, as shown below, and execute the wordcount example, it shows 'Map-Reduce Framework' counters. What does that mean? Is it possible for HDFS alone to process MapReduce without YARN? Since YARN manages resources and jobs, is it correct that MapReduce shouldn't work without YARN?
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /user/input /user/output
It is true that with Hadoop 2.0, YARN takes over responsibility for resource management. But even without YARN, MapReduce applications can run using the older flavor.
mapred-site.xml has a configuration property, mapreduce.framework.name:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>local</value>
</property>
</configuration>
This property controls whether or not YARN is used. The possible values are local, classic, or yarn.
The default value is "local". Set it to yarn if you want to use YARN.
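For example, a minimal mapred-site.xml that opts into YARN would just flip that value (same property as above):
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
With the value left at "local", the job runs inside a single JVM via the LocalJobRunner, which is why the wordcount example still prints 'Map-Reduce Framework' counters even though no YARN daemons are running.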

unable to see Task tracker and Jobtracker after Hadoop single node installation 2.5.1

I am new to Hadoop 2.5.1. Since I had already installed Hadoop 1.0.4 previously, I thought the installation process would be the same, so I followed this tutorial:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
Everything was fine. I have even given these settings in core-site.xml:
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
But I have seen this value as 9000 on several sites.
I also made changes in yarn-site.xml.
Still, everything works fine when I run a MapReduce job. But my question is:
when I run the jps command it gives me this output:
hduser#secondmaster:~$ jps
5178 ResourceManager
5038 SecondaryNameNode
4863 DataNode
5301 NodeManager
4719 NameNode
6683 Jps
I don't see TaskTracker and JobTracker in the jps output. Where are these daemons running?
And without these daemons, how am I able to run a MapReduce job?
Thanks,
Sreelatha K.
From Hadoop 2.0 onwards, the default processing framework has changed from classic MapReduce to YARN. You are using YARN, and in YARN you cannot see a JobTracker or TaskTracker. The JobTracker and TaskTracker are replaced by the ResourceManager and NodeManager, respectively.
But you still have the option to use the classic MapReduce framework instead of YARN.
In Hadoop 2 there is an alternative way to run MapReduce jobs, called YARN. Since you have made changes in yarn-site.xml, MapReduce processing happens using YARN, not the traditional MapReduce framework. That is probably the reason why you don't see TaskTracker and JobTracker listed in the jps output. Note that ResourceManager and NodeManager are the YARN daemons.
YARN is the next-generation resource manager, which can integrate with Apache Spark, Storm, and many other tools beyond classic MapReduce jobs.

DataNode doesn't start in one of the slaves

I am trying to configure Hadoop with 5 slaves. After I run start-dfs.sh on the master, there is one slave node that doesn't run the DataNode. I tried looking for differences in that node's configuration files but didn't find anything.
There WAS a difference in the configuration files! In core-site.xml the hadoop.tmp.dir variable was set to an invalid directory, so it couldn't be created when the DataNode started. Lesson learned: look in the logs (thanks Chris).
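For anyone hitting the same issue, a minimal sketch of the fix in core-site.xml, with /data/hadoop/tmp as a placeholder for any directory the Hadoop user can create and write to:
<property>
<name>hadoop.tmp.dir</name>
<value>/data/hadoop/tmp</value>
</property>
The DataNode log under $HADOOP_HOME/logs on the failing slave is where the path or permission error shows up.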

How to remove a hadoop node from DFS but not from Mapred?

I am fairly new to Hadoop. For running some benchmarks, I need a variety of Hadoop configurations for comparison.
I want to know how to remove a Hadoop slave from DFS (stop running the DataNode daemon) but not from MapReduce (keep the TaskTracker running), or vice versa.
AFAIK, there is a single slaves file for such Hadoop nodes, not separate slaves files for DFS and MapReduce.
Currently, I start both DFS and MapReduce on the slave node and then kill the DataNode on that slave. But it takes a while for the node to appear under 'dead nodes' in the HDFS GUI. Can any parameter be tuned to make this timeout quicker?
Thanks
Try using dfs.hosts and dfs.hosts.exclude in hdfs-site.xml, and mapred.hosts and mapred.hosts.exclude in mapred-site.xml. These allow or exclude hosts from connecting to the NameNode and the JobTracker.
Once the lists of nodes in the files have been updated appropriately, the NameNode and the JobTracker have to be refreshed using the hadoop dfsadmin -refreshNodes and hadoop mradmin -refreshNodes commands, respectively.
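A minimal sketch of the HDFS side, with /path/to/dfs-excludes.txt as a placeholder path; the mapred.hosts.exclude property in mapred-site.xml follows the same pattern:
<property>
<name>dfs.hosts.exclude</name>
<value>/path/to/dfs-excludes.txt</value>
</property>
Running hadoop dfsadmin -refreshNodes then decommissions the listed DataNode while the TaskTracker keeps running, which moves the node to 'dead/decommissioned' much faster than killing the daemon and waiting for the heartbeat timeout.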
Instead of using the slaves file to start all processes on your cluster, you can start only the required daemons on each machine if you have few nodes.
