How to submit an MR job to a YARN cluster with ResourceManager HA wrt Hortonworks' HDP? - hadoop

I am trying to understand how to submit a MR job to Hadoop cluster, YARN based.
Case 1:
For the case in which there is only one ResourceManager (that is, no HA), we can submit the job like this (which I actually used and believe is correct).
hadoop jar word-count.jar com.example.driver.MainDriver -fs hdfs://master.hadoop.cluster:54310 -jt master.hadoop.cluster:8032 /first/dir/IP_from_hdfs.txt /result/dir
As can be seen, the RM is running on port 8032 and the NN on 54310, and I am specifying the hostnames because there is only ONE master.
Case 2:
Now, for the case where there is HA for both the NN and the RM, how do I submit the job? I am not able to understand this, because now we have two RMs and two NNs (active/standby), and I understand that ZooKeeper keeps track of failures. So, from the perspective of a client trying to submit a job, do I need to know the exact NN and RM to submit to, or is there some logical name that we have to use when submitting the job?
Can anyone please help me understand this?

With or without HA, the command to submit the job remains the same.
hadoop jar <jar> <mainClass> <inputpath> <outputpath> [args]
Using -fs and -jt is optional; they are not needed unless you want to specify a NameNode and JobTracker different from the ones in the configuration files.
If the fs.defaultFS property in core-site.xml and the properties defining the nameservice (dfs.nameservices) and its namenodes are configured properly in hdfs-site.xml of the client, the Active Master will be chosen whenever a client operation is performed.
By default, the following Java class is used by the DFS client to determine which NameNode is currently Active:
<property>
<name>dfs.client.failover.proxy.provider.<nameserviceID></name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
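As a concrete illustration, here is a minimal sketch of the client-side configuration for case 2 and of how the submission changes. The nameservice ID mycluster, the RM IDs rm1/rm2, the master1/master2 host names and the NN RPC port 8020 are assumptions chosen for the example; an HDP cluster will have its own values (Ambari normally generates them).
core-site.xml (client) - the default FS points at the logical nameservice, not a host:
<property><name>fs.defaultFS</name><value>hdfs://mycluster</value></property>
hdfs-site.xml (client) - the nameservice, its two NameNodes and the failover proxy provider:
<property><name>dfs.nameservices</name><value>mycluster</value></property>
<property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>master1.hadoop.cluster:8020</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>master2.hadoop.cluster:8020</value></property>
<property><name>dfs.client.failover.proxy.provider.mycluster</name><value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value></property>
yarn-site.xml (client) - RM HA, so the client can locate the active ResourceManager on its own:
<property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
<property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
<property><name>yarn.resourcemanager.hostname.rm1</name><value>master1.hadoop.cluster</value></property>
<property><name>yarn.resourcemanager.hostname.rm2</name><value>master2.hadoop.cluster</value></property>
With those in place, the job from case 1 is submitted without -fs or -jt, and any fully qualified path uses the nameservice instead of a NameNode host:
hadoop jar word-count.jar com.example.driver.MainDriver hdfs://mycluster/first/dir/IP_from_hdfs.txt /result/dir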

Related

How to separately specify a set of nodes for HDFS and others for MapReduce jobs?

While deploying hadoop, I want some set of nodes to run HDFS server but not to run any MapReduce tasks.
For example, there are two nodes A and B that run HDFS.
I want to exclude the node A from running any map/reduce task.
How can I achieve it? Thanks
If you do not want any MapReduce jobs to run on a particular node or set of nodes:
Stopping the NodeManager daemon is the simplest option if it is already running.
Run this command on each node where MR tasks should not be attempted.
yarn-daemon.sh stop nodemanager
Or exclude the hosts using the property yarn.resourcemanager.nodes.exclude-path in yarn-site.xml
<property>
<name>yarn.resourcemanager.nodes.exclude-path</name>
<value>/path/to/excludes.txt</value>
<description>Path of the file containing the hosts to exclude. Should be readable by YARN user</description>
</property>
After adding this property, refresh the ResourceManager:
yarn rmadmin -refreshNodes
The nodes listed in the file will be excluded from running MapReduce tasks.
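For example, to keep node A from the question out of MapReduce, the excludes file just lists that host, one hostname per line (the name below is made up):
nodeA.hadoop.cluster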
Answering my own question:
If you use YARN for resource management,
check franklinsijo's answer above.
If you use standalone (MRv1) mode,
make a list of the nodes that are allowed to run MR tasks and set its path in the 'mapred.hosts' property (documented in mapred-default: https://hadoop.apache.org/docs/r1.2.1/mapred-default.html).
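In classic MRv1 that looks roughly like the following in mapred-site.xml; the includes path is hypothetical:
<property>
<name>mapred.hosts</name>
<value>/path/to/includes.txt</value>
<description>File naming the hosts allowed to connect to the JobTracker; if empty, all hosts are permitted</description>
</property>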

Does map-reduce work without Yarn?

I'm studying hadoop map-reduce on CentOS 6.5 and hadoop 2.7.2. I learned that HDFS is just a distributed file system and that YARN administers map-reduce work, so I thought that if I don't turn on YARN (resource manager, node manager), map-reduce doesn't work.
Therefore, I think wordcount should not be able to run a map-reduce process on a system running only HDFS, not YARN
(in pseudo-distributed mode).
But when I turn on HDFS but not YARN, as you see below, and execute the wordcount example, it shows 'Map-Reduce Framework' counters. What does that mean? Can HDFS alone process map-reduce without YARN? Since YARN manages resources and jobs, is it right that map-reduce doesn't work without YARN?
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /user/input /user/output
With Hadoop 2.0, YARN takes responsibility for resource management; that is true. But even without YARN, MapReduce applications can run using the older flavor.
mapred-site.xml has a configuration property, mapreduce.framework.name:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>local</value>
</property>
</configuration>
The above can be configured to choose whether or not to use YARN. The possible values for this property are local, classic or yarn.
The default value is "local". Set this to yarn if you want to use YARN.
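You can also see the difference per job without editing mapred-site.xml: the stock 2.7.2 wordcount example parses generic -D options, so something like the following works (the input/output paths are the ones from the question):
# run in-process with the LocalJobRunner - no YARN daemons needed
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount -Dmapreduce.framework.name=local /user/input /user/output
# run on YARN - ResourceManager and NodeManager must be up
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount -Dmapreduce.framework.name=yarn /user/input /user/output
The first command is what effectively happened in the question: with only HDFS running and the framework left at its default of local, the mappers and reducers run inside the client JVM, which is why the output still shows the Map-Reduce Framework counters.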

Why are we configuring mapred.job.tracker in YARN?

What I know is that YARN was introduced and that it replaced the JobTracker and TaskTracker.
I have seen some Hadoop 2.6.0/2.7.0 installation tutorials and they configure mapreduce.framework.name as yarn and the mapred.job.tracker property as local or host:port.
The description for the mapred.job.tracker property is:
"The host and port that the MapReduce job tracker runs at. If "local",
then jobs are run in-process as a single map and reduce task."
My doubt is: why are we configuring it if we are using YARN? I mean, the JobTracker shouldn't be running, right?
Forgive me if my question is dumb.
Edit: These are the tutorials I was talking about.
http://chaalpritam.blogspot.in/2015/01/hadoop-260-multi-node-cluster-setup-on.html
http://pingax.com/install-apache-hadoop-ubuntu-cluster-setup/
https://chawlasumit.wordpress.com/2015/03/09/install-a-multi-node-hadoop-cluster-on-ubuntu-14-04/
This is just a guess, but either those tutorials talking about configuring the JobTracker in YARN are written by people who don't know what YARN is, or they set it in case you decide to stop working with YARN someday. You are right: the JobTracker and TaskTracker do not exist in YARN. You can add the properties if you want, but they will be ignored. New properties for each of the components replacing the JobTracker and the TaskTracker were added with YARN, such as yarn.resourcemanager.address, which replaces the old mapred.job.tracker (renamed mapreduce.jobtracker.address in MRv2).
If you list your Java processes when running Hadoop under YARN, you see no JobTracker or TaskTracker:
10561 Jps
20605 NameNode
17176 DataNode
18521 ResourceManager
19625 NodeManager
18424 JobHistoryServer
You can read more about how YARN works here.
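To make it concrete, under YARN the tutorials only need something like the following instead of mapred.job.tracker; the host name is borrowed from the first question on this page and 8032 is the default RM port:
In mapred-site.xml:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
In yarn-site.xml, replacing the old JobTracker endpoint:
<property>
<name>yarn.resourcemanager.address</name>
<value>master.hadoop.cluster:8032</value>
</property>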

Hadoop jobs fail when submitted by users other than yarn (MRv2) or mapred (MRv1)

I am running a test cluster running MRv1 (CDH5) paired with LocalFileSystem, and the only user I am able to run jobs as is mapred (as mapred is the user starting the jobtracker/tasktracker daemons). When submitting jobs as any other user, the jobs fail because the jobtracker/tasktracker is unable to find the job.jar under the .staging directory.
I have the exact same issue with YARN (MRv2) when paired with LocalFileSystem, i.e. when submitting jobs by a user other than 'yarn', the application master is unable to locate the job.jar under the .staging directory.
Upon inspecting the .staging directory of the user submitting the job, I found that job.jar exists under the .staging/<job id> directory, but the permissions on the <job id> and .staging directories are set to 700 (drwx------), and hence the application master / tasktracker is not able to access the job.jar and supporting files.
We are running the test cluster with LocalFileSystem since we use only MapReduce part of the Hadoop project paired with OCFS in our production setup.
Any assistance in this regard would be immensely helpful.
You need to set up a staging directory for each user in the cluster. This is not as complicated as it sounds.
Check the following properties:
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-${user.name}</value>
<source>core-default.xml</source>
</property>
This basically sets up a tmp directory for each user.
Tie this to your staging directory:
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>${hadoop.tmp.dir}/mapred/staging</value>
<source>mapred-default.xml</source>
</property>
Let me know if this works or if it is already set up this way.
These properties belong in core-site.xml and mapred-site.xml respectively (their defaults come from core-default.xml and mapred-default.xml), if I remember correctly.
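To make the substitution concrete, with the default values shown above the staging directory resolves per user; for a hypothetical user alice:
mapreduce.jobtracker.staging.root.dir = ${hadoop.tmp.dir}/mapred/staging
                                      = /tmp/hadoop-${user.name}/mapred/staging
                                      = /tmp/hadoop-alice/mapred/staging
so each submitting user gets a staging tree owned by that user instead of everyone sharing one location.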
This worked for me; I just set this property in MR v1:
<property>
<name>hadoop.security.authorization</name>
<value>simple</value>
</property>
Please go through this:
Access Control Lists
${HADOOP_CONF_DIR}/hadoop-policy.xml defines an access control list for each Hadoop service. Every access control list has a simple format:
The list of users and the list of groups are both comma-separated lists of names. The two lists are separated by a space.
Example: user1,user2 group1,group2.
Add a blank at the beginning of the line if only a list of groups is to be provided; equivalently, a comma-separated list of users followed by a space or nothing implies only the given set of users.
A special value of * implies that all users are allowed to access the service.
Refreshing Service Level Authorization Configuration
The service-level authorization configuration for the NameNode and JobTracker can be changed without restarting either of the Hadoop master daemons. The cluster administrator can change ${HADOOP_CONF_DIR}/hadoop-policy.xml on the master nodes and instruct the NameNode and JobTracker to reload their respective configurations via the -refreshServiceAcl switch to dfsadmin and mradmin commands respectively.
Refresh the service-level authorization configuration for the NameNode:
$ bin/hadoop dfsadmin -refreshServiceAcl
Refresh the service-level authorization configuration for the JobTracker:
$ bin/hadoop mradmin -refreshServiceAcl
Of course, one can use the security.refresh.policy.protocol.acl property in ${HADOOP_CONF_DIR}/hadoop-policy.xml to restrict access to the ability to refresh the service-level authorization configuration to certain users/groups.
Examples
Allow only users alice, bob and users in the mapreduce group to submit jobs to the MapReduce cluster:
<property>
<name>security.job.submission.protocol.acl</name>
<value>alice,bob mapreduce</value>
</property>
Allow only DataNodes running as the users who belong to the group datanodes to communicate with the NameNode:
<property>
<name>security.datanode.protocol.acl</name>
<value>datanodes</value>
</property>
Allow any user to talk to the HDFS cluster as a DFSClient:
<property>
<name>security.client.protocol.acl</name>
<value>*</value>
</property>
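Tying back to security.refresh.policy.protocol.acl mentioned above, a sketch that restricts the refresh operation itself to a hypothetical hadoopadmin group (note the leading blank in the value: groups only, no users):
<property>
<name>security.refresh.policy.protocol.acl</name>
<value> hadoopadmin</value>
</property>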

Start namenode without formatting

I tried to start the namenode using bin/start-all.sh, but this command doesn't start the namenode. I know that if I do bin/hadoop namenode -format, the namenode will start, but in that case I will lose all my data. Is there a way to start the namenode without formatting it?
Your problem might be related to the following:
Hadoop writes its NameNode data to the /tmp/hadoop-<username> folder by default, which is cleaned on every reboot.
Add the following property to conf/hdfs-site.xml:
<property>
<name>dfs.name.dir</name>
<value><path to your desired folder></value>
</property>
The "dfs.name.dir" property allows you to control where Hadoop writes NameNode metadata.
bin/start-all.sh should start the namenode, as well as the datanodes, the jobtracker and the tasktrackers. So, check the log of the namenode for possible errors.
An alternative way to skip starting the jobtracker and the tasktrackers and just start the namenode (and the datanodes) is by using the command:
bin/start-dfs.sh
Actually, bin/start-all.sh is equivalent to using the commands:
bin/start-dfs.sh, which starts the namenode and datanodes and
bin/start-mapred.sh, which starts the jobtracker and the tasktrackers.
For more details, visit this page.
