Configure Oozie workflow properties for HA JobTracker

With an Oozie workflow, you have to specify the cluster's JobTracker in the properties for the workflow. This is easy when you have a single JobTracker:
jobTracker=hostname:port
When the cluster is configured for an HA (high availability) JobTracker, I need to set up my properties files so they can hit either of the JobTracker hosts, without having to update all my properties files when the JobTracker has failed over to the second node.
When accessing one JobTracker over HTTP, it will redirect to the other if it isn't running, but Oozie doesn't use HTTP, so there is no redirect, and the workflow fails if the properties file specifies the JobTracker host that is not running.
How can I configure my properties file to handle a JobTracker running in HA?

I just finished setting up some Oozie workflows to use HA JobTrackers and NameNodes. The key is to use the logical name of the HA service you configured, and not any individual hostnames or ports. For example, the default HA JobTracker name is 'logicaljt'. Replace hostname:port with 'logicaljt', and everything should just work, as long as the node from which you're running Oozie has the appropriate hdfs-site and mapred-site configs properly installed (implicitly due to being part of the cluster, or explicitly due to adding a gateway role to it).
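For instance, a minimal job.properties sketch along those lines (the names 'logicaljt' and 'nameservice1' and the workflow path are placeholders for whatever your cluster's mapred-site.xml and hdfs-site.xml define):
# job.properties -- use the logical HA names, not individual hosts/ports
jobTracker=logicaljt
nameNode=hdfs://nameservice1
oozie.wf.application.path=${nameNode}/user/${user.name}/apps/my-workflow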

Please specify the nameservice of the cluster on which HA is enabled,
e.g. in the properties file:
namenode=hdfs://<nameservice>
jobTracker=<nameservice>:8032
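For the logical names above to resolve, the node submitting the Oozie job needs the HA definitions in its hdfs-site.xml. A minimal sketch, assuming a nameservice called 'mycluster' with NameNodes nn1 and nn2 (all hostnames are placeholders):
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>host1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>host2.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>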

Related

Adding New Datanodes to An Existing Hadoop Cluster

How should I add a new datanode to an existing hadoop cluster?
Do I just stop all, set up a new datanode server like the existing datanodes, add the new server's IP to the namenode, and change the number of slaves to the correct number?
Another question is: After I add a new datanode to the cluster, do I need to do anything to balance all datanodes or "re-distribute" the existing files and directories to different datanodes?
For Apache Hadoop you can select one of two options (plus a note on redistributing data):
1.- Prepare the datanode configuration (JDK, binaries, HADOOP_HOME env var, xml config files pointing to the master, adding the IP to the slaves file on the master, etc.) and execute the following command inside this new slave:
hadoop-daemon.sh start datanode
2.- Prepare the datanode just like in step 1 and restart the entire cluster.
3.- To redistribute the existing data you need to enable dfs.disk.balancer.enabled in hdfs-site.xml. This enables the HDFS Disk Balancer, and you need to configure a plan (see the sketch below).
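A rough sketch of that Disk Balancer workflow (these commands exist in Hadoop 3.x; the hostname is a placeholder; note that the Disk Balancer evens out the disks within a single DataNode, while spreading blocks across DataNodes is done with the hdfs balancer mentioned below):
# hdfs-site.xml must have dfs.disk.balancer.enabled = true, then:
# generate a plan for the DataNode (balances data across its local disks)
hdfs diskbalancer -plan new-datanode.example.com
# execute the plan file printed by the -plan step, then check progress
hdfs diskbalancer -execute /system/diskbalancer/<date>/new-datanode.example.com.plan.json
hdfs diskbalancer -query new-datanode.example.com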
You don't need to stop anything to add datanodes, and datanodes should register themselves with the Namenode on their own; I don't recall manually adding any information or needing to restart a namenode to detect datanodes (I typically use Ambari to provision new machines).
You will need to manually run the HDFS balancer in order to spread the data over to the new servers.
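For example (the threshold is just an illustrative value):
# move blocks until every DataNode's usage is within 5% of the cluster average
hdfs balancer -threshold 5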

Behavior of HDFS High Availability

How does distributed copy (distcp) work between two clusters when the NameNode (NN) fails in a High Availability (HA) configuration?
Will that job fail due to the different IP addresses of the name node and the standby node?
Depending on the configuration of your HDFS HA and whether Automatic Failover is implemented, it might work (I personally haven't tested the specific command during a failover).
Another important part is that you are using names for the services and DNS is properly set up and configured for all involved nodes (you should never use direct IP addresses).
Yashwanth,
In an HA Hadoop cluster, it is not recommended to use the active NameNode's address in distcp commands. A simple answer to your question is yes, the job will fail if you hardcode a NameNode IP or DNS name in the distcp command. In an HA Hadoop cluster you need to use the cluster's nameservice name instead of an IP in the distcp command.
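A hedged example of what that looks like ('clusterA' and 'clusterB' are placeholder nameservice names; both must be defined in the hdfs-site.xml of the node running the command):
hadoop distcp hdfs://clusterA/user/data/input hdfs://clusterB/user/data/input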

Understanding mapreduce.framework.name wrt Hadoop

I am learning Hadoop and came to know that there are two versions of the framework, viz. Hadoop1 and Hadoop2.
If my understanding is correct, in Hadoop1 the execution environment is based on two daemons, viz. TaskTracker and JobTracker, whereas in Hadoop2 (aka YARN) the execution environment is based on "new daemons", viz. ResourceManager, NodeManager and ApplicationMaster.
Please correct me if this is wrong.
I came to know of the following configuration parameter:
mapreduce.framework.name
possible values it can take: local, classic, yarn
I don't understand what they actually mean; for example, if I install Hadoop 2, how can it have the old execution environment (which has TaskTracker and JobTracker)?
Can anyone help me understand what these values mean?
yarn stands for MR version 2.
classic is for MR version 1.
local is for local runs of MR jobs.
MR V1 and MR V2 differ only in how resources are managed and how a job is executed. The current Hadoop release is capable of both (and even of a local lightweight mode). When you set the value to yarn, you are simply instructing the framework to use the YARN way of executing the job. Similarly, when you set it to local, you are just telling the framework that there is no cluster for execution and everything runs within a single JVM. It is not a different infrastructure for the MR V1 and MR V2 frameworks; it is just the way of job execution that changes.
JobTracker, TaskTracker etc. are all just daemon processes, which are spawned when needed and killed.
MRv1 uses the JobTracker to create and assign tasks to data nodes. This was found to be too inefficient when dealing with large clusters, which led to YARN.
MRv2 (aka YARN, "Yet Another Resource Negotiator") has a ResourceManager for each cluster, and each data node runs a NodeManager. For each job, one slave node acts as the Application Master, monitoring resources/tasks, etc.
Local mode is provided to simulate and debug MR applications within a single machine/JVM.
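To make that concrete, this is the usual mapred-site.xml entry for the cluster case (a sketch; swap the value for local or classic as described above):
<!-- mapred-site.xml: which runtime executes MapReduce jobs -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>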
EDIT: Based on comments
jps (Java Virtual Machine Process Status) is a JVM tool which, according to the official page:
The jps tool lists the instrumented HotSpot Java Virtual Machines
(JVMs) on the target system. The tool is limited to reporting
information on JVMs for which it has the access permissions.
So,
jps is not a big data tool, but rather a Java tool that reports on JVMs; it does not divulge any information about the processes running within a JVM.
It only lists the JVMs it has access to, which means there may still be certain JVMs that remain undetected.
Keeping the above points in mind, you will observe that the jps command emits different results depending on the Hadoop deployment mode:
Local (or Standalone) mode: There are no daemons and everything runs in a single JVM.
Pseudo-Distributed mode: Each daemon (Namenode, Datanode, etc.) runs in its own JVM on a single host.
Distributed mode: Each daemon runs in its own JVM across a cluster of hosts.
Hence the processes may or may not run in the same JVM, and the jps output differs accordingly.
Now in distributed mode, the MR v2 framework runs in its default mode, i.e. yarn; hence you see the YARN-specific daemons running:
Namenode
Datanode
ResourceManager
NodeManager
Apache Hadoop 1.x (MRv1) consists of the following daemons:
Namenode
Datanode
Jobtracker
Tasktracker
Note that NameNode and DataNode are common to both, because they are HDFS-specific daemons, while the other two are MR v1 and YARN specific respectively.
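As an illustration (not from the source; PIDs are made up), jps on a pseudo-distributed Hadoop 2.x node running YARN might print something like:
$ jps
4821 NameNode
4975 DataNode
5210 ResourceManager
5342 NodeManager
5530 SecondaryNameNode
6012 Jps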

What data is available on the client machine

I know that the client machine consults the name node when storing the data it contains.
Also, the client machine will have Hadoop installed on it with the cluster settings.
What cluster settings are present?
Whenever an HDFS command is invoked, the Client has to send a request to the Namenode, and to do so the fs.defaultFS property is required. Similarly, when submitting a YARN job, it needs yarn.resourcemanager.address to connect to the ResourceManager.
File-level HDFS properties like dfs.blocksize and dfs.replication are determined at the Client node. If they need to be changed from their defaults, add the respective properties on the Client node.
Normally, the same set of configuration properties (*-site.xml files) defined on the nodes of the Cluster would be defined on the Client node as well. Having uniform cluster settings across all nodes of the Cluster, including the Client nodes, is considered best practice.
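A rough sketch of those client-side properties (hostnames and values are placeholders):
<!-- core-site.xml: where the client finds HDFS -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>
<!-- yarn-site.xml: where the client submits YARN jobs -->
<property>
  <name>yarn.resourcemanager.address</name>
  <value>resourcemanager.example.com:8032</value>
</property>
<!-- hdfs-site.xml: client-side file-level defaults -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>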

Resource manager configuration in QJM HA setup of Hadoop 2.6.0

We have configured Hadoop for high availability so that we can achieve automatic failover using the Quorum Journal Manager. It is working fine as expected.
But we are not sure how to configure the resource manager in version 2.6.0.
The resource manager is needed for running MapReduce programs. We need the configuration steps for resource manager failover setup in Hadoop 2.6.0 between the name nodes.
I don't know about MapReduce. If you have multiple ResourceManagers (one active at a time) you need to set their logical names:
http://gethue.com/hadoop-tutorial-yarn-resource-manager-high-availability-ha-in-mr2/
However, I am not sure whether the logical names for rm1, rm2 (...) will be the same. Can anybody confirm this?
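For reference, a minimal yarn-site.xml sketch for ResourceManager HA as documented for Hadoop 2.x (the cluster id, rm ids, hostnames and ZooKeeper quorum are placeholders):
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>yarn-cluster</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>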
It can be achieved in Oozie by using "logicaljt" as the job-tracker value in your workflow.
Source: https://issues.cloudera.org/browse/HUE-1631
