Resource manager configuration in QJM HA setup of Hadoop 2.6.0 - hadoop

We have configured Hadoop as high availability so that we can achieve automatic failover using Quorum Journal Manager. It is working fine as expected.
But we are not sure how to configure resource manager in 2.6.0 version.
Resource manager is needed for running map reduce programs. We need the configuration steps for resource manager failover setup in Hadoop 2.6.0 between the name nodes.

I don't know about MapReduce. If you have multiple Resource Manager (one Active at a time) you need to set its logical names
http://gethue.com/hadoop-tutorial-yarn-resource-manager-high-availability-ha-in-mr2/
However I am not sure if the logical names for rm1,rm2(...) will be the same. If anybody can confirm this?
It can be achieved in oozie by using "logicaljt" as job-tracker value in your workflow
Source: https://issues.cloudera.org/browse/HUE-1631

Related

Does hadoop itself contains fault-tolerance failover functionality?

I just installed new version of hadoop2, I wish to know if I config a hadoop cluster and it's brought up, how can I know if data transmission is failed, and there's a need for failover?
Do I have to install other components like zookeeper to track/enable any HA events?
Thanks!
High Availability is not enabled by default. I would highly encourage you to read the Hadoop documentation from Apache. (http://hadoop.apache.org/) It will give an overview of the architecture and services that run on a Hadoop cluster.
Zookeeper is required for many Hadoop services to coordinate their actions across the entire Hadoop cluster, regardless of the cluster being HA or not. More information can be found in the Apache Zookeeper documentation (http://zookeeper.apache.org/).

Make spark environment for cluster

I made a spark application that analyze file data. Since input file data size could be big, It's not enough to run my application as standalone. With one more physical machine, how should I make architecture for it?
I'm considering using mesos for cluster manager but pretty noobie at hdfs. Is there any way to make it without hdfs (for sharing file data)?
Spark maintain couple cluster modes. Yarn, Mesos and Standalone. You may start with the Standalone mode which means you work on your cluster file-system.
If you are running on Amazon EC2, you may refer to the following article in order to use Spark built-in scripts that loads Spark cluster automatically.
If you are running on an on-prem environment, the way to run in Standalone mode is as follows:
-Start a standalone master
./sbin/start-master.sh
-The master will print out a spark://HOST:PORT URL for itself. For each worker (machine) on your cluster use the URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
-In order to validate that the worker was added to the cluster, you may refer to the following URL: http://localhost:8080 on your master machine and get Spark UI that shows more info about the cluster and its workers.
There are many more parameters to play with. For more info, please refer to this documentation
Hope I have managed to help! :)

Switching from Capacity Scheduler to Fair Scheduler on Hortonworks Data Platform

My organization is currently using Hortonworks HDP to manage our Hadoop Cluster. The default YARN scheduler is the Capacity Scheduler. I would like to switch to a Fair Scheduler. I am completely new to HDP.
In the absence of a cluster management suite, this would be done by editing the yarn-site.xml and changing the yarn.resourcemanager.scheduler.class property to org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
and creating an extra fair-scheduler.xml file to specify the queue configurations as mentioned here and then referring YARN to that configuration by setting the yarn.scheduler.fair.allocation.file property in yarn-site.xml.
Now in Ambari, while it is possible to change the yarn.resourcemanager.scheduler.class property via the UI, and add a new custom property yarn.scheduler.fair.allocation.file, I cannot (for the love of god) find a way to have ambari read fair-scheduler.xml instead of capacity-scheduler.xml.
So my question is; How do I go about switching to the fair scheduler via Ambari? There's got to be an easy way, right?
Properties in capacity-scheduler.xml
On your RM node, set yarn.scheduler.fair.allocation.file with the full path of your fair-scheduler.xml (or in custom yarn-site under ambari)
tail -n 1000 /var/log/hadoop-yarn/yarn/hadoop-yarn-resourcemanager-master.log | grep "fair-scheduler.xml"
After a restart of your ResourceManager, you should see that it is loading your file:
2019-02-19 15:49:26,358 INFO fair.AllocationFileLoaderService (AllocationFileLoaderService.java:reloadAllocations(230)) - Loading allocation file file:/usr/hdp/current/hadoop-client/conf/fair-scheduler.xml
Works on Hdp 3.1.1, and probably 3.0.0 too

Running spark cluster on standalone mode vs Yarn/Mesos

Currently I am running my spark cluster as standalone mode. I am reading data from flat files or Cassandra(depending upon the job) and writing back the processed data to the Cassandra itself.
I was wondering if I switch to Hadoop and start using a Resource manager like YARN or mesos, does it give me an additional performance advantage like execution time and better resource management?
Currently sometime when I am processing huge chunk of data during shuffling with a possibility of stage failure. If I migrate to a YARN, can Resource manager address this issue?
Spark standalone cluster manager can also give you cluster mode capabilities.
Spark standalone cluster will provide almost all the same features as the other cluster managers if you are only running Spark.
When you submit your application in cluster mode all you job related files would be copied on to one of the machines on the cluster which would then submit the job on your behalf, if you submit the application in client mode the machine from which the job is being submitted would be taking care of driver related activities. This means that the machine from which the job has been submitted cannot go offline, whereas in cluster mode the machine from which the job has been submitted can go offline.
Having a Cassandra cluster would also not change any of these behaviors except it can save you network traffic if you can get the nearest contact point for the spark executor(Just like Data locality).
The failed stages gets rescheduled if you use either of the cluster managers.
I was wondering if I switch to Hadoop and start using a Resource manager like YARN or mesos, does it give me an additional performance advantage like execution time and better resource management?
In Standalone cluster model, each application uses all the available nodes in the cluster.
From spark-standalone documentation page:
The standalone cluster mode currently only supports a simple FIFO scheduler across applications. However, to allow multiple concurrent users, you can control the maximum number of resources each application will use. By default, it will acquire all cores in the cluster, which only makes sense if you just run one application at a time.
In other cases (when you are running multiple applications in the cluster) , you can prefer YARN.
Currently sometime when I am processing huge chunk of data during shuffling with a possibility of stage failure. If I migrate to a YARN, can Resource manager address this issue?
Not sure since your application logic is not known. But you can give a try with YARN.
Have a look at related SE question for benefits of YARN over Standalone and Mesos:
Which cluster type should I choose for Spark?

Hadoop on Amazon Cloud

I'm trying to get set up on the Amazon Cloud to run some hadoop MapReduce jobs but I'm struggling to successfully create a cluster. I have downloaded the ec2 files, have my certificates and keypair file, but I believe it's the AMIs that are causing me trouble. If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the code "hadoop-ec2 launch-cluster name n" in the terminal. The master node is successful, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)" and I'm not entirely sure how to progress.
Also, some of my jobs will require an alteration in hadoops parameter settings (specifically the mapred-site.xml config file), is it possible to alter this file, and if so, how do I gain access to it? Is hadoop already installed on amazon machines, with this file accessible and alterable?
Thanks
Have you tried Amazon Elastic MapReduce? This is a simple API that brings up Hadoop clusters of a specified size on demand.
That's easier then to create own cluster manually.
But once the jobflow is finished by default it shuts the cluster down, leaving you with outputs on S3. If what you need is simply to do some crunching, this may be the way to go.
In case you need HDFS contents stored permanently (e.g. if you are running HBase on top of Hadoop) you may actually need own cluster on EC2. In this case you may find Cloudera's distribution of Hadoop for Amazon EC2 useful.
Altering Hadoop configuration on nodes it will start is possible using EC2 Bootstrap Actions:
Q: How do I configure Hadoop settings for my job flow?
The Elastic MapReduce default Hadoop configuration is appropriate for most workloads. However, based on your job flow’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your job flow on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.
About the way you are starting the cluster, please clarify:
If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the code "hadoop-ec2 launch-cluster name n" in the terminal. The master node is successful, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)" and I'm not entirely sure how to progress.
How exactly you are trying start it? What exactly AMIs are you using?

Resources