Request specific machines from YARN

I want to know whether I can request specific nodes from the YARN ResourceManager when running a MapReduce job.
In more detail, let's say there is a YARN cluster deployed with the following nodes: nodeA, nodeB, nodeC.
Can I submit a MR job that will run only on nodeB and nodeC?

No. As of current versions of CDH and YARN, there is no property that lets you dynamically choose the nodes on which your jobs run; node placement is handled by the ResourceManager alone.

Related

When do YARN and NameNode interact

When a job is submitted, when do YARN and the NameNode interact? When a job is submitted, who does it get sent to? Could someone explain the end-to-end flow of how the Hadoop ecosystem works?
Thanks!
NameNode: stores the metadata for all the data kept on the DataNodes and monitors the health of the DataNodes. Basically, it is a master-slave architecture.
YARN: stands for Yet Another Resource Negotiator. YARN has two main components:
1. Scheduler
2. Application Manager
YARN also has a master, the ResourceManager, and slaves, the NodeManagers.
For scheduling, there are three schedulers:
1. FIFO 2. Capacity 3. Fair
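The active scheduler is selected in yarn-site.xml. A minimal sketch, assuming the CapacityScheduler (the FairScheduler class name is shown in the comment as an alternative):

<!-- yarn-site.xml: choose the scheduler the ResourceManager uses -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  <!-- For fair scheduling use:
       org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler -->
</property>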
There is a component called the ApplicationMaster, which the ResourceManager assigns to run in a container managed by a NodeManager.
One ApplicationMaster is assigned per application.
The job is submitted directly by the client, the ResourceManager hands the job to the ApplicationMaster, and the NodeManager monitors the liveness of the ApplicationMaster.
Now, whenever a job comes in, the ResourceManager creates a job id and assigns an ApplicationMaster for that job. The ResourceManager contacts the NameNode to retrieve the metadata of the data the tasks have to run against, and the information received by the ResourceManager is then passed to the ApplicationMaster.
This is a basic overview of how YARN works with the NameNode. You can also read about it in detail in the YARN documentation.
Also, the NameNode interaction happens only in the Hadoop applications running within YARN that talk to the NameNode; not all YARN applications need to communicate with HDFS.
Basically, there is no direct interaction between YARN and HDFS; see https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
However, YARN jobs require some files (libraries, configuration, etc.), which usually reside on HDFS.
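For example, a MapReduce job stages its jar, configuration, and split files on HDFS before the ApplicationMaster starts. A quick way to see this, assuming the default staging directory (the path is controlled by yarn.app.mapreduce.am.staging-dir and varies by distribution):

# List the staging files for the current user's submitted jobs (assumed default path)
hdfs dfs -ls /tmp/hadoop-yarn/staging/$USER/.staging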

Yarn UI shows no information about applications

I know that a similar question was asked at Applications not shown in yarn UI when running mapreduce hadoop job?,
but the answers did not solve my problem.
I am running Hadoop streaming on Linux 17.01. I set up a cluster with 3 nodes and 1 master node.
When I start Hadoop, I can access localhost:50070 to see other nodes (all nodes are alive).
However, I see no information under "Applications" on localhost:8088,
nor from the command "yarn application -list -appStates ALL".
Here is my configuration.
My yarn-site.xml (for all nodes)
Here are all the processes on the master node:
The problem may be due to the YARN services running on IPv6. I followed this thread
https://askubuntu.com/questions/440649/how-to-disable-ipv6-in-ubuntu-14-04
to switch all the YARN services to IPv4. However, there are still no tasks displayed in the YARN UI, even though I can see all the nodes in my cluster marked as "active" in the YARN UI.
So, I do not know why this happened. Do you have any suggestion?
Thank you very much.
I haven't typically seen YARN being configured for IPv4, but this property can be added to hadoop-env.sh:
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true"
You can presumably add a similar variable to yarn-env.sh for YARN_OPTS as well.
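A minimal sketch of both files, assuming the stock hadoop-env.sh and yarn-env.sh in your configuration directory (restart the daemons after editing):

# $HADOOP_CONF_DIR/hadoop-env.sh
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

# $HADOOP_CONF_DIR/yarn-env.sh
export YARN_OPTS="$YARN_OPTS -Djava.net.preferIPv4Stack=true"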
However, it's not really clear from your question when, or whether, you've even submitted an application for anything to appear.
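If nothing has been submitted, the Applications page will stay empty. One way to check, assuming the bundled examples jar is available (its exact path and version depend on your distribution), is to run a small job and then list applications again:

# Submit a small test job, then ask the ResourceManager for the application list
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 10
yarn application -list -appStates ALL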

Is there a way to restrict the application master to launch on particular nodes?

I have a cluster set up with nodes that are not reliable and can go down (they are AWS spot instances). I am trying to make sure that my application master only launches on the reliable nodes (AWS on-demand instances) of the cluster. Is there a workaround for this? My cluster is managed by Hortonworks Ambari.
This can be achieved by using node labels. I was able to use the Spark property spark.yarn.am.nodeLabelExpression to restrict my application master to a set of nodes while running Spark on YARN. Add the node labels to whichever nodes you want to use for application masters.
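A minimal sketch, assuming node labels are already enabled in yarn-site.xml (yarn.node-labels.enabled=true); the label name, hostname, and example application are placeholders:

# Create a label and attach it to the reliable (on-demand) nodes
yarn rmadmin -addToClusterNodeLabels "ondemand(exclusive=false)"
yarn rmadmin -replaceLabelsOnNode "reliable-host-1.example.com=ondemand"

# Ask YARN to place the Spark application master only on nodes carrying that label
spark-submit --master yarn \
  --conf spark.yarn.am.nodeLabelExpression=ondemand \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100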

Hadoop on Mesos uses only one node?

I have successfully set up a Mesos 0.22.1 cluster on 5 nodes. I can run Marathon and Chronos tasks on all slave nodes. Now I'm trying to run Hadoop jobs using the Mesos Scheduler. I followed a very good tutorial and could run the wordcount test job. But when I try to run some larger job (loading data from Kafka to HDFS using Camus), the job runs without errors but uses only one node with one TaskTracker, even though it has 30 map tasks in total and my nodes are configured to run 2 map tasks in parallel.
What am I missing? Shouldn't the JobTracker split the task to run in parallel on all available nodes, using 2 map slots on each node?
And what is strange: on the JobTracker web page, the cluster summary reports only 1 available node. Is this correct behavior?
Any ideas are greatly appreciated!

Hadoop on Amazon Cloud

I'm trying to get set up on the Amazon Cloud to run some hadoop MapReduce jobs but I'm struggling to successfully create a cluster. I have downloaded the ec2 files, have my certificates and keypair file, but I believe it's the AMIs that are causing me trouble. If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the code "hadoop-ec2 launch-cluster name n" in the terminal. The master node is successful, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)" and I'm not entirely sure how to progress.
Also, some of my jobs will require altering Hadoop's parameter settings (specifically the mapred-site.xml config file). Is it possible to alter this file, and if so, how do I gain access to it? Is Hadoop already installed on the Amazon machines, with this file accessible and alterable?
Thanks
Have you tried Amazon Elastic MapReduce? This is a simple API that brings up Hadoop clusters of a specified size on demand.
That's easier than creating your own cluster manually.
But once the job flow is finished, by default it shuts the cluster down, leaving you with the outputs on S3. If what you need is simply to do some crunching, this may be the way to go.
In case you need the HDFS contents stored permanently (e.g. if you are running HBase on top of Hadoop), you may actually need your own cluster on EC2. In this case you may find Cloudera's distribution of Hadoop for Amazon EC2 useful.
Altering the Hadoop configuration on the nodes it starts is possible using Elastic MapReduce Bootstrap Actions:
Q: How do I configure Hadoop settings for my job flow?
The Elastic MapReduce default Hadoop configuration is appropriate for most workloads. However, based on your job flow’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your job flow on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.
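A minimal sketch using the old Elastic MapReduce command-line client and the predefined Configure Hadoop bootstrap action; the cluster name, size, instance type, and the particular mapred-site.xml key/value are illustrative:

# Launch a job flow whose mapred-site.xml is customised at startup via the
# configure-hadoop bootstrap action (-m targets mapred-site.xml settings)
elastic-mapreduce --create --alive --name "custom-config-cluster" \
  --num-instances 4 --instance-type m1.large \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapred.tasktracker.map.tasks.maximum=2"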
About the way you are starting the cluster, please clarify:
If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the code "hadoop-ec2 launch-cluster name n" in the terminal. The master node is successful, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)" and I'm not entirely sure how to progress.
How exactly are you trying to start it? Exactly which AMIs are you using?
