Hadoop cluster is not running MapReduce jobs - issue with scheduler

(This is a follow-up to a discussion of an earlier question I asked on this matter.)
I set up a small Hadoop cluster following these instructions, but using Hadoop version 2.7.4. The cluster seems to work OK, but I cannot run MapReduce jobs. In particular, when trying the following
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar randomwriter out
the job prints
17/11/27 16:35:21 INFO client.RMProxy: Connecting to ResourceManager at
Running 0 maps.
Job started: Mon Nov 27 16:35:22 UTC 2017
17/11/27 16:35:22 INFO client.RMProxy: Connecting to ResourceManager at
17/11/27 16:35:22 INFO mapreduce.JobSubmitter: number of splits:0
17/11/27 16:35:22 INFO mapreduce.JobSubmitter: Submitting tokens for
job: job_1511799491035_0006
17/11/27 16:35:22 INFO impl.YarnClientImpl: Submitted application
17/11/27 16:35:22 INFO mapreduce.Job: The url to track the job:
17/11/27 16:35:22 INFO mapreduce.Job: Running job:
and never gets past this state.
In the job tracker, it says
ACCEPTED: waiting for AM container to be allocated, launched and
register with RM.
I then looked into the log files where I found
2017-11-27 13:50:29,202 INFO org.apache.hadoop.conf.Configuration: found resource capacity-scheduler.xml at file:/usr/local/hadoop/etc/hadoop/capacity-scheduler.xml
2017-11-27 13:50:29,252 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration: max alloc mb per queue for root is undefined
2017-11-27 13:50:29,252 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration: max alloc vcore per queue for root is undefined
2017-11-27 13:50:29,256 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: root, capacity=1.0, asboluteCapacity=1.0, maxCapacity=1.0, asboluteMaxCapacity=1.0, state=RUNNING, acls=ADMINISTER_QUEUE:*SUBMIT_APP:*, labels=*, reservationsContinueLooking=true
2017-11-27 13:50:29,256 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Initialized parent-queue root name=root, fullname=root
2017-11-27 13:50:29,265 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration: max alloc mb per queue for root.default is undefined
2017-11-27 13:50:29,265 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration: max alloc vcore per queue for root.default is undefined
which suggests that there is a problem with the capacity scheduler. The file capacity-scheduler.xml looks as follows:
Maximum number of applications that can be pending and running.
Maximum percent of resources in the cluster which can be used to run
application masters i.e. controls number of concurrent running
The ResourceCalculator implementation to be used to compare
Resources in the scheduler.
The default i.e. DefaultResourceCalculator only uses Memory while
DominantResourceCalculator uses dominant-resource to compare
multi-dimensional resources such as Memory, CPU etc.
The queues at the this level (root is the root queue).
<description>Default queue target capacity.</description>
Default queue user limit a percentage from 0.0 to 1.0.
The maximum capacity of the default queue.
The state of the default queue. State can be one of RUNNING or STOPPED.
The ACL of who can submit jobs to the default queue.
The ACL of who can administer jobs on the default queue.
Number of missed scheduling opportunities after which the CapacityScheduler
attempts to schedule rack-local containers.
Typically this should be set to number of nodes in the cluster, By default is setting
approximately number of nodes in one rack which is 40.
A list of mappings that will be used to assign jobs to queues
The syntax for this list is [u|g]:[name]:[queue_name][,next mapping]*
Typically this list will be used to map users to queues,
for example, u:%user:%user maps all users to queues with the same name
as the user.
If a queue mapping is present, will it override the value specified
by the user? This can be used by administrators to place jobs in queues
that are different than the one specified by the user.
The default is false.
Everything is fine with the cluster configuration, but when it comes to job execution, the RAM provided by a t2.micro instance is not enough to run MapReduce jobs. It is better to use bigger instances for cluster creation and job execution.


Run HDFS pseudo mode in a docker container

I'm trying to run HDFS in pseudo-distributed mode in a Docker container, configured following this page: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation. I didn't use the start-all.sh script, as the container isn't supposed to be able to do ssh, so I manually ran the command bin/hdfs --daemon start namenode|datanode to start them one by one. The problem is that the namenode starts successfully, but the datanode quits without any error message. The last piece of the datanode log is:
2018-04-09 21:04:03,830 INFO org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: Scheduling a check for [DISK]file:/apps/hadoop/hdfs/data
2018-04-09 21:04:04,188 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2018-04-09 21:04:04,296 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2018-04-09 21:04:04,296 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system started
2018-04-09 21:04:04,665 INFO org.apache.hadoop.hdfs.server.common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
2018-04-09 21:04:04,667 INFO org.apache.hadoop.hdfs.server.datanode.BlockScanner: Initialized block scanner with targetBytesPerSec 1048576
2018-04-09 21:04:04,671 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Configured hostname is hdfs
2018-04-09 21:04:04,671 INFO org.apache.hadoop.hdfs.server.common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
2018-04-09 21:04:04,677 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting DataNode with maxLockedMemory = 0
2018-04-09 21:04:04,733 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened streaming server at /
2018-04-09 21:04:04,735 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwidth is 10485760 bytes/s
2018-04-09 21:04:04,735 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Number threads for balancing is 50
core-site.xml file:
And hdfs-site.xml is
Did I miss anything there?
I think it is a base image issue. I was using Alpine; once I changed to CentOS, the datanode works. Something must be missing from Alpine. I would appreciate it if anyone knows what it is, as a CentOS-based image will eventually be much bigger than an Alpine-based one.

Resource Manager Has No Nodes

EDIT: I have looked at YARN Resourcemanager not connecting to nodemanager and the solution does not work for me. I have attached the section of the node-manager log where a connection to the resource manager is made:
[main] client.RMProxy (RMProxy.java:createRMProxy(98)) - Connecting to ResourceManager at /
2016-06-17 19:01:04,697 INFO [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:getNMContainerStatuses(429)) - Sending out 0 NM container statuses: []
2016-06-17 19:01:04,701 INFO [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(268)) - Registering with RM using containers :[]
2016-06-17 19:01:05,815 INFO [main] ipc.Client (Client.java:handleConnectionFailure(867)) - Retrying connect to server: Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-17 19:01:06,816 INFO [main] ipc.Client (Client.java:handleConnectionFailure(867)) - Retrying connect to server: Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
For some reason it says it is connecting to that address. When I ssh into one of the data nodes and ping resource-manager, I get a response, so it is able to resolve the hostname.
This leads me to believe that an option is incorrect in my yarn-site.xml, as my nodes are trying to connect to that address instead of resource-manager:8031.
I am running a Cloudera Hadoop cluster on Docker and am having issues with the YARN resource manager being able to see the other nodes. The way it is set up is as follows:
Node 1 - Namenode (hadoop-hdfs-namenode)
Node 2 - Secondary Namenode (hadoop-hdfs-secondarynamenode)
Node 3 - Yarn Resource-Manager (hadoop-yarn-resourcemanager)
Node 4 - datanode and node manager (hadoop-hdfs-datanode, hadoop-yarn-nodemanager)
Node 5 - datanode and node manager (hadoop-hdfs-datanode, hadoop-yarn-nodemanager)
When I go to namenode:50070 I am able to see both nodes. However, when I go to the resource-manager:8088 it shows I have zero nodes. My yarn-site.xml file which is on every node is as follows:
<description>Classpath for typical applications.</description>
<description>Where to aggregate logs</description>
Number of seconds after an application finishes before the nodemanager's
DeletionService will delete the application's localized file directory
and log directory.
To diagnose Yarn application problems, set this property's value large
enough (for example, to 600 = 10 minutes) to permit examination of these
directories. After changing the property's value, you must restart the
nodemanager in order for it to have an effect.
The roots of Yarn applications' work directories is configurable with
the yarn.nodemanager.local-dirs property (see below), and the roots
of the Yarn applications' log directories is configurable with the
yarn.nodemanager.log-dirs property (see also below).
Does anyone have any ideas as to why this is the case?
As indicated in the edit, it appeared that yarn-site.xml was not being picked up and only the defaults were in effect. I solved this by copying the yarn-site.xml file into every directory on the machine as user root. I then ran the node-manager so that it would fail to read the file, since it does not run as user root. The log directed me to where it expected the file, which was in a YARN-specific directory instead of the general Hadoop directory.
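To see which configuration directories actually contain a yarn-site.xml, and what value each copy gives a property, something like the following can help. It is only a sketch: the helper names read_property and find_configs are made up for illustration, the demo writes a throwaway file instead of touching a real installation, and the list of directories to check on a real node is something you have to supply yourself.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def read_property(path, name):
    """Return the <value> of the property called `name` in a Hadoop-style XML file, or None."""
    root = ET.parse(path).getroot()
    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def find_configs(conf_dirs, filename="yarn-site.xml"):
    """Return the paths of `filename` that exist in the given directories, in order."""
    paths = [os.path.join(d, filename) for d in conf_dirs]
    return [p for p in paths if os.path.isfile(p)]


# Demo: a temporary directory stands in for a real configuration directory.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "yarn-site.xml"), "w") as fh:
    fh.write(
        "<configuration><property>"
        "<name>yarn.resourcemanager.hostname</name>"
        "<value>resource-manager</value>"
        "</property></configuration>"
    )

# The second directory does not exist and is skipped.
found = find_configs([demo_dir, os.path.join(demo_dir, "missing")])
for path in found:
    print(path, "->", read_property(path, "yarn.resourcemanager.hostname"))
```

If a directory the daemon reads from is missing from the result, or its copy returns None for the property, the daemon will fall back to the defaults, which matches the symptom described above.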

How to optimize and tune hadoop cluster performance

I am not very familiar with Hadoop cluster configuration. I have recently integrated Apache Nutch with Apache Hadoop, and I have successfully crawled data and indexed it in Solr.
My master and slave resources are as follows:
Master :
CPU : 4 cores
memory : 12G
hard disk : 37G
Slave1 :
CPU : 2 cores
memory : 4G
hard disk : 18G
Slave2 :
CPU : 2 cores
memory : 4G
hard disk : 16G
Slave3 :
CPU : 2 cores
memory : 4G
hard disk : 16G
Slave4 :
CPU : 4 cores
memory : 4G
hard disk : 50G
I have configured core-site.xml, mapred-site.xml, hdfs-site.xml, masters and slaves.
Here is my core-site.xml :
<value>/usr/local/My Project Name/hadoop-datastore</value>
<description>store data</description>
<description>the name of default file system</description>
Here is my mapred-site.xml :
<description>host and port</description>
And here is my hdfs-site.xml:
<description>default block</description>
And here is my conf/masters :
And finally my conf/slaves:
So far this goes well: when I start the master and run the jps command, I see the following on the master:
19031 TaskTracker
18644 DataNode
18764 SecondaryNameNode
18884 JobTracker
13226 Jps
18506 NameNode
And when I run the jps command on each of the slaves, I see the following:
4969 DataNode
5057 TaskTracker
5592 Jps
When I look at the Hadoop Map/Reduce administration page on the master, I see the following Cluster Summary:
Cluster Summary (Heap Size is 114.5 MB/889 MB)
Running Map Tasks | Running Reduce Tasks | Total Submissions | Nodes | Occupied Map Slots | Occupied Reduce Slots | Reserved Map Slots | Reserved Reduce Slots | Map Task Capacity | Reduce Task Capacity | Avg. Tasks/Node | Blacklisted Nodes | Graylisted Nodes | Excluded Nodes
The problem is that this procedure works fine with topN: 1000, but the master is under load, with high CPU and memory usage, whereas when I run top on the slaves, neither CPU nor memory is loaded: CPU and memory usage are both low and CPU idle is high.
I wonder whether this is normal. I am looking for solutions and configuration changes that would spread the load across all slaves and make the procedure faster.
Any links, documentation and solutions are very much appreciated.
Your master node is running a lot of services:
TaskTracker, DataNode, SecondaryNameNode, JobTracker, NameNode
Typically, in a decent-sized cluster, the master would not run the DataNode service.
The NameNode and the SecondaryNameNode should be on different nodes. You can put the SecondaryNameNode on one of your data nodes.
Similarly for the TaskTracker: the master typically does not run one, i.e. you do not run MR tasks on the master.
On the other hand, for pure experimentation the setup you have is OK, and the CPU usage you are noticing on the master is to be expected.
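The point about the master can be checked mechanically from the jps output posted in the question. This is a minimal sketch, not a real tool: the function name daemons and the choice of which daemons count as "worker" daemons are my own assumptions.

```python
def daemons(jps_output):
    """Return the set of process names from `jps` output, ignoring Jps itself."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit() and parts[1] != "Jps":
            names.add(parts[1])
    return names


# Assumption: these are the daemons that do the per-node storage and task work.
WORKER_DAEMONS = {"DataNode", "TaskTracker"}

# jps output of the master, copied from the question.
master_jps = """19031 TaskTracker
18644 DataNode
18764 SecondaryNameNode
18884 JobTracker
13226 Jps
18506 NameNode"""

on_master = daemons(master_jps)

# Worker daemons that are co-located with the master daemons.
print(sorted(on_master & WORKER_DAEMONS))
# True when NameNode and SecondaryNameNode share the same machine.
print({"NameNode", "SecondaryNameNode"} <= on_master)
```

For the posted output this reports both DataNode and TaskTracker on the master, and that the NameNode and SecondaryNameNode share a machine.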
I found an error about version 1.2.1 while looking closely at the logs directory, saying that this version is a 1.2.1 snapshot version. So I changed the server, simply installing version 1.2.1 and making the master and all slaves run the same version. That fixed my problem. Now, happily, I have five nodes, equal to the number of my machines.

Hadoop YARN job is getting stuck at map 0% and reduce 0%

I am trying to run a very simple job to test my Hadoop setup, so I tried the Word Count example, which gets stuck at 0%. I then tried some other simple jobs, and each one of them gets stuck as well:
14/07/14 23:55:51 INFO mapreduce.Job: Running job: job_1405376352191_0003
14/07/14 23:55:57 INFO mapreduce.Job: Job job_1405376352191_0003 running in uber mode : false
14/07/14 23:55:57 INFO mapreduce.Job: map 0% reduce 0%
I am using Hadoop 2.3.0-cdh5.0.2.
I did quick research on Google and found to increase
I have a single-node cluster running on my MacBook, with a dual-core CPU and 8 GB of RAM.
My yarn-site.xml file:
<!-- Site specific YARN configuration properties -->
<description>Classpath for typical applications.</description>
<description>Where to aggregate logs</description>
<description>shuffle service that needs to be set for Map Reduce to run </description>
<description>Execution framework.</description>
<description>The number of virtual cores required for each map task.</description>
<description>Larger resource limit for maps.</description>
<description>Heap-size for child jvms of maps.</description>
<description>Minimum limit of memory to allocate to each container request at the Resource Manager.</description>
<description>Maximum limit of memory to allocate to each container request at the Resource Manager.</description>
<description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this won't take effect, and the specified value will get allocated the minimum.</description>
<description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description>
<description>Physical memory, in MB, to be made available to running containers</description>
<description>Number of CPU cores that can be allocated for containers.</description>
<description>shuffle service that needs to be set for Map Reduce to run </description>
My mapred-site.xml has only one property.
I tried several permutations and combinations but couldn't get rid of the problem.
Log of the job:
23:55:55,694 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2014-07-14 23:55:55,697 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2014-07-14 23:55:55,699 INFO [main] org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at /
2014-07-14 23:55:55,769 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: maxContainerCapability: 8092
2014-07-14 23:55:55,769 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: queue: root.abhishekchoudhary
2014-07-14 23:55:55,775 INFO [main] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Upper limit on the thread pool size is 500
2014-07-14 23:55:55,777 INFO [main] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: yarn.client.max-nodemanagers-proxies : 500
2014-07-14 23:55:55,787 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1405376352191_0003Job Transitioned from INITED to SETUP
2014-07-14 23:55:55,789 INFO [CommitterEvent Processor #0] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: JOB_SETUP
2014-07-14 23:55:55,800 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1405376352191_0003Job Transitioned from SETUP to RUNNING
2014-07-14 23:55:55,823 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1405376352191_0003_m_000000 Task Transitioned from NEW to SCHEDULED
2014-07-14 23:55:55,824 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1405376352191_0003_m_000001 Task Transitioned from NEW to SCHEDULED
2014-07-14 23:55:55,824 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1405376352191_0003_m_000002 Task Transitioned from NEW to SCHEDULED
2014-07-14 23:55:55,825 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1405376352191_0003_m_000003 Task Transitioned from NEW to SCHEDULED
2014-07-14 23:55:55,826 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1405376352191_0003_m_000000_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2014-07-14 23:55:55,827 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1405376352191_0003_m_000001_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2014-07-14 23:55:55,827 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1405376352191_0003_m_000002_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2014-07-14 23:55:55,827 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1405376352191_0003_m_000003_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2014-07-14 23:55:55,828 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: mapResourceReqt:8092
2014-07-14 23:55:55,858 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Event Writer setup for JobId: job_1405376352191_0003, File: hdfs://localhost/tmp/hadoop-yarn/staging/abhishekchoudhary/.staging/job_1405376352191_0003/job_1405376352191_0003_1.jhist
2014-07-14 23:55:56,773 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:0 ScheduledMaps:4 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:0 ContRel:0 HostLocal:0 RackLocal:0
2014-07-14 23:55:56,799 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1405376352191_0003: ask=1 release= 0 newContainers=0 finishedContainers=0 resourcelimit=<memory:0, vCores:0> knownNMs=1
Based on the message Connecting to ResourceManager at /,
are you sure your ResourceManager is supposed to be at (the default)?
If not, you should add the following to your yarn-site.xml:
<value>MASTER ADDRESS</value>
Replace MASTER ADDRESS with the address of the master node. You can individually change the address of the resource manager's webapp, admin, etc.
Your settings appear to be incorrect.
The setting yarn.nodemanager.resource.memory-mb is set to 2 GB. This is the "amount of physical memory, in MB, that can be allocated for containers." But your mapreduce.map.memory.mb is 8 GB, so each mapper is really requesting 8 GB.
Additionally, you have set yarn.app.mapreduce.am.resource.mb to 8 GB. As such, you're trying to allocate an AM (which controls the job) at 8 GB plus several mappers at 8 GB each, on a node that offers only 2 GB to containers.
To solve the issue, you can drop the AM size to 1 GB and the mapper size to 0.5 GB, which is a more reasonable size for playing around, especially for word count.
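The arithmetic behind this can be written down as a small check. It is a sketch under assumptions, not YARN's actual allocation logic: it assumes the scheduler rounds every request up to a multiple of yarn.scheduler.minimum-allocation-mb (as far as I understand the default behaviour), it looks at a single NodeManager only, and the function names are made up.

```python
def round_up(request_mb, minimum_allocation_mb):
    """Round a request up to a multiple of the scheduler's minimum allocation."""
    steps = -(-request_mb // minimum_allocation_mb)  # ceiling division
    return max(steps, 1) * minimum_allocation_mb


def check_requests(node_memory_mb, requests_mb, minimum_allocation_mb=1024):
    """Return {name: True/False}: does each rounded request fit on one NodeManager?"""
    return {
        name: round_up(mb, minimum_allocation_mb) <= node_memory_mb
        for name, mb in requests_mb.items()
    }


def concurrent_maps(node_memory_mb, am_mb, map_mb, minimum_allocation_mb=1024):
    """Number of mappers that fit next to the AM on a single node."""
    am = round_up(am_mb, minimum_allocation_mb)
    mapper = round_up(map_mb, minimum_allocation_mb)
    return max(node_memory_mb - am, 0) // mapper


# Values from the answer: 2 GB for containers, 8 GB AM, 8 GB mappers.
current = check_requests(2048, {"am": 8192, "map": 8192})

# Suggested values: 1 GB AM, 0.5 GB mappers.
# The 512 MB minimum allocation is an assumption made for this example.
suggested = check_requests(2048, {"am": 1024, "map": 512}, minimum_allocation_mb=512)

print(current)
print(suggested)
print(concurrent_maps(2048, 1024, 512, minimum_allocation_mb=512))
```

With the original values neither the AM nor a mapper fits in the 2 GB the node offers. With the suggested values both fit, and there is room for two mappers next to the AM.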
Additional resources
You can refer to these instructions provided by Cloudera to understand these properties in more detail.
I don't know if you simply made a copy/paste error when creating this question but looking at your yarn-site.xml it starts with two <property> tags. I'm not sure if Hadoop's xml parser will actually apply those nested <property> tags.
I am using Apache Hadoop version 2.7.2, so it might be an "apples-to-oranges" comparison, but I ran into the same silent stuck state the other day. In most cases this "silence" for an extended period of time indicates that the scheduler is not able to allocate enough resources to the application.
In my specific case with a similar configuration, increasing the value for property yarn.nodemanager.resource.memory-mb in yarn-site.xml did the trick.
You can also check other properties for resource allocation here

YARN Hadoop 2.4.0: info message: ipc.Client Retrying connect to server

I've searched for two days for a solution, but nothing worked.
First, I'm new to the whole Hadoop/YARN/HDFS topic and want to configure a small cluster.
The message above doesn't show up every time I run an example from the mapreduce-examples.jar.
Sometimes teragen works, sometimes not.
In some cases the whole job fails, in others the job finishes successfully. Sometimes the job fails without printing the message above.
14/06/08 15:42:46 INFO ipc.Client: Retrying connect to server: FQDN-HOSTNAME/XXX.XX.XX.XXX:53022. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
This message is printed 30 times. Also, the port (53022 in the example above) changes every time a job is started.
If the job finishes successfully, this is printed:
14/06/08 15:34:20 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
14/06/08 15:34:20 INFO mapreduce.Job: Job job_1402234146062_0002 running in uber mode : false
14/06/08 15:34:20 INFO mapreduce.Job: map 100% reduce 100%
14/06/08 15:34:20 INFO mapreduce.Job: Job job_1402234146062_0002 completed successfully
If it fails, this is shown:
INFO mapreduce.Job: Job job_1402234146062_0005 failed with state FAILED due to: Task failed task_1402234146062_0005_m_000002
Job failed as tasks failed. failedMaps:1 failedReduces:0
In this case, some tasks failed, but the log files of the nodemanager, datanode, resourcemanager, ... contain no reason or message.
INFO mapreduce.Job: Task Id : attempt_1402234146062_0006_m_000002_1, Status : FAILED
Additional Information about my Configuration:
used OS: centOS 6.5
Java Version: OpenJDK Runtime Environment (rhel- u55-b13)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
<!-- Site specific YARN configuration properties -->
<name>dfs.permissions </name>
<value>false </value>
The job sometimes finishes successfully because, when you have one reducer and that reduce task happens to be sent to a working node manager, the job succeeds.
You have to make sure that FQDN-HOSTNAME is written exactly the same way in the slaves file. If I remember correctly, my solution was to remove the entry for the hostname mapping in /etc/hosts, that is, commenting it out like this:
This is a bug in how the MR AppMaster starts up with ephemeral ports. It exists in Hadoop 2.6.0 release version as well.
I have figured out a fix to this bug and created a JIRA on the MAPREDUCE project along with a comment on how to fix it.
Another possible solution is to check the firewall on all the nodes.
If you're dealing with iptables, you can run this on every node:
# /etc/init.d/iptables save
# /etc/init.d/iptables stop
That will stop the firewall until next restart, but it should be enough for you to test the cluster. You don't have to restart yarn or anything, just run the job again.
If you want to completely stop the FW:
# chkconfig iptables off
Definitely a bug, this post provides a clearer insight into what is happening.
We are planning on getting around this issue by reducing the ephemeral port range, thus limiting what ports are grabbed, and then configuring iptables to allow for that port range. Setting the port ranges is explained here -
If you see a message like
INFO ipc.Client: Retrying connect to server: <hostname>/<ip>:<port>. Already tried 1 time(s); maxRetries=3
you need to check:
the firewall between the client and the Node Manager
yarn.app.mapreduce.am.job.client.port-range; by default the range is all possible ports
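Both checks can be sketched in a few lines. The assumptions here: the property value is a comma-separated list of low-high ranges such as 50000-50050,50100-50200 (that is how I read the property's documentation), an empty value means all ports, and the function names are invented for this example. The port 53022 is the one from the log line in the question.

```python
import socket


def parse_port_ranges(spec):
    """Parse a value like '50000-50050,50100-50200' into a list of (low, high) tuples."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        low_text, _, high_text = part.partition("-")
        low = int(low_text)
        high = int(high_text) if high_text else low
        if not 0 < low <= high <= 65535:
            raise ValueError("invalid port range: %r" % part)
        ranges.append((low, high))
    return ranges


def port_allowed(port, ranges):
    """An empty list of ranges is treated as 'all ports allowed'."""
    return not ranges or any(low <= port <= high for low, high in ranges)


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


ranges = parse_port_ranges("50000-50050,50100-50200")
# The port from the question's log is outside this example range.
print(port_allowed(53022, ranges))
# With no range configured, any port may be chosen.
print(port_allowed(53022, parse_port_ranges("")))
```

Calling can_connect with the host and port taken from the "Retrying connect to server" message, once from the client and once from a node inside the cluster, shows whether a firewall sits in between: if it succeeds from the node but not from the client, the port is being blocked.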
Wow! Are these answers for real?? Talking about FQDN when the job clearly completes...as long as firewall is disabled?? And the OP even put the detailed log messages / configuration.
The problem is that yarn.app.mapreduce.am.job.client.port-range is not being honored. I'm running into it also.
Firewall off...all is well (and I can see the ephemeral ports from yarn job).
Firewall on...everything times out (eventually).
Horton completely ignores this question on other boards.
So here's log output from a job which demonstrates the problem. In the first case, I have the firewall enabled on the client(s) based on Horton's doc (along with other ports I discovered by looking very closely at my installation). You will see the process timing out...and then all of a sudden working, because I disabled the firewall after watching the job output :)
2015-01-15 16:48:22,943 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: de-luster-l2723nraqsy5-ywhniidze3lb-qfk4asn77vc5/ Already tried 39 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-01-15 16:48:23,349 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: /hadoop/yarn/local/usercache/l.admin/appcache/application_1420482341308_0020
2015-01-15 16:48:24,122 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2015-01-15 16:48:24,656 INFO [main] org.apache.hadoop.mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2015-01-15 16:48:24,724 INFO [main] org.apache.hadoop.mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle#7f94ee59
2015-01-15 16:48:24,792 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: MergerManager: memoryLimit=534354336, maxSingleShuffleLimit=133588584, mergeThreshold=352673888, ioSortFactor=100, memToMemMergeOutputsThreshold=100
Did ya see it?? Problem with timeout...then all of a sudden Shuffle commences. Nothing to do with FQDNs after all :)
