YARN Application Master unable to connect to Resource Manager - hadoop

I have a 4 node cluster (1 Namenode/Resource Manager 3 datanodes/node managers)
I am trying to run a simple tez example orderedWordCount
hadoop jar C:\HDP\tez-0.4.0.2.1.1.0-1621\tez-mapreduce-examples-0.4.0.2.1.1.0-1621.jar orderedwordcount sample/test.txt /sample/out
The job gets accepted ,the Application master and container gets setup but on the nodemanager I see these logs
2014-09-10 17:53:31,982 INFO
[ServiceThread:org.apache.tez.dag.app.rm.TaskSchedulerEventHandler]
org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager
at /0.0.0.0:8030
2014-09-10 17:53:34,060 INFO
[ServiceThread:org.apache.tez.dag.app.rm.TaskSchedulerEventHandler]
org.apache.hadoop.ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:8030. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
MILLISECONDS)
After configurable timeout the job fails
I searched for this problem and it always pointed to yarn.resourcemanager.scheduler.address configuration. In all my resource manager node and node managers I have this configuration defined correctly but for some reason its not getting picked up
<property>
<name>yarn.resourcemanager.hostname</name>
<value>10.234.225.69</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>${yarn.resourcemanager.hostname}:8032</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>${yarn.resourcemanager.hostname}:8088</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>${yarn.resourcemanager.hostname}:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>${yarn.resourcemanager.hostname}:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>${yarn.resourcemanager.hostname}:8033</value>
</property>

It might be possible that your ResourceManager is listening on an IPv6 Port while your worker nodes (i.e NodeManagers) might be using IPv4 to connect to the ResourceManager
To quickly check if this is the case, do a
netstat -aln | grep 8030
If you get something similar to :::8030, then your ResourceManager is indeed listening on an IPv6 Port. If its a IPv4 port, you should see something similar to 0.0.0.0:8030
To fix this, you might want to consider disabling IPv6 on all your machines and try once again.

There is a problem in the Hadoop2 code with configuring the yarn.resourcemanager.scheduler.address e.g.:
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>qadoop-nn001.apsalar.com:8030</value>
</property>
It is currently not properly placed into the 'conf' configuration at
hadoop-2.7.0/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java
To prove the issue, we patched that file to directly inject our scheduler address.
The patch below is a hack. The root cause is with the 'conf' object that needs to load the
property "yarn.resourcemanager.scheduler.address".
#Private
protected static <T> T createRMProxy(final Configuration configuration, final Class<T> protocol, RMProxy instance) throws IOException {
YarnConfiguration conf = (configuration instanceof YarnConfiguration)
? (YarnConfiguration) configuration
: new YarnConfiguration(configuration);
LOG.info("LEE: changing the conf to include yarn.resourcemanager.scheduler.address at 10.1.26.1");
conf.set("yarn.resourcemanager.scheduler.address", "10.1.26.1");
RetryPolicy retryPolicy = createRetryPolicy(conf);
if (HAUtil.isHAEnabled(conf)) {
RMFailoverProxyProvider<T> provider =
instance.createRMFailoverProxyProvider(conf, protocol);
return (T) RetryProxy.create(protocol, provider, retryPolicy);
} else {
InetSocketAddress rmAddress = instance.getRMAddress(conf, protocol);
LOG.info("LEE: Connecting to ResourceManager at " + rmAddress);
T proxy = RMProxy.<T>getProxy(conf, protocol, rmAddress);
return (T) RetryProxy.create(protocol, proxy, retryPolicy);
}
}
EDIT: we solved this problem by adding yarn-site.xml to the CLASSPATH.
there is no need to modify RMProxy.java

It is because your resource manager is not reachable. Try to ping you resource manager from other nodes and see if it works. Maintain these configs consistent across cluster.

Related

httpfs error Operation category READ is not supported in state standby

I am working on hadoop apache 2.7.1 and I have a cluster that consists of 3 nodes
nn1
nn2
dn1
nn1 is the dfs.default.name, so it is the master name node.
I have installed httpfs and started it of course after restarting all the services. When nn1 is active and nn2 is standby I can send this request
http://nn1:14000/webhdfs/v1/aloosh/oula.txt?op=open&user.name=root
from my browser and a dialog of open or save for this file appears, but when I kill the name node running on nn1 and start it again as normal then because of high availability nn1 becomes standby and nn2 becomes active.
So here httpfs should work, even if nn1 becomes stand by, but sending the same request now
http://nn1:14000/webhdfs/v1/aloosh/oula.txt?op=open&user.name=root
gives me the error
{"RemoteException":{"message":"Operation category READ is not supported in state standby","exception":"RemoteException","javaClassName":"org.apache.hadoop.ipc.RemoteException"}}
Shouldn't httpfs overcome nn1 standby status and bring the file? Is that because of a wrong configuration, or is there any other reason?
My core-site.xml is
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
It looks like HttpFs is not High Availability aware yet. This could be due to the missing configurations required for the Clients to connect with the current Active Namenode.
Ensure the fs.defaultFS property in core-site.xml is configured with the correct nameservice ID.
If you have the below in hdfs-site.xml
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
then in core-site.xml, it should be
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
Also configure the name of the Java class which will be used by the DFS Client to determine which NameNode is the currently Active and is serving client requests.
Add this property to hdfs-site.xml
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
Restart the Namenodes and HttpFs after adding the properties in all nodes.

Resource Manager Has No Nodes

EDIT: I have looked at YARN Resourcemanager not connecting to nodemanager and the solution does not work for me. I have attached the section of the node-manager log where a connection to the resource manager is made:
[main] client.RMProxy (RMProxy.java:createRMProxy(98)) - Connecting to ResourceManager at /0.0.0.0:8031
2016-06-17 19:01:04,697 INFO [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:getNMContainerStatuses(429)) - Sending out 0 NM container statuses: []
2016-06-17 19:01:04,701 INFO [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(268)) - Registering with RM using containers :[]
2016-06-17 19:01:05,815 INFO [main] ipc.Client (Client.java:handleConnectionFailure(867)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2016-06-17 19:01:06,816 INFO [main] ipc.Client (Client.java:handleConnectionFailure(867)) - Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
For some reason it says it is connecting to 0.0.0.0. When I ssh into one of the data nodes and ping resource-manager I get a response so it is able to resolve the hostname.
This leads me to believe that an options is incorrect in my yarn-site.xml as my nodes are trying to connect to 0.0.0.0:8031 instead of the resource-manager:8031
I am running a Cloudera hadoop cluster on dockers and am having issues with the Yarn resource manager being able to see the other nodes. They way it is set up is as follows:
Node1 - Namenode (hadoop-hdfs-namenode)
Node 2 - Secondary Namenode (hadoop-hdfs-secondarynamenode)
Node 3 - Yarn Resource-Manager (hadoop-yarn-resourcemanager)
Node 4 - datanode and node manager (hadoop-hdfs-datanode, hadoop-yarn-nodemanager)
Node 5 - datanode and node manager (hadoop-hdfs-datanode, hadoop-yarn-nodemanager)
When I go to namenode:50070 I am able to see both nodes. However, when I go to the resource-manager:8088 it shows I have zero nodes. My yarn-site.xml file which is on every node is as follows:
<configuration>
<property>
<name>yarn.resourcemanager.address</name>
<value>resource-manager:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>resource-manager:8030</value>
</property>
<property>
<description>Classpath for typical applications.</description>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>file:///data/1/yarn/local,file:///data/2/yarn/local,file:///data/3/yarn/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>file:///data/1/yarn/logs,file:///data/2/yarn/logs,file:///data/3/yarn/logs</value>
</property>
<property>
<name>yarn.log.aggregation-enable</name>
<value>true</value>
</property>
<property>
<description>Where to aggregate logs</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>hdfs://namenode:8020/var/log/hadoop-yarn/apps</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>resource-manager:8088</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>resource-manager:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>resource-manager:8033</value>
</property>
<property>
<description>
Number of seconds after an application finishes before the nodemanager's
DeletionService will delete the application's localized file directory
and log directory.
To diagnose Yarn application problems, set this property's value large
enough (for example, to 600 = 10 minutes) to permit examination of these
directories. After changing the property's value, you must restart the
nodemanager in order for it to have an effect.
The roots of Yarn applications' work directories is configurable with
the yarn.nodemanager.local-dirs property (see below), and the roots
of the Yarn applications' log directories is configurable with the
yarn.nodemanager.log-dirs property (see also below).
</description>
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>600</value>
</property>
</configuration>
Does anyone have any ideas as to why this is the case?
Thanks for reading.
Specify:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master-1</value>
</property>
As indicated in the edit it appeared as if the yarn-site.xml was not being picked up and only defaults were happening. I solved this be copying the yarn-site.xml file into every directory on the machine as user root. I then ran the node-manager as to make it error reading the file as it does not run under user root. The log directed me to where it expected the file which was in a yarn specific directory instead of the general hadoop directory.

Hbase connection about zookeeper error

Environment : Ubuntu 14.04 , hadoop-2.2.0 , hbase-0.98.7
when i start hadoop and hbase(single node mode), both all success (I also check the website 8088 for hadoop, 60010 for hbase)
jps
4507 SecondaryNameNode
5350 HRegionServer
4197 NameNode
4795 NodeManager
3948 QuorumPeerMain
5209 HMaster
4678 ResourceManager
5831 Jps
4310 DataNode
but when i check hbase-hadoop-master-localhost.log, i found a information following
2014-10-23 14:16:11,392 INFO [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2014-10-23 14:16:11,426 INFO [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
i have google lot of website for that unknown error problem, but i can't solve this problem...
Following is my hadoop and hbase configuration
Hadoop :
salves content : localhost
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:8020</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>localhost:9001</value>
<description>host is the hostname of the resource manager and
port is the port on which the NodeManagers contact the Resource Manager.
</description>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>localhost:9002</value>
<description>host is the hostname of the resourcemanager and port is the port
on which the Applications in the cluster talk to the Resource Manager.
</description>
</property>
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
<description>In case you do not want to use the default scheduler</description>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:9003</value>
<description>the host is the hostname of the ResourceManager and the port is the port on
which the clients can talk to the Resource Manager. </description>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value></value>
<description>the local directories used by the nodemanager</description>
</property>
<property>
<name>yarn.nodemanager.address</name>
<value>localhost:9004</value>
<description>the nodemanagers bind to this port</description>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>10240</value>
<description>the amount of memory on the NodeManager in GB</description>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/app-logs</value>
<description>directory on hdfs where the application logs are moved to </description>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value></value>
<description>the directories used by Nodemanagers as log directories</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>shuffle service that needs to be set for Map Reduce to run </description>
</property>
</configuration>
Hbase:
hbase-env.sh :
..
export JAVA_HOME="/usr/lib/jvm/java-7-oracle"
..
export HBASE_MANAGES_ZK=true
..
hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:8020/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
</configuration>
regionserver content : localhost
my /etc/hosts content:
127.0.0.1 localhost
#127.0.1.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
I try lots of methods to solve it, but all fail, please help me to solve it, i really need to know how to solve.
Originally, i run a mapreuce program and when map 67% reduce 0%, it print out some INFO and some of INFO is following:
14/10/23 15:50:41 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=60000 watcher=org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher#ce1472
14/10/23 15:50:41 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
14/10/23 15:50:41 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
14/10/23 15:50:41 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x1493be510380007, negotiated timeout = 40000
14/10/23 15:50:43 INFO mapred.LocalJobRunner: map > sort
14/10/23 15:50:46 INFO mapred.LocalJobRunner: map > sort
then it crash.. I think program maybe in dead lock and that is what i want to solve zookeeper problem above.
If want another configuration file i set in hadoop or hbase or others, just tell me, i'll post up.
thanks!
I don't think zookeeper is your problem. You should look at your other logs for more information about your map/reduce job status. Check the datanode and namenode logs for errors along with the yarn log messages through the yarn job tracker ui.
ZooKeeper Messages
Those messages are from zookeeper trying to connect with the Zookeeper sasl client. If sasl is not configured the client will still be able to connect but the connection won't be authenticated.
Error message comes from this file
ZooKeeperSaslClient.java
150 // The user did not override the default context. It might be that they just don't intend to use SASL,
151 // so log at INFO, not WARN, since they don't expect any SASL-related information.
152 String msg = "Will not attempt to authenticate using SASL ";
153 if (runtimeException != null) {
154 msg += "(" + runtimeException + ")";
155 } else {
156 msg += "(unknown error)";
157 }
158 this.configStatus = msg;
159 this.isSASLConfigured = false;
160 }
If you want to get rid of the error you will have to configure zookeeper to use sasl. Sorry don't have any experience with configuration sasl for zookeeper.
ZooKeeper SASL Configuration
Add follwing properties in hbase-site.xml file
<property>
<name>hbase.zookeeper.quorum</name>
<value>192.168.56.101</value> #this is my server ip
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
restart ./start-hbase.sh
This is how I solved it - after I tried including hbase-site.xml in classpath and giving the zookeeper quorum value as -Dhbase.zookeeper.quorum - and they did not work.
I copied hbase-site.xml to the same folder as my jar and then did
jar uf myjar.jar hbase-site.xml
And then I ran hadoop jar myjar.jar Blah
This fixed the problem

yarn hadoop 2.4.0: info message: ipc.Client Retrying connect to server

i've searched for two days for a solution. but nothing worked.
First, i'm new to the whole hadoop/yarn/hdfs topic and want to configure a small cluster.
the message above doesn't show up everytime i run an example from the mapreduce-examples.jar
sometimes teragen works, sometimes not.
in some cases the whole job failed, in others the job finishes successfully. sometimes the job failes, without printing the message above.
14/06/08 15:42:46 INFO ipc.Client: Retrying connect to server: FQDN-HOSTNAME/XXX.XX.XX.XXX:53022. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
this message is print 30 times. also the port (in code example: 53022) changes with every time a job is started.
if job finished succesfuly, this is print
14/06/08 15:34:20 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
14/06/08 15:34:20 INFO mapreduce.Job: Job job_1402234146062_0002 running in uber mode : false
14/06/08 15:34:20 INFO mapreduce.Job: map 100% reduce 100%
14/06/08 15:34:20 INFO mapreduce.Job: Job job_1402234146062_0002 completed successfully
if it fails,this is shown.
INFO mapreduce.Job: Job job_1402234146062_0005 failed with state FAILED due to: Task failed task_1402234146062_0005_m_000002
Job failed as tasks failed. failedMaps:1 failedReduces:0
in this case, some tasks failed. but in log files of nodemanager, datanode, resourcemanager, ... is no reason or message to find.
INFO mapreduce.Job: Task Id : attempt_1402234146062_0006_m_000002_1, Status : FAILED
Additional Information about my Configuration:
used OS: centOS 6.5
Java Version: OpenJDK Runtime Environment (rhel-2.4.7.1.el6_5-x86_64 u55-b13)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.address</name>
<value>FQDN-HOSTNAME:8050</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.localizer.address</name>
<value>FQDN-HOSTNAME:8040</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>FQDN-HOSTNAME:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>FQDN-HOSTNAME:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>FQDN-HOSTNAME:8032</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions </name>
<value>false </value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///var/data/hadoop/hdfs/nn</value>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>file:///var/data/hadoop/hdfs/snn</value>
</property>
<property>
<name>fs.checkpoint.edits.dir</name>
<value>file:///var/data/hadoop/hdfs/snn</value>
<name>fs.checkpoint.edits.dir</name>
<value>file:///var/data/hadoop/hdfs/snn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///var/data/hadoop/hdfs/dn</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.cluster.temp.dir</name>
<value>/mapred/tempDir</value>
</property>
<property>
<name>mapreduce.cluster.local.dir</name>
<value>/mapred/localDir</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>FQDN-HOSTNAME:10020</value>
</property>
</configuration>
I hope somebody could help me. :)
Thank you,
Norman
The job finishes sometimes successfully because when you have one reducer and that reduce task by chance is sent to a working node manager then it becomes successful job.
You have to make sure that FQDN-HOSTNAME is written exactly the same way in the slaves file. If I remember correctly, my solution was that I removed the entry for the hostname mapping in /etc/hosts, that is commenting it out like this:
#127.0.0.1 FQDN-HOSTNAME
This is a bug in how the MR AppMaster starts up with ephemeral ports. It exists in Hadoop 2.6.0 release version as well.
I have figured out a fix to this bug and created a JIRA on the MAPREDUCE project along with a comment on how to fix it.
https://issues.apache.org/jira/browse/MAPREDUCE-6338
Another possible solution for this, is to check for the firewall in all the nodes.
If you're dealing with iptables, you can run this on every node:
# /etc/init.d/iptables save
# /etc/init.d/iptables stop
That will stop the firewall until next restart, but it should be enough for you to test the cluster. You don't have to restart yarn or anything, just run the job again.
If you want to completely stop the FW:
# chkconfig iptables off
Definitely a bug, this post provides a clearer insight into what is happening.
https://groups.google.com/a/cloudera.org/forum/#!msg/cdh-user/P1rfMQmYVWk/eARZXHUTkW0J
We are planning on getting around this issue by reducing the ephemeral port range, thus limiting what ports are grabbed, and then configuring iptables to allow for that port range. Setting the port ranges is explained here -
http://www.ncftp.com/ncftpd/doc/misc/ephemeral_ports.html
if you see a message like
INFO ipc.Client: Retrying connect to server: <hostname>/<ip>:<port>. Already tried 1 time(s); maxRetries=3
Need to check:
check your firewall between client and Node Manager
check yarn.app.mapreduce.am.job.client.port-range by default the he range is all possible ports
Wow! Are these answers for real?? Talking about FQDN when the job clearly completes...as long as firewall is disabled?? And the OP even put the detailed log messages / configuration.
The problem is that yarn.app.mapreduce.am.job.client.port-range is not being honored. I'm running into it also.
Firewall off...all is well (and I can see the ephemeral ports from yarn job).
Firewall on...all times outs (eventually).
Horton completely ignores this question on other boards.
So here's a log output from a job which demonstrates the problem. In first case, I have the firewall enabled on the client(s) based on Horton's doc (along with other ports I discovered by looking very closely at my installation). You will see the process timing out...and then all of a sudden working. Because I disabled the firewall after watching the job output :)
2015-01-15 16:48:22,943 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: de-luster-l2723nraqsy5-ywhniidze3lb-qfk4asn77vc5/10.0.0.41:52015. Already tried 39 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-01-15 16:48:23,349 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: /hadoop/yarn/local/usercache/l.admin/appcache/application_1420482341308_0020
2015-01-15 16:48:24,122 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2015-01-15 16:48:24,656 INFO [main] org.apache.hadoop.mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2015-01-15 16:48:24,724 INFO [main] org.apache.hadoop.mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle#7f94ee59
2015-01-15 16:48:24,792 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: MergerManager: memoryLimit=534354336, maxSingleShuffleLimit=133588584, mergeThreshold=352673888, ioSortFactor=100, memToMemMergeOutputsThreshold=100
Did ya see it?? Problem with timeout...then all of a sudden Shuffle commences. Nothing to do with FQDNs after all :)

How to change address 'hadoop jar' command is connecting to?

I have been trying to start a MapReduce job on my cluster with the following command:
bin/hadoop jar myjar.jar MainClass /user/hduser/input /user/hduser/output
But I get the following error over and over again, until connection is refused:
13/08/08 00:37:16 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
I then checked with netstat to see if the service was listening to the correct port:
~> sudo netstat -plten | grep java
tcp 0 0 10.1.1.4:54310 0.0.0.0:* LISTEN 10022 38365 11366/java
tcp 0 0 10.1.1.4:54311 0.0.0.0:* LISTEN 10022 32164 11829/java
Now I notice that my service is listening to port 10.1.1.4:54310, which is the IP of my master, but it seems that the 'hadoop jar' command is connecting to 127.0.0.1 (the localhost, which is the same machine) but therefore doesn't find the service. Is there anyway to force 'hadoop jar' to look at 10.1.1.4 instead of 127.0.0.1?
My NameNode, DataNode, JobTracker, TaskTracker, ... are all running. I even checked for DataNode and TaskTracker on the slaves and it all seems to be working. I can check the WebUI on the master and it shows my cluster is online.
I expect the problem to be DNS related since it seems that the 'hadoop jar' command finds the correct port, but always uses the 127.0.0.1 address instead of the 10.1.1.4
UPDATE
Configuration in core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
Configuration in mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
Configuration in hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
Although it seemed to be a DNS issue, it was actually Hadoop trying to resolve a reference to localhost in the code. I was deploying the jar of someone else and assumed it was correct. Upon further inspection I found the reference to localhost and changed it to master, solving my issue.

Resources