Job jumps from RUNNING to PREP state - hadoop

When I run a MapReduce job, it jumps from RUNNING to PREP state. I have looked at the MapReduce logs and I haven't found any exception, so I am wondering if this is a problem related to the YARN configuration. I have looked at the configuration in mapred-site.xml [2], and the memory size seems correct. I am running on a PC with 16 cores and 64GB of RAM, although I have set MapReduce to run with 32GB (<name>yarn.nodemanager.resource.memory-mb</name> <value>32218</value>). Any suggestions for debugging this?
[1] Job status
Total jobs:1
JobId State StartTime UserName Queue Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem AM info
job_1379101056979_0001 PREP 1379101096477 root default NORMAL 0 0 0M 0M
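(For reference, this listing is the output of the MapReduce job-list command; a sketch, assuming the stock Hadoop CLI is on the PATH:
hadoop job -list all
or mapred job -list on newer releases, where the hadoop job form is deprecated.)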
[2] mapred-site.xml
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property> <name>mapreduce.framework.name</name> <value>yarn</value> </property>
<property> <name>mapreduce.jobhistory.done-dir</name> <value>/root/Programs/hadoop/logs/history/done</value> </property>
<property> <name>mapreduce.jobhistory.intermediate-done-dir</name> <value>/root/Programs/hadoop/logs/history/intermediate-done-dir</value> </property>
<property> <name>mapreduce.job.reduces</name> <value>4</value> </property>
<!-- property> <name>yarn.nodemanager.resource.memory-mb</name> <value>8240</value> </property -->
<property> <name>yarn.nodemanager.resource.memory-mb</name> <value>24240</value> </property>
<property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>1024</value> </property>
<!-- property><name>mapreduce.task.files.preserve.failedtasks</name><value>true</value></property>
<property><name>mapreduce.task.files.preserve.filepattern</name><value>*</value></property -->
</configuration>
I don't know what is happening here, so I am posting part of the log of a job. I notice that the container where the job is running got a CONTAINER_STOP signal. Can anyone help me figure out what is going on?
2016-10-17 09:57:23,233 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1476697963637_0001_01_000022
2016-10-17 09:57:23,233 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=ubuntu IP=172.30.0.231 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1476697963637_0001 CONTAINERID=container_1476697963637_0001_01_000022
2016-10-17 09:57:23,263 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1476697963637_0001_01_000020 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2016-10-17 09:57:23,263 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1476697963637_0001_01_000022 transitioned from RUNNING to KILLING
2016-10-17 09:57:23,321 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1476697963637_0001_01_000022
2016-10-17 09:57:23,341 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /home/ubuntu/tmp/hadoop-temp/nm-local-dir/usercache/ubuntu/appcache/application_1476697963637_0001/container_1476697963637_0001_01_000020
2016-10-17 09:57:23,404 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 27978 for container-id container_1476697963637_0001_01_000042: 263.0 MB of 1 GB physical memory used; 1.8 GB of 2.1 GB virtual memory used
2016-10-17 09:57:23,559 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=ubuntu OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1476697963637_0001 CONTAINERID=container_1476697963637_0001_01_000020
2016-10-17 09:57:23,559 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1476697963637_0001_01_000020 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2016-10-17 09:57:23,559 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1476697963637_0001_01_000020 from application application_1476697963637_0001
2016-10-17 09:57:23,559 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1476697963637_0001_01_000020 for log-aggregation
2016-10-17 09:57:23,559 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1476697963637_0001
2016-10-17 09:57:23,570 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1476697963637_0001_01_000022 is : 143
2016-10-17 09:57:23,571 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1476697963637_0001_01_000022 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2016-10-17 09:57:23,571 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /home/ubuntu/tmp/hadoop-temp/nm-local-dir/usercache/ubuntu/appcache/application_1476697963637_0001/container_1476697963637_0001_01_000022
2016-10-17 09:57:23,572 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=ubuntu OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1476697963637_0001 CONTAINERID=container_1476697963637_0001_01_000022
2016-10-17 09:57:23,572 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1476697963637_0001_01_000022 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2016-10-17 09:57:23,572 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1476697963637_0001_01_000022 from application application_1476697963637_0001
2016-10-17 09:57:23,572 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1476697963637_0001_01_000022 for log-aggregation
2016-10-17 09:57:23,572 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1476697963637_0001
2016-10-17 09:57:23,670 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 27820 for container-id container_1476697963637_0001_01_000040: 266.3 MB of 1 GB physical memory used; 1.8 GB of 2.1 GB virtual memory used

I had this issue; restarting Cloudera and YARN solved it.
If restarting doesn't work, try checking the ports in job.properties - there might be a problem with the namenode and jobtracker ports. Make sure your jobtracker port is correct in the job.properties file.
Also check the map-reduce cluster slots; it might be running out of slots.
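For example, a few standard YARN CLI checks that can narrow this down (a sketch; the application ID is the job ID above with the job_ prefix swapped for application_):
yarn node -list -all                                      # NodeManager health and available memory
yarn application -status application_1379101056979_0001   # current state and diagnostics for the stuck app
yarn logs -applicationId application_1379101056979_0001   # aggregated container logs (needs log aggregation enabled)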

Related

Run HDFS pseudo mode in a docker container

I'm trying to run HDFS in pseudo-distributed mode in a docker container, configured per this page: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation, but I didn't use the start-all.sh script, as the container isn't supposed to be able to do ssh, so I manually ran bin/hdfs --daemon start namenode|datanode to start them one by one. The problem is that I can see the namenode started successfully, but the datanode quit without any error message. The last piece of the datanode log is:
...
2018-04-09 21:04:03,830 INFO org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: Scheduling a check for [DISK]file:/apps/hadoop/hdfs/data
2018-04-09 21:04:04,188 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2018-04-09 21:04:04,296 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2018-04-09 21:04:04,296 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system started
2018-04-09 21:04:04,665 INFO org.apache.hadoop.hdfs.server.common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
2018-04-09 21:04:04,667 INFO org.apache.hadoop.hdfs.server.datanode.BlockScanner: Initialized block scanner with targetBytesPerSec 1048576
2018-04-09 21:04:04,671 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Configured hostname is hdfs
2018-04-09 21:04:04,671 INFO org.apache.hadoop.hdfs.server.common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
2018-04-09 21:04:04,677 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting DataNode with maxLockedMemory = 0
2018-04-09 21:04:04,733 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened streaming server at /0.0.0.0:9866
2018-04-09 21:04:04,735 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwidth is 10485760 bytes/s
2018-04-09 21:04:04,735 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Number threads for balancing is 50
core-site.xml file:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost</value>
</property>
</configuration>
And hdfs-site.xml is
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/apps/hadoop/hdfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/apps/hadoop/hdfs/data</value>
</property>
</configuration>
Did I miss anything here?
I think it is a base image issue. I was using Alpine; once I changed to CentOS, the datanode works! Something must be missing from Alpine - I'd appreciate it if anyone knows what it is, as the CentOS-based image will eventually be much bigger than Alpine.
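If anyone wants to dig into the Alpine case: one way to surface a silent exit like this is to run the datanode in the foreground, so whatever the JVM prints goes to the console instead of a log file (a sketch; run from $HADOOP_HOME, using the same binary the --daemon form uses):
bin/hdfs datanode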

Hadoop datanode routing issue on Kubernetes

I'm trying to set up a sample Hadoop cluster on OpenShift/Kubernetes/Docker (OpenShift 3.5), and I've run into the following issue:
Only one datanode gets registered on the namenode at a time, because the namenode sees all datanodes under the same IP (192.168.20.1). This is apparently due to a network route in the cluster.
Actual sample configuration:
Namenode
192.168.20.119 hadoop-namenode-10-qp83z
Datanodes
192.168.20.132 hadoop-slave-0.hadoop-slave.my-project.svc.cluster.local hadoop-slave-0
192.168.20.133 hadoop-slave-1.hadoop-slave.my-project.svc.cluster.local hadoop-slave-1
192.168.20.134 hadoop-slave-2.hadoop-slave.my-project.svc.cluster.local hadoop-slave-2
Namenode log:
17/12/05 22:11:21 INFO net.NetworkTopology: Removing a node: /default-rack/192.168.20.1:50010
17/12/05 22:11:21 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.20.1:50010
17/12/05 22:11:21 INFO blockmanagement.BlockReportLeaseManager: Registered DN f3c22144-f9cf-47dc-b0b7-bf946121ee81 (192.168.20.1:50010).
17/12/05 22:11:21 INFO blockmanagement.DatanodeDescriptor: Adding new storage ID DS-6f7b2565-1e85-491a-ab04-69a7ffa25d5c for DN 192.168.20.1:50010
17/12/05 22:11:21 INFO BlockStateChange: BLOCK* processReport 0x9c1289bc1f9f766f: Processing first storage report for DS-6f7b2565-1e85-491a-ab04-69a7ffa25d5c from datanode f3c22144-f9cf-47dc-b0b7-bf946121ee81
17/12/05 22:11:21 INFO BlockStateChange: BLOCK* processReport 0x9c1289bc1f9f766f: from storage DS-6f7b2565-1e85-491a-ab04-69a7ffa25d5c node DatanodeRegistration(192.168.20.1, datanodeUuid=f3c22144-f9cf-47dc-b0b7-bf946121ee81, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-6b84af8f-fe9a-465a-840e-6acb0fe5f8d9;nsid=399770301;c=0), blocks: 0, hasStaleStorage: false, processing time: 0 msecs, invalidatedBlocks: 0
17/12/05 22:11:21 INFO hdfs.StateChange: BLOCK* registerDatanode: from DatanodeRegistration(192.168.20.1, datanodeUuid=2bd926b9-b00e-4eb6-858d-3e90fa6b3ef8, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-6b84af8f-fe9a-465a-840e-6acb0fe5f8d9;nsid=399770301;c=0) storage 2bd926b9-b00e-4eb6-858d-3e90fa6b3ef8
17/12/05 22:11:21 INFO namenode.NameNode: BLOCK* registerDatanode: 192.168.20.1:50010
Configuration (hdfs-site.xml):
<property>
<name>dfs.datanode.use.datanode.hostname</name>
<value>true</value> <!-- same result with false -->
</property>
<property>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value> <!-- same result with false -->
</property>
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>false</value>
</property>
Output of ip route on all pods:
ip route
default via 192.168.20.1 dev eth0
192.168.0.0/16 dev eth0
192.168.20.0/24 dev eth0 proto kernel scope link src 192.168.20.134
224.0.0.0/4 dev eth0
The issue is strikingly similar to the one described in Why is Dockerized Hadoop datanode registering with the wrong IP address?, but now in the context of a Kubernetes cluster.
Any ideas?
Does this help?
"Famous last words
Before you scale down the datanode StatefulSet, you need to tell Hadoop that one datanode will go away ;)"
See http://b4mad.net/datenbrei/openshift/hadoop-hdfs/
See also https://gitlab.com/goern/hdfs-openshift
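One quick way to verify any fix is a standard HDFS admin command, run on the namenode:
hdfs dfsadmin -report
It lists every registered datanode along with the address it registered under, so you can see whether the three pods show up individually or still collapse into 192.168.20.1.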

Run spark-shell with error: SparkContext: Error initializing SparkContext

I installed Spark on three nodes successfully. I can visit the Spark web UI and see that every worker node and the master node are active.
I can run the SparkPi example successfully.
My cluster info:
10.45.10.33(master&worker,hadoop-master,hadoop-slave)
10.45.10.34(worker,hadoop-slave)
10.45.10.35(worker,hadoop-slave)
But when I try to run "spark-shell --master yarn", it gives this exception:
16/09/12 19:50:29 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:101)
at $line3.$read$$iw$$iw.<init>(<console>:15)
at $line3.$read$$iw.<init>(<console>:31)
at $line3.$read.<init>(<console>:33)
at $line3.$read$.<init>(<console>:37)
at $line3.$read$.<clinit>(<console>)
at $line3.$eval$.$print$lzycompute(<console>:7)
at $line3.$eval$.$print(<console>:6)
at $line3.$eval.$print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37)
at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
at org.apache.spark.repl.Main$.doMain(Main.scala:68)
at org.apache.spark.repl.Main$.main(Main.scala:51)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/09/12 19:50:29 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
16/09/12 19:50:29 WARN MetricsSystem: Stopping a MetricsSystem that is not running
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:101)
... 47 elided
<console>:14: error: not found: value spark
import spark.implicits._
^
<console>:14: error: not found: value spark
import spark.sql
^
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.0.0
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Here is my configuration:
1. spark-env.sh
export JAVA_HOME=/root/Downloads/jdk1.8.0_77
export SPARK_HOME=/root/Downloads/spark-2.0.0-bin-without-hadoop
export HADOOP_HOME=/root/Downloads/hadoop-2.7.2
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_DIST_CLASSPATH=$(/root/Downloads/hadoop-2.7.2/bin/hadoop classpath)
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_LIBARY_PATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib:$HADOOP_HOME/lib/native
SPARK_MASTER_HOST=10.45.10.33
SPARK_MASTER_WEBUI_PORT=28686
SPARK_LOCAL_DIRS=/root/Downloads/spark-2.0.0-bin-without-hadoop/sparkdata/local
SPARK_WORKER_DIR=/root/Downloads/spark-2.0.0-bin-without-hadoop/sparkdata/work
SPARK_LOG_DIR=/root/Downloads/spark-2.0.0-bin-without-hadoop/logs
2. spark-defaults.conf
spark.eventLog.enabled true
spark.eventLog.dir hdfs://10.45.10.33/spark-event-log
3. slaves
10.45.10.33
10.45.10.34
10.45.10.35
Here is some log info:
yarn job logs:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/Downloads/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/Downloads/hadoop-2.7.2/share/hadoop/common/lib/alluxio-core-client-1.2.0-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/Downloads/alluxio-master/core/client/target/alluxio-core-client-1.2.0-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/09/14 11:21:08 INFO SignalUtils: Registered signal handler for TERM
16/09/14 11:21:08 INFO SignalUtils: Registered signal handler for HUP
16/09/14 11:21:08 INFO SignalUtils: Registered signal handler for INT
16/09/14 11:21:14 INFO ApplicationMaster: Preparing Local resources
16/09/14 11:21:15 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
yarn logs on running node:
2016-09-14 01:26:41,321 WARN alluxio.logger.type: Worker Client last execution took 2271 ms. Longer than the interval 1000
2016-09-14 06:13:10,905 WARN alluxio.logger.type: Worker Client last execution took 1891 ms. Longer than the interval 1000
2016-09-14 08:41:36,122 WARN alluxio.logger.type: Worker Client last execution took 1625 ms. Longer than the interval 1000
2016-09-14 10:41:49,426 WARN alluxio.logger.type: Worker Client last execution took 2441 ms. Longer than the interval 1000
2016-09-14 11:18:44,355 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1473752235721_0009_000002 (auth:SIMPLE)
2016-09-14 11:18:45,319 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1473752235721_0009_02_000001 by user root
2016-09-14 11:18:45,447 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Creating a new application reference for app application_1473752235721_0009
2016-09-14 11:18:45,601 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root IP=10.45.10.33 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1473752235721_0009 CONTAINERID=container_1473752235721_0009_02_000001
2016-09-14 11:18:45,811 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Application application_1473752235721_0009 transitioned from NEW to INITING
2016-09-14 11:18:45,815 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Adding container_1473752235721_0009_02_000001 to application application_1473752235721_0009
2016-09-14 11:18:45,865 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Application application_1473752235721_0009 transitioned from INITING to RUNNING
2016-09-14 11:18:46,060 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1473752235721_0009_02_000001 transitioned from NEW to LOCALIZING
2016-09-14 11:18:46,060 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1473752235721_0009
2016-09-14 11:18:46,211 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://10.45.10.33:8020/user/root/.sparkStaging/application_1473752235721_0009/__spark_libs__8339309767420855025.zip transitioned from INIT to DOWNLOADING
2016-09-14 11:18:46,211 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://10.45.10.33:8020/user/root/.sparkStaging/application_1473752235721_0009/__spark_conf__.zip transitioned from INIT to DOWNLOADING
2016-09-14 11:18:46,223 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1473752235721_0009_02_000001
2016-09-14 11:18:47,083 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /tmp/hadoop-root/nm-local-dir/nmPrivate/container_1473752235721_0009_02_000001.tokens. Credentials list:
2016-09-14 11:18:47,658 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Initializing user root
2016-09-14 11:18:47,761 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Copying from /tmp/hadoop-root/nm-local-dir/nmPrivate/container_1473752235721_0009_02_000001.tokens to /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1473752235721_0009/container_1473752235721_0009_02_000001.tokens
2016-09-14 11:18:47,765 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Localizer CWD set to /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1473752235721_0009 = file:/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1473752235721_0009
2016-09-14 11:20:54,352 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://10.45.10.33:8020/user/root/.sparkStaging/application_1473752235721_0009/__spark_libs__8339309767420855025.zip(->/tmp/hadoop-root/nm-local-dir/usercache/root/filecache/10/__spark_libs__8339309767420855025.zip) transitioned from DOWNLOADING to LOCALIZED
2016-09-14 11:20:55,049 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://10.45.10.33:8020/user/root/.sparkStaging/application_1473752235721_0009/__spark_conf__.zip(->/tmp/hadoop-root/nm-local-dir/usercache/root/filecache/11/__spark_conf__.zip) transitioned from DOWNLOADING to LOCALIZED
2016-09-14 11:20:55,052 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1473752235721_0009_02_000001 transitioned from LOCALIZING to LOCALIZED
2016-09-14 11:20:57,298 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1473752235721_0009_02_000001 transitioned from LOCALIZED to RUNNING
2016-09-14 11:20:57,509 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1473752235721_0009/container_1473752235721_0009_02_000001/default_container_executor.sh]
2016-09-14 11:20:58,338 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1473752235721_0009_02_000001
2016-09-14 11:21:07,134 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 26593 for container-id container_1473752235721_0009_02_000001: 50.3 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used
2016-09-14 11:21:15,218 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 26593 for container-id container_1473752235721_0009_02_000001: 90.9 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2016-09-14 11:21:15,224 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Process tree for container: container_1473752235721_0009_02_000001 has processes older than 1 iteration running over the configured limit. Limit=2254857728, current usage = 2424918016
2016-09-14 11:21:15,412 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=26593,containerID=container_1473752235721_0009_02_000001] is running beyond virtual memory limits. Current usage: 90.9 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1473752235721_0009_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 26593 26591 26593 26593 (bash) 1 0 115838976 119 /bin/bash -c /usr/java/jdk1.8.0_91/bin/java -server -Xmx512m -Djava.io.tmpdir=/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1473752235721_0009/container_1473752235721_0009_02_000001/tmp -Dspark.yarn.app.container.log.dir=/root/Downloads/hadoop-2.7.2/logs/userlogs/application_1473752235721_0009/container_1473752235721_0009_02_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg '10.45.10.33:54976' --properties-file /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1473752235721_0009/container_1473752235721_0009_02_000001/__spark_conf__/__spark_conf__.properties 1> /root/Downloads/hadoop-2.7.2/logs/userlogs/application_1473752235721_0009/container_1473752235721_0009_02_000001/stdout 2> /root/Downloads/hadoop-2.7.2/logs/userlogs/application_1473752235721_0009/container_1473752235721_0009_02_000001/stderr
|- 26597 26593 26593 26593 (java) 811 62 2309079040 23149 /usr/java/jdk1.8.0_91/bin/java -server -Xmx512m -Djava.io.tmpdir=/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1473752235721_0009/container_1473752235721_0009_02_000001/tmp -Dspark.yarn.app.container.log.dir=/root/Downloads/hadoop-2.7.2/logs/userlogs/application_1473752235721_0009/container_1473752235721_0009_02_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg 10.45.10.33:54976 --properties-file /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1473752235721_0009/container_1473752235721_0009_02_000001/__spark_conf__/__spark_conf__.properties
2016-09-14 11:21:15,451 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Removed ProcessTree with root 26593
2016-09-14 11:21:15,469 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1473752235721_0009_02_000001 transitioned from RUNNING to KILLING
2016-09-14 11:21:15,471 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1473752235721_0009_02_000001
2016-09-14 11:21:15,891 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1473752235721_0009_02_000001 is : 143
2016-09-14 11:21:19,717 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1473752235721_0009_02_000001 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2016-09-14 11:21:19,797 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1473752235721_0009/container_1473752235721_0009_02_000001
2016-09-14 11:21:19,811 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1473752235721_0009 CONTAINERID=container_1473752235721_0009_02_000001
2016-09-14 11:21:19,813 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1473752235721_0009_02_000001 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2016-09-14 11:21:19,813 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Removing container_1473752235721_0009_02_000001 from application application_1473752235721_0009
2016-09-14 11:21:19,813 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1473752235721_0009
2016-09-14 11:21:21,458 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1473752235721_0009_02_000001
2016-09-14 11:21:21,531 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1473752235721_0009_02_000001]
2016-09-14 11:21:21,536 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Application application_1473752235721_0009 transitioned from RUNNING to APPLICATION_RESOURCES_CLEANINGUP
2016-09-14 11:21:21,572 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event APPLICATION_STOP for appId application_1473752235721_0009
2016-09-14 11:21:21,585 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Application application_1473752235721_0009 transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED
2016-09-14 11:21:21,589 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler: Scheduling Log Deletion for application: application_1473752235721_0009, with delay of 10800 seconds
2016-09-14 11:21:21,592 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1473752235721_0009
How do I solve this problem? Can anyone give some advice?
I was receiving this ERROR: 'Attempted to request executors before the AM has registered!'
and landed on this page without an answer. If anyone has the same error: for me, the solution was to open the Spark ports.
On Spark 3.1.2, running on Ubuntu 20.04, you have to specify some things in the cluster so the ports don't get assigned randomly:
in spark-defaults.conf:
spark.driver.bindAddress 10.0.0.1
spark.driver.host 10.0.0.1
spark.shuffle.service.port 7337
spark.ui.port 4040
spark.blockManager.port 31111
spark.driver.blockManager.port 32222
spark.driver.port 33333
in spark-env.sh:
SPARK_LOCAL_IP=10.0.0.1
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export YARN_CONF_DIR=/opt/hadoop/etc/hadoop
and in the workers file you put the addresses of the datanodes.
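Separately, the NodeManager log in the question shows the actual kill reason: "is running beyond virtual memory limits. Current usage: 90.9 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used. Killing container." A common mitigation for that specific failure, sketched here with the standard YARN properties (tune for your cluster rather than copying blindly), goes in yarn-site.xml:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<!-- or, instead of disabling the check, raise the default 2.1 ratio -->
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
</property>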

Mapreduce job ipc.Client retrying to connect

I am testing my hadoop cluster which consists of 4 docker containers:
Datanode
Secondary Namenode
Namenode
Resource Manager
When I submit a map reduce job I notice connection issues once both map and reduce are at 100%. This then reaches the maximum number of retries before erroring out and providing a stack trace. The weird thing is that the job finishes and provides an answer; however, the node manager web interface shows a failed job. None of the questions/answers I have found so far fix my particular issue.
All my machines have exposed the port range 50100:50200 to comply with the 'yarn.app.mapreduce.am.job.client.port-range' property.
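(For reference, a hedged sketch of publishing that range when starting a container with plain docker - not necessarily the exact command used here:
docker run -p 50100-50200:50100-50200 ...
Docker accepts port ranges in -p, so the whole AM client range can be exposed with one flag.)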
The job I submit is
sudo -u hdfs hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.1.jar pi 1 1
This is the output:
Number of Maps = 1
Samples per Map = 1
Wrote input for Map #0
Starting Job
16/06/18 19:14:07 INFO client.RMProxy: Connecting to ResourceManager at resource-manager/172.19.0.2:8032
16/06/18 19:14:08 INFO input.FileInputFormat: Total input paths to process : 1
16/06/18 19:14:08 INFO mapreduce.JobSubmitter: number of splits:1
16/06/18 19:14:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1466277178029_0001
16/06/18 19:14:08 INFO impl.YarnClientImpl: Submitted application application_1466277178029_0001
16/06/18 19:14:08 INFO mapreduce.Job: The url to track the job: http://resource-manager:8088/proxy/application_1466277178029_0001/
16/06/18 19:14:08 INFO mapreduce.Job: Running job: job_1466277178029_0001
16/06/18 19:14:15 INFO mapreduce.Job: Job job_1466277178029_0001 running in uber mode : false
16/06/18 19:14:15 INFO mapreduce.Job: map 0% reduce 0%
16/06/18 19:14:19 INFO mapreduce.Job: map 100% reduce 0%
16/06/18 19:14:26 INFO mapreduce.Job: map 100% reduce 100%
16/06/18 19:14:32 INFO ipc.Client: Retrying connect to server: 01d3c03f829a/172.19.0.4:50100. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
16/06/18 19:14:33 INFO ipc.Client: Retrying connect to server: 01d3c03f829a/172.19.0.4:50100. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
16/06/18 19:14:34 INFO ipc.Client: Retrying connect to server: 01d3c03f829a/172.19.0.4:50100. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
16/06/18 19:14:36 INFO mapreduce.Job: map 0% reduce 0%
16/06/18 19:14:36 INFO mapreduce.Job: Job job_1466277178029_0001 failed with state FAILED due to: Application application_1466277178029_0001 failed 2 times due to AM Container for appattempt_1466277178029_0001_000002 exited with exitCode: 1
For more detailed output, check the application tracking page: http://resource-manager:8088/proxy/application_1466277178029_0001/ Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1466277178029_0001_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
at org.apache.hadoop.util.Shell.run(Shell.java:478)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
16/06/18 19:14:36 INFO mapreduce.Job: Counters: 0
Job Finished in 28.862 seconds
Estimated value of Pi is 4.00000000000000000000
The container log has the following:
2016-06-18 19:14:32,273 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for application appattempt_1466277178029_0001_000002
2016-06-18 19:14:32,443 WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-06-18 19:14:32,475 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Executing with tokens:
2016-06-18 19:14:32,477 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Kind: YARN_AM_RM_TOKEN, Service: , Ident: (org.apache.hadoop.yarn.security.AMRMTokenIdentifier#3514a4c0)
2016-06-18 19:14:32,515 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Using mapred newApiCommitter.
2016-06-18 19:14:33,060 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Attempt num: 2 is last retry: true because a commit was started.
2016-06-18 19:14:33,061 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.job.event.JobEventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$NoopEventHandler
2016-06-18 19:14:33,067 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.jobhistory.EventType for class org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler
2016-06-18 19:14:33,068 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.rm.ContainerAllocator$EventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter
2016-06-18 19:14:33,118 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system is set solely by core-default.xml therefore - ignoring
2016-06-18 19:14:33,141 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system is set solely by core-default.xml therefore - ignoring
2016-06-18 19:14:33,162 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system is set solely by core-default.xml therefore - ignoring
2016-06-18 19:14:33,183 INFO [main] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Emitting job history data to the timeline server is not enabled
2016-06-18 19:14:33,185 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Will not try to recover. recoveryEnabled: true recoverySupportedByCommitter: false numReduceTasks: 1 shuffleKeyValidForRecovery: true ApplicationAttemptID: 2
2016-06-18 19:14:33,210 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system is set solely by core-default.xml therefore - ignoring
2016-06-18 19:14:33,212 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Previous history file is at hdfs://namenode:9000/user/hdfs/.staging/job_1466277178029_0001/job_1466277178029_0001_1.jhist
2016-06-18 19:14:33,621 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.job.event.JobFinishEvent$Type for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler
2016-06-18 19:14:33,640 WARN [main] org.apache.hadoop.metrics2.impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-mrappmaster.properties,hadoop-metrics2.properties
2016-06-18 19:14:33,689 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2016-06-18 19:14:33,689 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MRAppMaster metrics system started
2016-06-18 19:14:33,708 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: nodeBlacklistingEnabled:true
2016-06-18 19:14:33,708 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: maxTaskFailuresPerNode is 3
2016-06-18 19:14:33,708 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: blacklistDisablePercent is 33
2016-06-18 19:14:33,739 INFO [main] org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at resource-manager/172.19.0.2:8030
2016-06-18 19:14:33,814 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: maxContainerCapability: <memory:4096, vCores:4>
2016-06-18 19:14:33,814 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: queue: root.hdfs
2016-06-18 19:14:33,837 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system is set solely by core-default.xml therefore - ignoring
2016-06-18 19:14:33,840 INFO [main] org.apache.hadoop.mapreduce.jobhistory.JobHistoryCopyService: History file is at hdfs://namenode:9000/user/hdfs/.staging/job_1466277178029_0001/job_1466277178029_0001_1.jhist
2016-06-18 19:14:33,894 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Event Writer setup for JobId: job_1466277178029_0001, File: hdfs://namenode:9000/user/hdfs/.staging/job_1466277178029_0001/job_1466277178029_0001_2.jhist
2016-06-18 19:14:33,959 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: Was asked to shut down.
2016-06-18 19:14:33,959 FATAL [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
java.io.IOException: Was asked to shut down.
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$4.run(MRAppMaster.java:1546)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1540)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1473)
2016-06-18 19:14:33,962 INFO [main] org.apache.hadoop.util.ExitUtil: Exiting with status 1
A few times it says 'Cannot locate configuration' or 'Default file system is set solely by core-default.xml'. Is this significant? In case it changes anything, I am using the Cloudera repo to install the various Hadoop services instead of unpacking a .tar.gz.
My config files are:
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode:9000</value>
</property>
<property>
<name>hadoop.proxyuser.mapred.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.mapred.hosts</name>
<value>*</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>resource-manager</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>resource-manager:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>resource-manager:8030</value>
</property>
<property>
<description>Classpath for typical applications.</description>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>file:///data/1/yarn/local,file:///data/2/yarn/local,file:///data/3/yarn/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>file:///data/1/yarn/logs,file:///data/2/yarn/logs,file:///data/3/yarn/logs</value>
</property>
<property>
<name>yarn.log.aggregation-enable</name>
<value>true</value>
</property>
<property>
<description>Where to aggregate logs</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>hdfs://namenode:8020/var/log/hadoop-yarn/apps</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>resource-manager:8088</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>resource-manager:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>resource-manager:8033</value>
</property>
<property>
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>600</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
<description>Amount of physical memory, in MB, that can be allocated for containers.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1000</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>namenode:8021</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>history-server:10020</value>
<description>Enter your JobHistoryServer hostname.</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>history-server:19888</value>
<description>Enter your JobHistoryServer hostname.</description>
</property>
<property>
<name>yarn.app.mapreduce.am.job.client.port-range</name>
<value>50100-50200</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.permissions.superusergroup</name>
<value>hadoop</value>
</property>
<property>
<name>dfs.name.dir or dfs.namenode.name.dir</name>
<value>file:///data/1/dfs/nn,file:///nfsmount/dfs/nn</value>
</property>
<property>
<name>dfs.data.dir or dfs.datanode.data.dir</name>
<value>file:///data/1/dfs/dn,file:///data/2/dfs/dn,file:///data/3/dfs/dn,file:///data/4/dfs/dn</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>namenode:50070</value>
<description>
The address and the base port on which the dfs NameNode Web UI will listen.
</description>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
Thanks for reading.
For anyone who has the same issue, the solution is to add the following to hdfs-site.xml:
<property>
<name>dfs.safemode.threshold.pct</name>
<value>0</value>
</property>
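After restarting HDFS, you can confirm the namenode actually leaves safe mode with a standard admin command (assuming the hdfs CLI is on the PATH):
hdfs dfsadmin -safemode get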

Datanode daemon not starting on datanodes hadoop

I am unable to start the datanode daemon on my cluster (version 2.2). It starts fine on the master node but simply does not start on the data nodes. No log files are created on the data nodes (they are only created for the master-node daemon) and there is no error message. I have made sure of the things below.
I am able to ssh to all data nodes from the master without a password. I have also set HADOOP_SECURE_DN_USER to "hadoop" on all nodes; this is the user I am planning to start the datanodes as.
I have added the data nodes to the slaves file, one per line.
HADOOP_HOME (/home/hadoop/hadoop-2.2.0) and HADOOP_CONF_DIR ($HADOOP_HOME/etc/hadoop) are set on ALL the nodes.
All required directories are present on the datanodes, the users are created, and IPv6 is disabled.
I added the necessary config file parameters; they are as below.
Below are log files for reference. They don't contain any errors. Note the "Network topology has 0 racks and 0 datanodes" line below, which suggests it is not recognizing all the datanodes (maybe a safe-mode thing, not sure). Any help is much appreciated.
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hadoop/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hadoop/datanode</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.log.dirs</name>
<value>/home/yarn/logs</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>1024</value>
</property>
</configuration>
Namenode Log:
2013-12-06 23:54:46,940 INFO org.apache.hadoop.hdfs.StateChange: STATE* Leaving safe mode after 1 secs
2013-12-06 23:54:46,940 INFO org.apache.hadoop.hdfs.StateChange: STATE* Network topology has 0 racks and 0 datanodes
2013-12-06 23:54:46,940 INFO org.apache.hadoop.hdfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks
2013-12-06 23:54:46,972 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2013-12-06 23:54:46,972 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 9000: starting
2013-12-06 23:54:46,975 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: NameNode RPC up at: localhost/192.168.56.1:9000
2013-12-06 23:54:46,975 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state
2013-12-06 23:55:08,530 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* registerDatanode: from DatanodeRegistration(192.168.56.1, storageID=DS-1268869381-192.168.56.1-50010-1386350725676, infoPort=50075, ipcPort=50020, storageInfo=lv=-47;cid=CID-d6194959-5a13-4d8b-8428-25134e8fb746;nsid=2144581313;c=0) storage DS-1268869381-192.168.56.1-50010-1386350725676
2013-12-06 23:55:08,535 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.56.1:50010
2013-12-06 23:55:08,717 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: BLOCK* processReport: Received first block report from 192.168.56.1:50010 after starting up or becoming active. Its block contents are no longer considered stale
2013-12-06 23:55:08,718 INFO BlockStateChange: BLOCK* processReport: from DatanodeRegistration(192.168.56.1, storageID=DS-1268869381-192.168.56.1-50010-1386350725676, infoPort=50075, ipcPort=50020, storageInfo=lv=-47;cid=CID-d6194959-5a13-4d8b-8428-25134e8fb746;nsid=2144581313;c=0), blocks: 0, processing time: 2 msecs
Datanode Log(on master node):
2013-12-06 23:55:08,469 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Adding block pool BP-1981795271-192.168.56.1-1386350567299
2013-12-06 23:55:08,470 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-1981795271-192.168.56.1-1386350567299 on volume /home/hadoop/datanode/current...
2013-12-06 23:55:08,479 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-1981795271-192.168.56.1-1386350567299 on /home/hadoop/datanode/current: 8ms
2013-12-06 23:55:08,479 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Total time to scan all replicas for block pool BP-1981795271-192.168.56.1-1386350567299: 9ms
2013-12-06 23:55:08,479 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Adding replicas to map for block pool BP-1981795271-192.168.56.1-1386350567299 on volume /home/hadoop/datanode/current...
2013-12-06 23:55:08,479 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time to add replicas to map for block pool BP-1981795271-192.168.56.1-1386350567299 on volume /home/hadoop/datanode/current: 0ms
2013-12-06 23:55:08,479 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Total time to add all replicas to map: 0ms
2013-12-06 23:55:08,485 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-1981795271-192.168.56.1-1386350567299 (storage id DS-1268869381-192.168.56.1-50010-1386350725676) service to localhost/192.168.56.1:9000 beginning handshake with NN
2013-12-06 23:55:08,560 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool Block pool BP-1981795271-192.168.56.1-1386350567299 (storage id DS-1268869381-192.168.56.1-50010-1386350725676) service to localhost/192.168.56.1:9000 successfully registered with NN
2013-12-06 23:55:08,560 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode localhost/192.168.56.1:9000 using DELETEREPORT_INTERVAL of 300000 msec BLOCKREPORT_INTERVAL of 21600000msec Initial delay: 0msec; heartBeatInterval=3000
2013-12-06 23:55:08,674 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Namenode Block pool BP-1981795271-192.168.56.1-1386350567299 (storage id DS-1268869381-192.168.56.1-50010-1386350725676) service to localhost/192.168.56.1:9000 trying to claim ACTIVE state with txid=5
2013-12-06 23:55:08,674 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Acknowledging ACTIVE Namenode Block pool BP-1981795271-192.168.56.1-1386350567299 (storage id DS-1268869381-192.168.56.1-50010-1386350725676) service to localhost/192.168.56.1:9000
2013-12-06 23:55:08,767 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 0 blocks took 2 msec to generate and 90 msecs for RPC and NN processing
2013-12-06 23:55:08,767 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: sent block report, processed command:org.apache.hadoop.hdfs.server.protocol.FinalizeCommand#38568c24
2013-12-06 23:55:08,773 INFO org.apache.hadoop.util.GSet: Computing capacity for map BlockMap
2013-12-06 23:55:08,773 INFO org.apache.hadoop.util.GSet: VM type = 64-bit
2013-12-06 23:55:08,773 INFO org.apache.hadoop.util.GSet: 0.5% max memory = 889 MB
2013-12-06 23:55:08,773 INFO org.apache.hadoop.util.GSet: capacity = 2^19 = 524288 entries
2013-12-06 23:55:08,774 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Periodic Block Verification Scanner initialized with interval 504 hours for block pool BP-1981795271-192.168.56.1-1386350567299
2013-12-06 23:55:08,778 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Added bpid=BP-1981795271-192.168.56.1-1386350567299 to blockPoolScannerMap, new size=1
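A hedged debugging suggestion for this situation: start the daemon by hand on one of the data nodes and watch what it prints, since the slaves-file machinery can swallow early failures (paths assume the HADOOP_HOME above):
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode   # the per-node start script shipped with 2.2
jps                                                 # check whether a DataNode process actually exists
$HADOOP_HOME/bin/hdfs datanode                      # or run in the foreground to see errors directly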
