I've set up a Hadoop 2.7.5 HA cluster and run Flink 1.4.0 applications on the default YARN queue. I decided to categorize applications and run them on exclusive NodeManagers, so I labeled three nodes (4 cores and 2 GB RAM each) as stream in queue streamQ and three nodes (1 core and 1 GB RAM each) as online in queue onlineQ. All the settings are displayed in the YARN web UI as desired and the nodes are identified.
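For reference, node labels of this kind are typically created and assigned with yarn rmadmin before the queues are configured; a minimal sketch with placeholder hostnames and ports (the actual node names are not shown here):
# create the two cluster-level labels (names taken from the setup above)
yarn rmadmin -addToClusterNodeLabels "stream,online"
# map each NodeManager to its label (hostnames/ports are placeholders)
yarn rmadmin -replaceLabelsOnNode "stream-node1:45454=stream stream-node2:45454=stream stream-node3:45454=stream"
yarn rmadmin -replaceLabelsOnNode "online-node1:45454=online online-node2:45454=online online-node3:45454=online"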
Here is the capacity-scheduler.xml:
<property>
<name>yarn.scheduler.capacity.maximum-applications</name>
<value>10000</value>
</property>
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>0.1</value>
</property>
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
</property>
<property>
<name>yarn.scheduler.capacity.node-locality-delay</name>
<value>40</value>
</property>
<property>
<name>yarn.scheduler.capacity.queue-mappings</name>
<value></value>
</property>
<property>
<name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
<value>false</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>streamQ,onlineQ</value>
</property>
<!-- streamQ settings -->
<property>
<name>yarn.scheduler.capacity.root.streamQ.capacity</name>
<value>0</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.accessible-node-labels</name>
<value>stream</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.accessible-node-labels.stream.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.accessible-node-labels.stream.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.default-node-label-expression</name>
<value>stream</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.state</name>
<value>RUNNING</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.acl_submit_applications</name>
<value>*</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.acl_administer_queue</name>
<value>*</value>
</property>
<!-- onlineQ settings -->
<property>
<name>yarn.scheduler.capacity.root.onlineQ.capacity</name>
<value>0</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.accessible-node-labels</name>
<value>online</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.accessible-node-labels.online.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.accessible-node-labels.online.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.default-node-label-expression</name>
<value>online</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.state</name>
<value>RUNNING</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.acl_submit_applications</name>
<value>*</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.acl_administer_queue</name>
<value>*</value>
</property>
I run the command to start the Flink session on an edge node whose Hadoop configuration is identical to the cluster's:
yarn-session.sh -n 2 -jm 768 -tm 768 -nm flink -z flink_zoo -s 3 -qu streamQ
It successfully uploads the Flink libraries to HDFS, and I can see the application in the YARN web UI, but when it tries to acquire resources it says:
2018-01-28 10:02:04,087 INFO org.apache.flink.yarn.YarnClusterDescriptor - Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster
Here are the full logs:
2018-01-28 10:00:09,648 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
2018-01-28 10:00:09,649 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2018-01-28 10:00:09,650 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 768
2018-01-28 10:00:09,650 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 768
2018-01-28 10:00:09,650 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2018-01-28 10:00:09,650 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false
2018-01-28 10:00:09,650 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2018-01-28 10:00:09,650 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: web.port, 8081
2018-01-28 10:00:10,003 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-01-28 10:00:10,069 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to manager (auth:SIMPLE)
2018-01-28 10:00:10,377 INFO org.apache.flink.yarn.YarnClusterDescriptor - Cluster specification: ClusterSpecification{masterMemoryMB=768, taskManagerMemoryMB=768, numberTaskManagers=2, slotsPerTaskManager=3}
2018-01-28 10:00:10,747 WARN org.apache.flink.yarn.YarnClusterDescriptor - The configuration directory ('/opt/flink/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
2018-01-28 10:00:10,751 INFO org.apache.flink.yarn.Utils - Copying from file:/opt/flink/conf/log4j.properties to hdfs://ha-cluster/user/manager/.flink/application_1517118829753_0002/log4j.properties
2018-01-28 10:00:11,123 INFO org.apache.flink.yarn.Utils - Copying from file:/opt/flink/lib/log4j-1.2.17.jar to hdfs://ha-cluster/user/manager/.flink/application_1517118829753_0002/lib/log4j-1.2.17.jar
2018-01-28 10:00:11,384 INFO org.apache.flink.yarn.Utils - Copying from file:/opt/flink/lib/flink-dist_2.11-1.4.0.jar to hdfs://ha-cluster/user/manager/.flink/application_1517118829753_0002/lib/flink-dist_2.11-1.4.0.jar
2018-01-28 10:00:30,986 INFO org.apache.flink.yarn.Utils - Copying from file:/opt/flink/lib/flink-shaded-hadoop2-uber-1.4.0.jar to hdfs://ha-cluster/user/manager/.flink/application_1517118829753_0002/lib/flink-shaded-hadoop2-uber-1.4.0.jar
2018-01-28 10:00:40,852 INFO org.apache.flink.yarn.Utils - Copying from file:/opt/flink/lib/flink-python_2.11-1.4.0.jar to hdfs://ha-cluster/user/manager/.flink/application_1517118829753_0002/lib/flink-python_2.11-1.4.0.jar
2018-01-28 10:00:41,017 INFO org.apache.flink.yarn.Utils - Copying from file:/opt/flink/lib/slf4j-log4j12-1.7.7.jar to hdfs://ha-cluster/user/manager/.flink/application_1517118829753_0002/lib/slf4j-log4j12-1.7.7.jar
2018-01-28 10:00:41,250 INFO org.apache.flink.yarn.Utils - Copying from file:/opt/flink/conf/logback.xml to hdfs://ha-cluster/user/manager/.flink/application_1517118829753_0002/logback.xml
2018-01-28 10:00:41,386 INFO org.apache.flink.yarn.Utils - Copying from file:/opt/flink/lib/flink-dist_2.11-1.4.0.jar to hdfs://ha-cluster/user/manager/.flink/application_1517118829753_0002/flink-dist_2.11-1.4.0.jar
2018-01-28 10:01:02,966 INFO org.apache.flink.yarn.Utils - Copying from /tmp/application_1517118829753_0002-flink-conf.yaml285707454205346702.tmp to hdfs://ha-cluster/user/manager/.flink/application_1517118829753_0002/application_1517118829753_0002-flink-conf.yaml285707454205346702.tmp
2018-01-28 10:01:03,601 INFO org.apache.flink.yarn.YarnClusterDescriptor - Submitting application master application_1517118829753_0002
2018-01-28 10:01:03,782 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1517118829753_0002
2018-01-28 10:01:03,783 INFO org.apache.flink.yarn.YarnClusterDescriptor - Waiting for the cluster to be allocated
2018-01-28 10:01:03,796 INFO org.apache.flink.yarn.YarnClusterDescriptor - Deploying cluster, current state ACCEPTED
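(While the session sits in ACCEPTED like this, the scheduler's view of the application and the queues can be inspected from the edge node; the ResourceManager host below is a placeholder:)
# what YARN reports for the stuck application
yarn application -status application_1517118829753_0002
# queue and label capacities as the CapacityScheduler actually computed them
curl -s "http://<resourcemanager-host>:8088/ws/v1/cluster/scheduler"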
What is the problem?
Editing the capacity-scheduler.xml solved the problem:
<!-- configuration of queue-root -->
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>streamQ,onlineQ</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.accessible-node-labels</name>
<value>*</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.accessible-node-labels.stream.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.accessible-node-labels.online.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default-node-label-expression</name>
<value>*</value>
</property>
<!-- configuration of queue-streamQ -->
<property>
<name>yarn.scheduler.capacity.root.streamQ.capacity</name>
<value>50</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.accessible-node-labels</name>
<value>stream</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.accessible-node-labels.stream.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.accessible-node-labels.online.capacity</name>
<value>0</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.default-node-label-expression</name>
<value>stream</value>
</property>
<!-- configuration of queue-onlineQ -->
<property>
<name>yarn.scheduler.capacity.root.onlineQ.capacity</name>
<value>50</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.accessible-node-labels</name>
<value>online</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.accessible-node-labels.online.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.accessible-node-labels.stream.capacity</name>
<value>0</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.default-node-label-expression</name>
<value>online</value>
</property>
</configuration>
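After editing capacity-scheduler.xml, the new queue definitions can be applied without restarting the ResourceManager:
# tell the ResourceManager to re-read capacity-scheduler.xml
yarn rmadmin -refreshQueues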
Please check your Flink app logs to see if there is an issue when connecting to the YARN ResourceManager. I also encountered this issue when using Flink on YARN with HA. I am not sure if I was the only one.
Related
I am trying to deploy Hadoop using Apache Bigtop 3.1.1 with Puppet. The Hadoop version is 3.2.4.
The OS I am using is CentOS 7.
Deployment works fine without Kerberos, but with Kerberos the Hadoop DataNode does not start up. I tried to start it manually with
sudo systemctl start hadoop-hdfs-datanode.service and
this is the error it gives:
Nov 21 12:15:59 master.local hadoop-hdfs-datanode[13646]: ERROR: You must be a privileged user in order to run a secure service.
Nov 21 12:16:04 master.local hadoop-hdfs-datanode[13646]: Failed to start Hadoop datanode. Return value: 3[FAILED]
Nov 21 12:16:04 master.local systemd[1]: hadoop-hdfs-datanode.service: control process exited, code=exited status=3
Nov 21 12:16:04 master.local systemd[1]: Failed to start LSB: Hadoop datanode.
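(For context, that error usually means the DataNode is configured for privileged ports, as in the hdfs-site.xml below, but was not started as root via jsvc. A minimal hadoop-env.sh sketch of the variables a secure DataNode start normally expects in Hadoop 3; the jsvc path is an assumption and depends on the Bigtop packages:)
# hadoop-env.sh (sketch, adjust paths to the actual install)
export HDFS_DATANODE_SECURE_USER=hdfs   # unprivileged user the DataNode drops to after binding
export JSVC_HOME=/usr/lib/bigtop-utils  # jsvc is needed to bind ports below 1024 as root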
This is my site.yaml file
---
bigtop::hadoop_head_node: "master.local"
hadoop::hadoop_storage_dirs:
- /data/1
- /data/2
- /data/3
- /data/4
hadoop_cluster_node::cluster_components:
- hdfs
- spark
- hive
- tez
- sqoop
- zookeeper
- kafka
- livy
- oozie
- zeppelin
- solrcloud
- kerberos
- httpfs
bigtop::bigtop_repo_uri: "http://10.42.65.70:90/bigtop/3.1.1/rpm/"
# - "https://archive.apache.org/dist/bigtop/bigtop-3.1.1/repos/centos-7/"
hadoop::common_hdfs::hadoop_http_authentication_signature_secret: "FaztheBits123!"
# Kerberos
hadoop::hadoop_security_authentication: "kerberos"
kerberos::krb_site::domain: "bigtop.apache.org"
kerberos::krb_site::realm: "BIGTOP.APACHE.ORG"
kerberos::krb_site::kdc_server: "%{hiera('bigtop::hadoop_head_node')}"
kerberos::krb_site::kdc_port: "88"
kerberos::krb_site::admin_port: "749"
kerberos::krb_site::keytab_export_dir: "/var/lib/bigtop_keytabs"
hadoop::common_hdfs::hadoop_http_authentication_type: "%{hiera('hadoop::hadoop_security_authentication')}"
# to enable tez in hadoop, uncomment the lines below
hadoop::common::use_tez: true
hadoop::common_mapred_app::mapreduce_framework_name: "yarn-tez"
# to enable tez in hive, uncomment the lines below
hadoop_hive::common_config::hive_execution_engine: "tez"
I tried changing workers in /etc/hadoop/conf/workers with proper hostnames. My core-site.xml is this:
<configuration>
<property>
<!-- URI of NN. Fully qualified. No IP.-->
<name>fs.defaultFS</name>
<value>hdfs://master.local:8020</value>
</property>
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>
<property>
<name>hadoop.proxyuser.hive.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hive.groups</name>
<value>hudson,testuser,root,hadoop,jenkins,oozie,hive,httpfs,users</value>
</property>
<property>
<name>hadoop.proxyuser.httpfs.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.httpfs.groups</name>
<value>hudson,testuser,root,hadoop,jenkins,oozie,hive,httpfs,users</value>
</property>
<property>
<name>hadoop.proxyuser.oozie.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.oozie.groups</name>
<value>hudson,testuser,root,hadoop,jenkins,oozie,hive,httpfs,users</value>
</property>
<!-- enable proper authentication instead of static mock authentication as
Dr. Who -->
<property>
<name>hadoop.http.filter.initializers</name>
<value>org.apache.hadoop.security.AuthenticationFilterInitializer</value>
</property>
<!-- disable anonymous access -->
<property>
<name>hadoop.http.authentication.simple.anonymous.allowed</name>
<value>false</value>
</property>
<!-- enable kerberos authentication -->
<property>
<name>hadoop.http.authentication.type</name>
<value>kerberos</value>
</property>
<property>
<name>hadoop.http.authentication.kerberos.principal</name>
<value>HTTP/_HOST#BIGTOP.APACHE.ORG</value>
</property>
<property>
<name>hadoop.http.authentication.kerberos.keytab</name>
<value>/etc/HTTP.keytab</value>
</property>
<!-- provide secret for cross-service-cross-machine cookie -->
<property>
<name>hadoop.http.authentication.signature.secret.file</name>
<value>/etc/hadoop/conf/hadoop-http-authentication-signature-secret</value>
</property>
<!-- make all services on all hosts use the same cookie domain -->
<property>
<name>hadoop.http.authentication.cookie.domain</name>
<value>local</value>
</property>
<property>
<name>hadoop.security.key.provider.path</name>
<value>kms://http#master.local:9600/kms</value>
</property>
</configuration>
My hdfs-site.xml is:
<configuration>
<!-- non HA -->
<property>
<name>dfs.namenode.rpc-address</name>
<value>master.local:8020</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>master.local:50070</value>
</property>
<property>
<name>dfs.namenode.https-address</name>
<value>master.local:50470</value>
</property>
<property>
<name>dfs.block.access.token.enable</name>
<value>true</value>
</property>
<!-- NameNode security config -->
<property>
<name>dfs.https.address</name>
<value>master.local:50475</value>
</property>
<property>
<name>dfs.https.port</name>
<value>50475</value>
</property>
<property>
<name>dfs.namenode.keytab.file</name>
<value>/etc/hdfs.keytab</value> <!-- path to the HDFS keytab -->
</property>
<property>
<name>dfs.namenode.kerberos.principal</name>
<value>hdfs/_HOST#BIGTOP.APACHE.ORG</value>
</property>
<property>
<name>dfs.namenode.kerberos.https.principal</name>
<value>host/_HOST#BIGTOP.APACHE.ORG</value>
</property>
<property>
<name>dfs.web.authentication.kerberos.keytab</name>
<value>/etc/hdfs.keytab</value> <!-- path to the HDFS keytab -->
</property>
<property>
<name>dfs.web.authentication.kerberos.principal</name>
<value>HTTP/_HOST#BIGTOP.APACHE.ORG</value>
</property>
<!-- Secondary NameNode security config -->
<property>
<name>dfs.secondary.http.address</name>
<value>master.local:0</value>
</property>
<property>
<name>dfs.secondary.https.address</name>
<value>master.local:50495</value>
</property>
<property>
<name>dfs.secondary.https.port</name>
<value>50495</value>
</property>
<property>
<name>dfs.secondary.namenode.keytab.file</name>
<value>/etc/hdfs.keytab</value> <!-- path to the HDFS keytab -->
</property>
<property>
<name>dfs.secondary.namenode.kerberos.principal</name>
<value>hdfs/_HOST#BIGTOP.APACHE.ORG</value>
</property>
<property>
<name>dfs.secondary.namenode.kerberos.https.principal</name>
<value>host/_HOST#BIGTOP.APACHE.ORG</value>
</property>
<!-- DataNode security config -->
<property>
<name>dfs.datanode.data.dir.perm</name>
<value>700</value>
</property>
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:1004</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:1006</value>
</property>
<property>
<name>dfs.datanode.keytab.file</name>
<value>/etc/hdfs.keytab</value> <!-- path to the HDFS keytab -->
</property>
<property>
<name>dfs.datanode.kerberos.principal</name>
<value>hdfs/_HOST#BIGTOP.APACHE.ORG</value>
</property>
<property>
<name>dfs.datanode.kerberos.https.principal</name>
<value>host/_HOST#BIGTOP.APACHE.ORG</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///data/1/hdfs,file:///data/2/hdfs,file:///data/3/hdfs,file:///data/4/hdfs</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///data/1/namenode,file:///data/2/namenode,file:///data/3/namenode,file:///data/4/namenode</value>
</property>
<property>
<name>dfs.permissions.superusergroup</name>
<value>hadoop</value>
<description>The name of the group of super-users.</description>
</property>
<!-- increase the number of datanode transceivers way above the default of 256
- this is for hbase -->
<property>
<name>dfs.datanode.max.xcievers</name>
<value>4096</value>
</property>
<!-- Configurations for large cluster -->
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
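(For reference, the keytab and principals the DataNode points at can be sanity-checked on the DataNode host; a sketch, with the path and realm taken from the config above:)
# list the principals stored in the keytab referenced by dfs.datanode.keytab.file
klist -kt /etc/hdfs.keytab
# try to authenticate as the DataNode principal for this host
kinit -kt /etc/hdfs.keytab hdfs/$(hostname -f)@BIGTOP.APACHE.ORG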
I can't seem to find a log for the DataNode. I can find logs for the NameNode in
/var/log/hadoop-hdfs, but no DataNode logs.
What am I doing wrong?
In Hadoop 3.1.0 the NameNode is working, but the DataNode is not, showing the message below:
STARTUP_MSG: build = https://github.com/apache/hadoop -r 16b70619a24cdcf5d3b0fcf4b58ca77238ccbe6d; compiled by 'centos' on 2018-03-30T00:00Z
STARTUP_MSG: java = 1.8.0_231
************************************************************/
2019-11-13 20:58:38,398 INFO checker.ThrottledAsyncChecker: Scheduling a check for [DISK]file:/C:/Appliacation/hadoop-3.1.0/data/datanode
2019-11-13 20:58:38,436 WARN checker.StorageLocationChecker: Exception checking StorageLocation [DISK]file:/C:/Appliacation/hadoop-3.1.0/data/datanode
java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$POSIX.stat(Ljava/lang/String;)Lorg/apache/hadoop/io/nativeio/NativeIO$POSIX$Stat;
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.stat(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.getStat(NativeIO.java:455)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfoByNativeIO(RawLocalFileSystem.java:796)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:710)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:678)
at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsAndPermissionCheck(DiskChecker.java:191)
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:98)
at org.apache.hadoop.hdfs.server.datanode.StorageLocation.check(StorageLocation.java:239)
at org.apache.hadoop.hdfs.server.datanode.StorageLocation.check(StorageLocation.java:52)
at org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker$1.call(ThrottledAsyncChecker.java:142)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-11-13 20:58:38,436 ERROR datanode.DataNode: Exception in secureMain
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 0, volumes configured: 1, volumes failed: 1, volume failures tolerated: 0
at org.apache.hadoop.hdfs.server.datanode.checker.StorageLocationChecker.check(StorageLocationChecker.java:220)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2762)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2677)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2719)
at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2863)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:2887)
2019-11-13 20:58:38,436 INFO util.ExitUtil: Exiting with status 1: org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 0, volumes configured: 1, volumes failed: 1, volume failures tolerated: 0
2019-11-13 20:58:38,451 INFO datanode.DataNode: SHUTDOWN_MSG:
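(That UnsatisfiedLinkError in NativeIO usually means the Windows native binaries, hadoop.dll and winutils.exe, do not match the Hadoop version, which is what the answer below addresses by replacing binaries. Whether the native library loads at all can be checked with this sketch:)
# shows which native libraries this installation can actually load
hadoop checknative -a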
I had the same issue. I had to replace some binaries in the bin folder (reference: Hadoop-3.1.2: Datanode and Nodemanager shuts down), and I also made some changes in a few configuration files, as follows:
1. Edit file [core-site.xml]
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://0.0.0.0:19000</value>
</property>
</configuration>
2. Edit file hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.dir</name>
<value>file:///C:/hadoop-3.1.0/data/namenode</value>
</property>
<property>
<name>dfs.datanode.dir</name>
<value>file:///C:/hadoop-3.1.0/data/datanode</value>
</property>
</configuration>
3. Edit file workers
localhost
4. Edit file mapred-site.xml
<configuration>
<property>
<name>mapreduce.job.user.name</name>
<value>%USERNAME%</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.apps.stagingDir</name>
<value>/user/%USERNAME%/staging</value>
</property>
<property>
<name>mapreduce.jobtracker.address</name>
<value>local</value>
</property>
</configuration>
5. Edit file yarn-site.xml
<configuration>
<property>
<name>yarn.server.resourcemanager.address</name>
<value>0.0.0.0:8020</value>
</property>
<property>
<name>yarn.server.resourcemanager.application.expiry.interval</name>
<value>60000</value>
</property>
<property>
<name>yarn.server.nodemanager.address</name>
<value>0.0.0.0:45454</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.server.nodemanager.remote-app-log-dir</name>
<value>/app-logs</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/dep/logs/userlogs</value>
</property>
<property>
<name>yarn.server.mapreduce-appmanager.attempt-listener.bindAddress</name>
<value>0.0.0.0</value>
</property>
<property>
<name>yarn.server.mapreduce-appmanager.client-service.bindAddress</name>
<value>0.0.0.0</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>-1</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>%HADOOP_CONF_DIR%,%HADOOP_COMMON_HOME%/share/hadoop/common/*,%HADOOP_COMMON_HOME%/share/hadoop/common/lib/*,%HADOOP_HDFS_HOME%/share/hadoop/hdfs/*,%HADOOP_HDFS_HOME%/share/hadoop/hdfs/lib/*,%HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/*,%HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/lib/*,%HADOOP_YARN_HOME%/share/hadoop/yarn/*,%HADOOP_YARN_HOME%/share/hadoop/yarn/lib/*</value>
</property>
</configuration>
Copy the HDFS metadata from the active NameNode to the standby NameNode:
bin/hdfs namenode -bootstrapStandby
19/09/01 02:40:38 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
19/09/01 02:40:38 INFO namenode.NameNode: createNameNode [-bootstrapStandby]
19/09/01 02:40:39 WARN common.Util: Path /home/kenny/hadoop-2.7.3/namenode should be specified as a URI in configuration files. Please update hdfs configuration.
19/09/01 02:40:39 WARN common.Util: Path /home/kenny/hadoop-2.7.3/namenode should be specified as a URI in configuration files. Please update hdfs configuration.
19/09/01 02:40:41 INFO ipc.Client: Retrying connect to server: node-master-2/192.168.1.170:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
19/09/01 02:40:42 INFO ipc.Client: Retrying connect to server: node-master-2/192.168.1.170:9000. Already tried 1 time(s); retry policy is RetryUpToMaximumCoun
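(For reference, -bootstrapStandby needs the other NameNode to be formatted and already running; its addresses and state can be double-checked first, with service IDs taken from dfs.ha.namenodes.ha-cluster in the config below:)
# which NameNodes does the client configuration know about?
hdfs getconf -namenodes
# is each NameNode up, and which one is active?
hdfs haadmin -getServiceState node-master
hdfs haadmin -getServiceState node-master-2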
Core-site setting
core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://ha-cluster</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/home/kenny/hadoop-2.7.3/jn</value>
</property>
hdfs-site setting
hdfs-site.xml
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/kenny/hadoop-2.7.3/namenode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permission</name>
<value>false</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>ha-cluster</value>
</property>
<property>
<name>dfs.ha.namenodes.ha-cluster</name>
<value>node-master,node-master-2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ha-cluster.node-master</name>
<value>node-master:9000</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ha-cluster.node-master-2</name>
<value>node-master-2:9000</value>
</property>
<property>
<name>dfs.namenode.http-address.ha-cluster.node-master</name>
<value>node-master:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.ha-cluster.node-master-2</name>
<value>node-master-2:50070</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://node-master:8485;node-master-2:8485;node1:8485/ha-cluster</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.ha-cluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>node-master:2181,node-master-2:2181,node1:2181 </value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/kenny/.ssh/id_rsa</value>
</property>
I tried to run a simple word count as a MapReduce job. Everything works fine when run locally (all work done on the NameNode), but when I try to run it on a cluster using YARN (adding mapreduce.framework.name=yarn to mapred-site.xml), the job hangs.
I came across a similar problem here:
MapReduce jobs get stuck in Accepted state
Output from job:
*** START ***
15/12/25 17:52:50 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/12/25 17:52:51 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/12/25 17:52:51 INFO input.FileInputFormat: Total input paths to process : 5
15/12/25 17:52:52 INFO mapreduce.JobSubmitter: number of splits:5
15/12/25 17:52:52 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1451083949804_0001
15/12/25 17:52:53 INFO impl.YarnClientImpl: Submitted application application_1451083949804_0001
15/12/25 17:52:53 INFO mapreduce.Job: The url to track the job: http://hadoop-droplet:8088/proxy/application_1451083949804_0001/
15/12/25 17:52:53 INFO mapreduce.Job: Running job: job_1451083949804_0001
mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.job.tracker</name>
<value>localhost:54311</value>
</property>
<!--
<property>
<name>mapreduce.job.tracker.reserved.physicalmemory.mb</name>
<value></value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1024</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>3000</value>
<source>mapred-site.xml</source>
</property> -->
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<!--
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>3000</value>
<source>yarn-site.xml</source>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>500</value>
</property>
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>3000</value>
</property>
-->
</configuration>
// I left the commented options in - they did not solve the problem
YarnApplicationState: ACCEPTED: waiting for AM container to be allocated, launched and register with RM.
What can be the problem?
EDIT:
I tried this configuration (commented out above) on these machines: NameNode (8 GB RAM) + 2x DataNode (4 GB RAM). I get the same effect: the job hangs in the ACCEPTED state.
EDIT2:
changed the configuration (thanks @Manjunath Ballur) to:
yarn-site.xml:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-droplet</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoop-droplet:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop-droplet:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoop-droplet:8030</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoop-droplet:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoop-droplet:8088</value>
</property>
<property>
<description>Classpath for typical applications.</description>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$YARN_HOME/*,$YARN_HOME/lib/*
</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/data/1/yarn/local,/data/2/yarn/local,/data/3/yarn/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/data/1/yarn/logs,/data/2/yarn/logs,/data/3/yarn/logs</value>
</property>
<property>
<description>Where to aggregate logs</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/var/log/hadoop-yarn/apps</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>50</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>390</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>390</value>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>50</value>
</property>
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx40m</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>50</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>50</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx40m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx40m</value>
</property>
</configuration>
Still not working.
Additional info: I can see no nodes on the cluster preview (similar problem here: Slave nodes not in Yarn ResourceManager)
You should check the status of the NodeManagers in your cluster. If the NM nodes are short on disk space, the RM will mark them "unhealthy" and those NMs can't allocate new containers.
1) Check the Unhealthy nodes: http://<active_RM>:8088/cluster/nodes/unhealthy
If the "health report" tab says "local-dirs are bad" then it means you need to cleanup some disk space from these nodes.
2) Check the DFS dfs.data.dir property in hdfs-site.xml. It points the location on local file system where hdfs data is stored.
3) Login to those machines and use df -h & hadoop fs - du -h commands to measure the space occupied.
4) Verify hadoop trash and delete it if it's blocking you.
hadoop fs -du -h /user/user_name/.Trash and hadoop fs -rm -r /user/user_name/.Trash/*
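The same node health information is also available from the command line, which helps when the web UI is not reachable; a short sketch:
# list every NodeManager together with its state and health report
yarn node -list -all
# detailed report for a single node (use a node ID printed by the previous command)
yarn node -status <node-id>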
I feel you are getting your memory settings wrong.
To understand the tuning of YARN configuration, I found this to be a very good source: http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_yarn_tuning.html
I followed the instructions given in this blog and was able to get my jobs running. You should alter your settings proportional to the physical memory you have on your nodes.
Key things to remember are:
The values of mapreduce.map.memory.mb and mapreduce.reduce.memory.mb should be at least yarn.scheduler.minimum-allocation-mb.
The values of mapreduce.map.java.opts and mapreduce.reduce.java.opts should be around 0.8 times the corresponding mapreduce.map.memory.mb and mapreduce.reduce.memory.mb values. (In my case it is 983 MB ~ (0.8 * 1228 MB).)
Similarly, the value of yarn.app.mapreduce.am.command-opts should be around 0.8 times the value of yarn.app.mapreduce.am.resource.mb; see the quick check below.
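A quick way to derive the heap sizes from the container sizes, applying nothing but the 0.8 rule above (the container size is the one used in the settings below):
# container size in MB; heap is roughly 0.8 * container size
CONTAINER_MB=1228
echo "-Xmx$((CONTAINER_MB * 8 / 10))m"   # prints -Xmx982m, i.e. roughly the -Xmx983m used below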
Following are the settings I use and they work perfectly for me:
yarn-site.xml:
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1228</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>9830</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>9830</value>
</property>
mapred-site.xml
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>1228</value>
</property>
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx983m</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1228</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>1228</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx983m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx983m</value>
</property>
You can also refer to the answer here: Yarn container understanding and tuning
You can add vCore settings, if you want your container allocation to take into account CPU also. But, for this to work, you need to use CapacityScheduler with DominantResourceCalculator. See the discussion about this here: How are containers created based on vcores and memory in MapReduce2?
This solved this error in my case:
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>100</value>
</property>
Check your hosts file on the master and slave nodes. I had exactly this problem. My hosts file looked like this on the master node, for example:
127.0.0.0 localhost
127.0.1.1 master-virtualbox
192.168.15.101 master
I changed it like below
192.168.15.101 master master-virtualbox localhost
So it worked.
These lines
<property>
<name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
<value>100</value>
</property>
in yarn-site.xml solved my problem, since the node will be marked as unhealthy when disk usage is >= 95%. The solution is mainly suitable for pseudo-distributed mode.
You have 512 MB RAM on each of the instances, and all your memory configurations in yarn-site.xml and mapred-site.xml are between 500 MB and 3 GB. You will not be able to run anything on the cluster. Change everything to ~256 MB.
Also, your mapred-site.xml sets the framework to yarn, but you also have a job tracker address, which is not correct. On a multi-node cluster you need the ResourceManager-related parameters in yarn-site.xml (including the ResourceManager web address). Without those, the nodes do not know where your ResourceManager is.
You need to revisit both of your XML files.
Anyway, that works for me. Thank you a lot, @KaP!
That's my yarn-site.xml:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>MacdeMacBook-Pro.local</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>${yarn.resourcemanager.hostname}:8088</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
That's my mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
The first thing to do is check the YARN ResourceManager logs. I searched the Internet about this problem for a very long time, but nobody told me how to find out what is really happening. Checking the ResourceManager logs is straightforward and simple; I am confused why people ignore the logs.
For me, there was an error in the log:
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=172.16.0.167/172.16.0.167:55622]
That's because I switched Wi-Fi networks at my workplace, so my computer's IP changed.
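Where exactly the ResourceManager daemon log lives depends on the installation, but it is usually a file named after the daemon under the Hadoop log directory; a sketch (the paths are assumptions):
# typical location; adjust if HADOOP_LOG_DIR is set differently on your install
tail -f ${HADOOP_LOG_DIR:-/var/log/hadoop-yarn}/*resourcemanager*.log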
Old question, but I ran into the same issue recently, and in my case it was due to manually setting the master to local in the code.
Please search for conf.setMaster("local[*]") and remove it.
Hope it helps.
I have installed all components of Cloudera 5 on one machine: NameNode, DataNode, Hue, Pig, Oozie, YARN, HBase, ...
I run the following pig script in shell:
sudo -u hdfs pig
and then in the pig shell run
data = LOAD '/user/test/text.txt' as (text:CHARARRAY) ;
DUMP data;
The script works well.
But when I run this script in the Hue browser's Query Editor / Pig Editor, it gets stuck, and below is the log:
2015-09-14 14:07:06,847 [uber-SubtaskRunner] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://HadoopTestEnv:50030/jobdetails.jsp?jobid=job_1442214247855_0002
2015-09-14 14:07:06,884 [uber-SubtaskRunner] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2015-09-14 14:07:07,512 [communication thread] INFO org.apache.hadoop.mapred.TaskAttemptListenerImpl - Progress of TaskAttempt attempt_1442214247855_0001_m_000000_0 is : 1.0
Heart beat
2015-09-14 14:07:37,545 [communication thread] INFO org.apache.hadoop.mapred.TaskAttemptListenerImpl - Progress of TaskAttempt attempt_1442214247855_0001_m_000000_0 is : 1.0
Heart beat
2015-09-14 14:08:07,571 [communication thread] INFO org.apache.hadoop.mapred.TaskAttemptListenerImpl - Progress of TaskAttempt attempt_1442214247855_0001_m_000000_0 is : 1.0
Heart beat
I have used the yarn-utils script to help me configure yarn-site.xml and mapred-site.xml:
python yarn-utils.py -c 6 -m 16 -d 1 -k True
Using cores=4 memory=16GB disks=1 hbase=True
Profile: cores=6 memory=12288MB reserved=4GB usableMem=12GB disks=1
Num Container=3
Container Ram=4096MB
Used Ram=12GB
Unused Ram=4GB
yarn.scheduler.minimum-allocation-mb=4096
yarn.scheduler.maximum-allocation-mb=12288
yarn.nodemanager.resource.memory-mb=12288
mapreduce.map.memory.mb=2048
mapreduce.map.java.opts=-Xmx1638m
mapreduce.reduce.memory.mb=4096
mapreduce.reduce.java.opts=-Xmx3276m
yarn.app.mapreduce.am.resource.mb=2048
yarn.app.mapreduce.am.command-opts=-Xmx1638m
mapreduce.task.io.sort.mb=819
The script still hangs and heartbeats forever. Can anyone help me, please?
Here is my config:
yarn-site.xml
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>12288</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>6</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>12288</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx1638m</value>
</property>
Mapred-site.xml
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx768m</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1638m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx3276m</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>819</value>
</property>
<property>
<name>mapreduce.map.cpu.vcores</name>
<value>2</value>
</property>
<property>
<name>mapreduce.reduce.cpu.vcores</name>
<value>2</value>
</property>
The Pig app submits an Oozie job that will use one MR slot in addition to what the script does.
The locking is usually due to submission deadlocks like gotcha #5, or to having only one available task slot in your cluster.
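One way to spot that kind of deadlock is to list what is actually occupying the queue while the script hangs; a sketch:
# if the Oozie/Pig launcher job is RUNNING while the real MR job stays in ACCEPTED,
# the launcher is holding the only available slot(s)
yarn application -list -appStates RUNNING,ACCEPTED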