HADOOP YARN - Application is added to the scheduler and is not yet activated. Skipping AM assignment as cluster resource is empty - hadoop

I am evaluating YARN for a project. I am trying to get the simple distributed shell example to work. I have gotten the application to the SUBMITTED phase, but it never starts. This is the information reported from this line:
ApplicationReport report = yarnClient.getApplicationReport(appId);
Application is added to the scheduler and is not yet activated. Skipping AM assignment as cluster resource is empty. Details : AM Partition = DEFAULT_PARTITION; AM Resource Request = memory:1024, vCores:1; Queue Resource Limit for AM = memory:0, vCores:0; User AM Resource Limit of the queue = memory:0, vCores:0; Queue AM Resource Usage = memory:128, vCores:1;
The solution that worked for other developers seems to be increasing yarn.scheduler.capacity.maximum-am-resource-percent in the yarn-site.xml file from its default value of 0.1. I have tried values of 0.2 and 0.5, but it does not seem to help.
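To make the diagnostics easier to read, here is a minimal Python sketch that splits the message into its resource fields. The function names are made up for illustration; the message text is the one quoted above. It highlights that "Queue Resource Limit for AM" is memory:0, vCores:0, which fits the "cluster resource is empty" wording: if the scheduler sees no resources at all, any percentage of them is still zero, so raising maximum-am-resource-percent alone would not be expected to help.

```python
import re

# Diagnostics string copied from the question above.
DIAGNOSTICS = (
    "Application is added to the scheduler and is not yet activated. "
    "Skipping AM assignment as cluster resource is empty. Details : "
    "AM Partition = DEFAULT_PARTITION; "
    "AM Resource Request = memory:1024, vCores:1; "
    "Queue Resource Limit for AM = memory:0, vCores:0; "
    "User AM Resource Limit of the queue = memory:0, vCores:0; "
    "Queue AM Resource Usage = memory:128, vCores:1;"
)

RESOURCE = re.compile(r"\s*memory:(\d+), vCores:(\d+)\s*")


def parse_am_diagnostics(text):
    """Map each 'label = memory:N, vCores:M' entry to a (memory, vcores) tuple."""
    details = text.split("Details :", 1)[-1]
    resources = {}
    for part in details.split(";"):
        if "=" not in part:
            continue
        label, value = part.split("=", 1)
        match = RESOURCE.fullmatch(value)
        if match:  # skips non-resource entries such as the AM partition
            resources[label.strip()] = (int(match.group(1)), int(match.group(2)))
    return resources


def am_limit_is_zero(text):
    """True when the queue's AM limit is reported as zero memory and zero vCores."""
    return parse_am_diagnostics(text).get("Queue Resource Limit for AM") == (0, 0)


if __name__ == "__main__":
    for label, (memory, vcores) in parse_am_diagnostics(DIAGNOSTICS).items():
        print(f"{label}: memory={memory}, vCores={vcores}")
    print("AM limit is zero:", am_limit_is_zero(DIAGNOSTICS))
```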

It looks like the RAM allocated to YARN is not configured properly. This can be a pain if you try to infer or adapt settings from tutorials for your own installation. I would strongly recommend using a tool such as this one:
wget http://public-repo-1.hortonworks.com/HDP/tools/2.6.0.3/hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
tar zxvf hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
rm hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
mv hdp_manual_install_rpm_helper_files-2.6.0.3.8/ hdp_conf_files
python hdp_conf_files/scripts/yarn-utils.py -c 4 -m 8 -d 1 false
-c number of cores you have for each node
-m amount of memory you have for each node (in GB)
-d number of disks you have for each node
-bool "True" if HBase is installed; "False" if not
This should give you something like:
Using cores=4 memory=8GB disks=1 hbase=True
Profile: cores=4 memory=5120MB reserved=3GB usableMem=5GB disks=1
Num Container=3
Container Ram=1536MB
Used Ram=4GB
Unused Ram=3GB
yarn.scheduler.minimum-allocation-mb=1536
yarn.scheduler.maximum-allocation-mb=4608
yarn.nodemanager.resource.memory-mb=4608
mapreduce.map.memory.mb=1536
mapreduce.map.java.opts=-Xmx1228m
mapreduce.reduce.memory.mb=3072
mapreduce.reduce.java.opts=-Xmx2457m
yarn.app.mapreduce.am.resource.mb=3072
yarn.app.mapreduce.am.command-opts=-Xmx2457m
mapreduce.task.io.sort.mb=614
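The relationships between these numbers can be checked with a small Python sketch. The ratios below (reduce = 2 x container RAM, heap = 80% of the container, sort buffer = 40% of the map container) are only inferred from the sample output above; they are an assumption, not a statement of how yarn-utils.py computes its values for other inputs.

```python
def derived_settings(num_containers, container_ram_mb):
    """Rebuild the suggested settings from 'Num Container' and 'Container Ram'.

    The ratios are inferred from the sample output (an assumption), using
    integer arithmetic so the rounding matches the numbers shown.
    """
    map_mb = container_ram_mb
    reduce_mb = 2 * container_ram_mb
    am_mb = reduce_mb
    total_mb = num_containers * container_ram_mb
    return {
        "yarn.scheduler.minimum-allocation-mb": container_ram_mb,
        "yarn.scheduler.maximum-allocation-mb": total_mb,
        "yarn.nodemanager.resource.memory-mb": total_mb,
        "mapreduce.map.memory.mb": map_mb,
        "mapreduce.map.java.opts": f"-Xmx{map_mb * 8 // 10}m",
        "mapreduce.reduce.memory.mb": reduce_mb,
        "mapreduce.reduce.java.opts": f"-Xmx{reduce_mb * 8 // 10}m",
        "yarn.app.mapreduce.am.resource.mb": am_mb,
        "yarn.app.mapreduce.am.command-opts": f"-Xmx{am_mb * 8 // 10}m",
        "mapreduce.task.io.sort.mb": map_mb * 4 // 10,
    }


if __name__ == "__main__":
    # Values taken from the sample output: Num Container=3, Container Ram=1536MB
    for name, value in derived_settings(3, 1536).items():
        print(f"{name}={value}")
```

The point to take away is that yarn.nodemanager.resource.memory-mb must be large enough to hold at least one AM container, otherwise the scheduler has nothing to hand out.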
Edit your yarn-site.xml and mapred-site.xml accordingly.
nano ~/hadoop/etc/hadoop/yarn-site.xml
nano ~/hadoop/etc/hadoop/mapred-site.xml
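If you would rather not retype the values by hand, a short Python sketch like the following turns the key=value lines printed by the tool into property blocks you can paste into the files. The helper names are hypothetical, and the prefix filter is an assumption about which output lines are settings; you still have to decide which file each property belongs in.

```python
from xml.sax.saxutils import escape

# Only lines starting with these prefixes are treated as settings (assumption).
SETTING_PREFIXES = ("yarn.", "mapreduce.")


def parse_key_values(text):
    """Collect name=value pairs, ignoring summary lines such as 'Num Container=3'."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(SETTING_PREFIXES) and "=" in line:
            name, value = line.split("=", 1)  # values like -Xmx1228m keep their text
            settings[name] = value
    return settings


def to_property_xml(settings):
    """Render each setting as a Hadoop-style <property> block."""
    lines = []
    for name, value in settings.items():
        lines += [
            "<property>",
            f"<name>{escape(name)}</name>",
            f"<value>{escape(str(value))}</value>",
            "</property>",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = """Num Container=3
yarn.scheduler.minimum-allocation-mb=1536
mapreduce.map.java.opts=-Xmx1228m"""
    print(to_property_xml(parse_key_values(sample)))
```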
Moreover, you should have this in your yarn-site.xml
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>name_of_your_master_node</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
and this in your mapred-site.xml:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Then, upload your conf files to each node using scp (if you have uploaded your SSH keys to each one):
for node in node1 node2 node3; do scp ~/hadoop/etc/hadoop/* $node:/home/hadoop/hadoop/etc/hadoop/; done
And then, restart yarn
stop-yarn.sh
start-yarn.sh
and check that you can see your nodes:
hadoop@master-node:~$ yarn node -list
18/06/01 12:51:33 INFO client.RMProxy: Connecting to ResourceManager at master-node/192.168.0.37:8032
Total Nodes:3
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
node3:34683 RUNNING node3:8042 0
node2:36467 RUNNING node2:8042 0
node1:38317 RUNNING node1:8042 0
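If you want to script this check, here is a minimal Python sketch that counts the RUNNING nodes in the text printed by yarn node -list. It assumes the four-column layout shown above; the function name is made up. An empty result would match the "cluster resource is empty" message, since no NodeManager has registered any resources.

```python
SAMPLE_OUTPUT = """\
18/06/01 12:51:33 INFO client.RMProxy: Connecting to ResourceManager at master-node/192.168.0.37:8032
Total Nodes:3
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
node3:34683 RUNNING node3:8042 0
node2:36467 RUNNING node2:8042 0
node1:38317 RUNNING node1:8042 0
"""


def running_nodes(output):
    """Return the Node-Id of every row whose Node-State column is RUNNING."""
    nodes = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows have exactly four columns; log and header lines do not match.
        if len(fields) == 4 and fields[1] == "RUNNING":
            nodes.append(fields[0])
    return nodes


if __name__ == "__main__":
    nodes = running_nodes(SAMPLE_OUTPUT)
    print(f"{len(nodes)} running node(s): {', '.join(nodes)}")
```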
This might fix the issue. Good luck!

Add the properties below to yarn-site.xml, then restart DFS and YARN:
<property>
<name>yarn.scheduler.capacity.root.support.user-limit-factor</name>
<value>2</value>
</property>
<property>
<name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name>
<value>0.0</value>
</property>
<property>
<name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
<value>100.0</value>
</property>

I got the same error and struggled to solve it. I realized the ResourceManager had no resources to allocate to the application master (AM) of the MapReduce application.
I opened http://localhost:8088/cluster/nodes/unhealthy in a browser and examined the health report of the unhealthy nodes (in my case there was only one). It warned that some log directories had filled up. I cleaned those directories, my node became healthy, and the application state switched from ACCEPTED to RUNNING. By default, YARN treats a node this way when its disk is more than 90% full, so you have to free up space until disk usage drops below 90%.
My exact health report was:
1/1 local-dirs usable space is below configured utilization percentage/no more usable space [ /tmp/hadoop-train/nm-local-dir : used space above threshold of 90.0% ] ;
1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /opt/manual/hadoop/logs/userlogs : used space above threshold of 90.0% ]
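To see ahead of time whether a directory is close to that 90% limit, a rough check can be done in Python with shutil.disk_usage. This is only an approximation under the assumption that used/total on the filesystem is close to what the NodeManager measures; its own check may compute the figure differently. The paths are the ones from my health report and are just examples.

```python
import shutil


def used_percent(total_bytes, used_bytes):
    """Percentage of the filesystem that is in use."""
    return 100.0 * used_bytes / total_bytes


def is_over_threshold(path, threshold=90.0):
    """True when the filesystem holding `path` is fuller than `threshold` percent."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return used_percent(usage.total, usage.used) > threshold


if __name__ == "__main__":
    # Example directories taken from the health report above; adjust to your setup.
    for directory in ("/tmp/hadoop-train/nm-local-dir", "/opt/manual/hadoop/logs/userlogs"):
        try:
            status = "over" if is_over_threshold(directory) else "under"
            print(f"{directory}: {status} the 90% threshold")
        except FileNotFoundError:
            print(f"{directory}: not found on this machine")
```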

Related

what would happen if nodes in hadoop change their IP address?

My Hadoop cluster does not work well because of the network conditions. What if I change the entire network, for example by switching to another router, so that the IP addresses change? Could the cluster still work after updating some configuration, or must I tear it down and rebuild everything?
Thanks in advance
It works once you change the IP addresses in the configuration. Why did you not use DNS?
OK, that was not a good answer. Let me apologize and give a better one.
If you need to change the configuration on a running cluster, you can decommission and recommission the data nodes.
Simply switching off a data node is not a good idea.
Data Node Decommissioning
The first step is to tell YARN that you are going to remove some nodes; then you have to tell HDFS (the name node) the same.
I don't know whether your system is configured for decommissioning. If it is, you have the key yarn.resourcemanager.nodes.exclude-path in yarn-site.xml and dfs.hosts.exclude in hdfs-site.xml:
hdfs-site.xml
<property>
<name>dfs.hosts.exclude</name>
<value>$YOUR_PATH/dfs.exclude</value>
<final>true</final>
</property>
yarn-site.xml
<property>
<name>yarn.resourcemanager.nodes.exclude-path</name>
<value>$YOUR_PATH/dfs.exclude</value>
<final>true</final>
</property>
Open the file $YOUR_PATH/dfs.exclude and add the hostnames or IP addresses of the nodes you need to stop.
Then execute:
yarn rmadmin -refreshNodes
hdfs dfsadmin -refreshNodes
Check in the web interface whether the data nodes are being decommissioned.
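The decommissioning steps can be scripted. The following Python sketch is one possible way to do it, not an official tool: the helper names are made up, the exclude file is assumed to hold one host per line, and the two refresh commands are the ones listed above.

```python
import subprocess
from pathlib import Path

# The two refresh commands from the steps above.
REFRESH_COMMANDS = [
    ["yarn", "rmadmin", "-refreshNodes"],
    ["hdfs", "dfsadmin", "-refreshNodes"],
]


def merge_hosts(existing, new_hosts):
    """Append new hosts to the existing list without creating duplicates."""
    merged = list(existing)
    for host in new_hosts:
        if host not in merged:
            merged.append(host)
    return merged


def add_to_exclude_file(path, hosts):
    """Write the merged host list back to the exclude file, one host per line."""
    exclude = Path(path)
    existing = exclude.read_text().split() if exclude.exists() else []
    merged = merge_hosts(existing, hosts)
    exclude.write_text("\n".join(merged) + "\n")
    return merged


def refresh_nodes(run=subprocess.run):
    """Ask YARN and HDFS to re-read their include/exclude files."""
    for command in REFRESH_COMMANDS:
        run(command, check=True)


if __name__ == "__main__":
    # Hypothetical path and host name; replace with your own values.
    add_to_exclude_file("dfs.exclude", ["node3"])
    # refresh_nodes()  # uncomment on a machine where yarn and hdfs are on the PATH
```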
Data Node Commissioning
This works the same way as decommissioning:
yarn-site.xml
<property>
<name>yarn.resourcemanager.nodes.include-path</name>
<value>$YOUR_PATH/dfs.include</value>
<final>true</final>
</property>
hdfs-site.xml
<property>
<name>dfs.hosts</name>
<value>$YOUR_PATH/dfs.include</value>
<final>true</final>
</property>
Open the file $YOUR_PATH/dfs.include and add the hostnames or IP addresses of the nodes you need to add, then execute:
yarn rmadmin -refreshNodes
hdfs dfsadmin -refreshNodes
Wait some time, then run:
hdfs dfsadmin -report
The hosts you added should now appear in the list.
If your configuration is missing the keys above, you need to restart HDFS (the name node) and YARN after adding them.
Using this procedure, you can halt data nodes safely.

Should I have to run history server in all nodes to get job history in Hadoop Cluster WebUI

I am facing one issue in Hadoop cluster. I have a Hadoop cluster with 5 datanodes and one edge/gateway node.
My issue is that I had to start the history server in each of those nodes (1 namenode and 5 datanodes) to get any job history from hadoop webUI for any submitted job.
I have added mapreduce.jobhistory.address and mapreduce.jobhistory.webapp.address in mapred-site.xml
But it's not working properly I guess.
If I start the history server only on the name node, or only on any other single node, the Hadoop cluster web UI is unable to show me the job history and ends up with an error.
My mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hadoopmaster:8021</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoopmaster:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoopmaster:19888</value>
</property>
</configuration>
For the time being, as a workaround, I start the history server on each node (the name node and all data nodes) manually, but I think this is not the right way.
Right now I have only 5 data nodes, so it is still feasible to start the history server on every node manually, but with many nodes (say 100 or 200) it will no longer be feasible. There should be some standard solution for this issue.
Please help me out if anyone knows how to resolve this issue.
Thanks in advance.
Finally I am able to solve the issue.
Actually, with mapreduce.jobhistory.address set, the history server is running on one node only (as jps shows).
It's working properly now...
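For anyone checking their own setup: both addresses in the mapred-site.xml above point at the same host, so a single history server on that host is what clients and the web UI are told to contact. A small Python sketch (helper names are made up, and the XML is a trimmed copy of the file from the question) can confirm which host the configuration names:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the mapred-site.xml shown in the question.
MAPRED_SITE = """
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoopmaster:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoopmaster:19888</value>
</property>
</configuration>
"""

HISTORY_KEYS = ("mapreduce.jobhistory.address", "mapreduce.jobhistory.webapp.address")


def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop config file."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def history_hosts(properties):
    """Hosts named by the job history settings (the part before the port)."""
    return {properties[key].split(":")[0] for key in HISTORY_KEYS if key in properties}


if __name__ == "__main__":
    print("History server expected on:", history_hosts(read_properties(MAPRED_SITE)))
```

If the set contains more than one host, or a host where no history server is running, the web UI links to job history would be expected to fail.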

hadoop no data node started

I am following this tutorial.
http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation
I got to this point and started the nodes.
Start NameNode daemon and DataNode daemon:
$ sbin/start-dfs.sh
But then when I run the next steps, it looks like no data node is running (as I get errors saying so).
Why is the data node down? And how can I fix this?
Here is the log from my data node.
hduser@test02:/usr/local/hadoop$ jps
3792 SecondaryNameNode
3929 Jps
3258 NameNode
hduser@test02:/usr/local/hadoop$ cat /usr/local/hadoop/logs/hadoop-hduser-datanode-test02.out
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /usr/local/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
-m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 3781
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
hduser@test02:/usr/local/hadoop$
EDIT:
Seems I had this port number wrong.
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
Now when I made it right (i.e. equal to 9000) I have no name node starting up.
hduser@test02:/usr/local/hadoop$ jps
10423 DataNode
10938 Jps
10703 SecondaryNameNode
and I cannot browse:
http://my-server-name:50070/
any more.
Hope this gives you some hint what is happening.
I am total beginner with Hadoop and kind of lost now.
[core-site.xml]
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/lib/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
</configuration>
[hdfs-site.xml]
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
In mapred-site.xml I have nothing.
1. First stop all the entities like namenode, datanode etc. (you will have some script or command to do that).
2. Format the tmp directory.
3. Go to /var/cache/hadoop-hdfs/hdfs/dfs/ and delete all the contents of the directory manually.
4. Now format your namenode again.
5. Start all the entities, then use the jps command to confirm that the datanode has been started.
6. Now run whichever application you like.
Hope this helps.
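To make the jps check repeatable, here is a small Python sketch that reports which HDFS daemons are missing from a jps listing. The expected set (NameNode, DataNode, SecondaryNameNode) is an assumption that fits a pseudo-distributed setup after start-dfs.sh; the sample text is the first jps output from the question.

```python
EXPECTED_DAEMONS = ("NameNode", "DataNode", "SecondaryNameNode")

# First jps listing from the question: the DataNode is absent.
JPS_OUTPUT = """\
3792 SecondaryNameNode
3929 Jps
3258 NameNode
"""


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemon names that do not appear in the jps output."""
    running = set()
    for line in jps_output.splitlines():
        fields = line.split()
        if len(fields) == 2:  # "<pid> <process name>"
            running.add(fields[1])
    return sorted(set(expected) - running)


if __name__ == "__main__":
    missing = missing_daemons(JPS_OUTPUT)
    print("Missing daemons:", ", ".join(missing) if missing else "none")
```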
Add this configuration
conf/core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/var/lib/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
stop hadoop
bin/stop-all.sh
change permission and remove temp directory data
chmod 755 /var/lib/hadoop/tmp
rm -Rf /var/lib/hadoop/tmp/*
format name node
bin/hadoop namenode -format
After 1 day of struggle, I just removed version 2.4 and installed Hadoop 2.2 (as I realized 2.2 is the latest stable version). Then I got it all working by following this nice tutorial.
http://codesfusion.blogspot.com/2013/10/setup-hadoop-2x-220-on-ubuntu.html?m=1
Something is not right with this document about 2.4 which I was reading.
Not to mention that it's not suitable for beginners, and it's usually beginners who stumble upon it.
Maybe the data on your slaves and on your master are not in sync. Delete the data and name folders in ./hadoop/hdfs and recreate them, re-format the namenode, then start dfs.

get "ERROR: Can't get master address from ZooKeeper; znode data == null" when using Hbase shell

I installed Hadoop2.2.0 and Hbase0.98.0 and here is what I do :
$ ./bin/start-hbase.sh
$ ./bin/hbase shell
2.0.0-p353 :001 > list
then I got this:
ERROR: Can't get master address from ZooKeeper; znode data == null
Why am I getting this error? Another question:
Do I need to run ./sbin/start-dfs.sh and ./sbin/start-yarn.sh before I run HBase?
Also, what are ./sbin/start-dfs.sh and ./sbin/start-yarn.sh used for?
Here are some of my configuration files:
hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://127.0.0.1:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.tmp.dir</name>
<value>/Users/apple/Documents/tools/hbase-tmpdir/hbase-data</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/Users/apple/Documents/tools/hbase-zookeeper/zookeeper</value>
</property>
</configuration>
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system.</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/Users/micmiu/tmp/hadoop</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>io.native.lib.available</name>
<value>false</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
If you just want to run HBase without going into Zookeeper management for standalone HBase, then remove all the property blocks from hbase-site.xml except the property block named hbase.rootdir.
Now run /bin/start-hbase.sh. HBase comes with its own Zookeeper, which gets started when you run /bin/start-hbase.sh, which will suffice if you are trying to get around things for the first time. Later you can put distributed mode configurations for Zookeeper.
You only need to run /sbin/start-dfs.sh to run HBase, since the value of hbase.rootdir is set to hdfs://127.0.0.1:9000/hbase in your hbase-site.xml. If you change it to a location on the local filesystem using file:///some_location_on_local_filesystem, then you don't even need to run /sbin/start-dfs.sh.
hdfs://127.0.0.1:9000/hbase says it's a place on HDFS and /sbin/start-dfs.sh starts namenode and datanode which provides underlying API to access the HDFS file system. For knowing about Yarn, please look at http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html.
This could also happen if the VM or the host machine is put to sleep, because ZooKeeper will not stay alive.
Restarting the VM should solve the problem.
You need to start ZooKeeper and then run the HBase shell:
{HBASE_HOME}/bin/hbase-daemons.sh {start,stop} zookeeper
and you may want to check this property in hbase-env.sh
# Tell HBase whether it should manage its own instance of Zookeeper or not.
export HBASE_MANAGES_ZK=false
Refer to Source - Zookeeper
One quick solution could be to restart HBase:
1) stop-hbase.sh
2) start-hbase.sh
I had the exact same error. The Linux firewall was blocking connectivity. One can test ports via telnet. A quick fix is to turn off the firewall and see if it fixes it:
Completely disable the firewall on all of your nodes. Note: this command will not survive a reboot of your machines.
systemctl stop firewalld
Long term fix is that you must configure the firewall to allow the hbase ports.
Note, your version of hbase may use different ports:
https://issues.apache.org/jira/browse/HBASE-10123
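If telnet is not installed, the same port test can be done with a few lines of Python. This is a minimal sketch: the host name is a placeholder, 2181 is ZooKeeper's usual client port, and 16000 is an assumed HMaster port that depends on your HBase version (see the link above).

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name not resolved
        return False


if __name__ == "__main__":
    # Placeholder host and assumed ports; adjust to your cluster and HBase version.
    for name, port in (("ZooKeeper", 2181), ("HMaster", 16000)):
        state = "reachable" if port_open("localhost", port) else "NOT reachable"
        print(f"{name} port {port}: {state}")
```

A port that is reachable from the node itself but not from another machine points to the firewall rather than to HBase.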
The output from the HBase shell is so high-level that many different misconfigurations can cause this message. To help yourself debug, it is much better to look into the HBase log in
/var/log/hbase
to figure out the root cause of the issue.
I had the same problem too. For me, my root cause was due to hadoop-kms having a conflicting port number with my hbase-master. Both of them are using port 16000 so my HMaster didn't even get started when I invoke hbase shell. After I fixed that, my hbase worked.
Again, kms port conflict might not be your root-cause. Strongly suggest looking into /var/log/hbase to find the root cause.
In my case, with the same error when running HBase, I did not include the ZooKeeper properties in hbase-site.xml and still got the above error message (according to the Apache HBase guide, only two properties, rootdir and distributed, are essential).
Tracing back through the output of the jps command, I found that my HRegionServer and HMaster were indeed not up and running properly.
After a stop and start (like a reset), both were up and running and I could run HBase properly.
If this is happening in VMware or VirtualBox, restart Cloudera with the command init 1 (check that you have root privileges) and retry. I hope it helps :)
hbase shell

How to get datanode timeout?

I have a 3 node hadoop setup, with replication factor as 2.
When one of my datanodes dies, the namenode waits for 10 minutes before removing it from the live nodes. Until then my HDFS writes fail with a "bad ack from node" error.
Is there a way to set a smaller timeout (like 1 minute) so that the node whose datanode died is discarded sooner?
Setting the following in your hdfs-site.xml will give you a 1-minute timeout.
<property>
<name>heartbeat.recheck.interval</name>
<value>15</value>
<description>Determines datanode heartbeat interval in seconds</description>
</property>
If above doesn't work - try the following (seems to be version-dependent):
<property>
<name>dfs.heartbeat.recheck.interval</name>
<value>15</value>
<description>Determines datanode heartbeat interval in seconds.</description>
</property>
The timeout equals 2 * heartbeat.recheck.interval + 10 * heartbeat.interval. The default for heartbeat.interval is 3 seconds.
In the version of Hadoop that we use, dfs.heartbeat.recheck.interval should be specified in milliseconds (check the code/doc of your version of Hadoop, to validate that).
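As a quick sanity check of the formula, here is the arithmetic in Python. Both arguments are taken in seconds here, which is an assumption; as noted, some versions expect the recheck interval in milliseconds, so convert before using the result.

```python
def datanode_timeout_seconds(recheck_interval_s, heartbeat_interval_s=3):
    """Dead-node timeout: 2 * recheck interval + 10 * heartbeat interval."""
    return 2 * recheck_interval_s + 10 * heartbeat_interval_s


if __name__ == "__main__":
    # Recheck interval of 15 s with the default 3 s heartbeat: the 1-minute timeout.
    print(datanode_timeout_seconds(15))      # 60
    # Recheck 1 s and heartbeat 2 s: roughly 20 seconds.
    print(datanode_timeout_seconds(1, 2))    # 22
    # Assuming a 5-minute recheck interval, the result is about 10 minutes.
    print(datanode_timeout_seconds(300))     # 630
```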
I've managed to make this work. I'm using Hadoop version 0.2.2.
Here's what I added to my hdfs-site.xml:
<property>
<name>dfs.heartbeat.interval</name>
<value>2</value>
<description>Determines datanode heartbeat interval in seconds.</description>
</property>
<property>
<name>dfs.heartbeat.recheck.interval</name>
<value>1</value>
<description>Determines when machines are marked dead</description>
</property>
These parameters can differ in other versions of Hadoop. Here's how to check that you're using the right ones: once you have set them, start your master and check the configuration at:
http://your_master_machine:19888/conf
If you don't find "dfs.heartbeat.interval" and/or "dfs.heartbeat.recheck.interval" in there, that means you should try using their version without the "dfs." prefix:
"heartbeat.interval" and "heartbeat.recheck.interval"
Finally, to check that the dead datanode is no longer used after the desired amount of time, kill a datanode, then repeatedly check the console at:
http://your_master_machine:50070
For me, with the configuration shown here, I can see that a dead datanode is removed after about 20 seconds.

Resources