YARN : 1/1 local-dirs are bad Alert

I have an issue where all 3 node managers in my cluster are marked as unhealthy with "local-dirs are bad" alerts.
I have seen many answers saying this error is due to YARN reaching its default maximum disk utilization threshold of 90%, but I can assure you I have plenty of space on the YARN disk (only 35% of it is used). I suspect the YARN directory is corrupted.
Does anyone know of another cause or solution for this alert, other than YARN reaching its disk threshold?

I found the solution to this issue.
There was no write permission on the folder for any user except the owner. I granted write permission on the YARN folder to the yarn user and I could run the MapReduce job. All 3 node managers are healthy now.
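A minimal sketch of that kind of check and fix, assuming the NodeManager local directory is /hadoop/yarn/local and the usual yarn user and hadoop group (take the real path from yarn.nodemanager.local-dirs in yarn-site.xml):
ls -ld /hadoop/yarn/local                  # inspect the current owner and permissions
chown -R yarn:hadoop /hadoop/yarn/local    # hand the directory to the yarn user (user and group names are assumptions)
chmod -R 755 /hadoop/yarn/local            # give the owner write access; others keep read/traverse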

There are also other scenarios in which this can happen:
the disk is bad or going bad
the disk has space, but has exhausted its inodes
the disk was mounted read-only
the disk is NFS-mounted and the NFS server is down or has lost connectivity
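A few quick commands to rule these scenarios out on the NodeManager host (a sketch; the /hadoop mount point is an assumption, substitute whatever holds your YARN local and log dirs):
df -h /hadoop          # free space on the mount
df -i /hadoop          # free inodes on the mount
mount | grep /hadoop   # shows whether the filesystem is mounted read-only (ro) or over NFS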

Related

HDFS Showing 0 Blocks after cluster reboot

I've set up a small cluster for testing / academic purposes: 3 nodes, one of which acts as both namenode and datanode (and secondary namenode).
I uploaded 60 GB of files (about 6.5 million files) and uploads started to get really slow, so I read on the internet that I could stop the secondary namenode service on the main machine; at the time this had no noticeable effect.
After I rebooted all 3 computers, two of my datanodes show 0 blocks (despite showing disk usage in the web interface), even with both namenode services running.
One of the problem nodes is the one also running the namenode, so I am guessing it is not a network problem.
Any ideas on how I can get these blocks recognized again (without starting all over, which took about two weeks of uploading)?
Update
Half an hour after another reboot, this showed up in the logs:
2018-03-01 08:22:50,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x199d1a180e357c12, containing 1 storage report(s), of which we sent 0. The reports had 6656617 total blocks and used 0 RPC(s). This took 679 msec to generate and 94 msecs for RPC and NN processing. Got back no commands.
2018-03-01 08:22:50,212 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
java.io.EOFException: End of File Exception between local host is: "Warpcore/192.168.15.200"; destination host is: "warpcore":9000; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
And the EOF stack trace. After searching the web I found this [http://community.cloudera.com/t5/CDH-Manual-Installation/CDH-5-5-0-datanode-failed-to-send-a-large-block-report/m-p/34420], but I still can't understand how to fix it.
The block report is too big and needs to be split, but I don't know how or where to configure this. I'm still googling...
The problem seems to be low RAM on my namenode. As a workaround I added more directories to the datanode configuration, as if I had multiple disks, and rebalanced the files manually as instructed in the comments here.
Since Hadoop 3.0 reports each disk separately, the datanode was able to send its report and I was able to retrieve the files. This is an ugly workaround, not for production, but good enough for my academic purposes.
An interesting side effect was the datanode reporting the available disk space multiple times, which could lead to serious problems in production.
It seems a better solution is using HAR (Hadoop Archives) to reduce the number of blocks, as described here and here.
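For reference, building a Hadoop Archive is a single command (a sketch; the paths and archive name below are only examples):
hadoop archive -archiveName files.har -p /user/me/smallfiles /user/me/archives   # pack the directory of small files into one HAR
hdfs dfs -ls har:///user/me/archives/files.har                                   # the archived files stay readable through the har:// scheme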

HDFS /tmp filesystem is filling up rapidly and is expected to cause an outage

In our Hadoop cluster (Cloudera distribution), we recently found that a Hive job started by a user created 160 TB of files in the '/tmp' location; it consumed almost all of the remaining HDFS space and was about to cause an outage. We eventually killed that particular job, since we were unable to reach the user who had started it.
So my question is: can we set an alert on the '/tmp' location for when anyone creates huge files there, or should we restrict users with an HDFS quota? Please share any other suggestions as well.
You can set and manage quotas for a directory with the following commands:
hdfs dfsadmin -setQuota <N> <directory>...<directory>
hdfs dfsadmin -clrQuota <directory>...<directory>
hdfs dfsadmin -setSpaceQuota <N> <directory>...<directory>
hdfs dfsadmin -clrSpaceQuota <directory>...<directory>
where N for -setQuota is the maximum number of file and directory names allowed under the directory, and N for -setSpaceQuota is the maximum number of bytes (size suffixes such as 50g or 2t are also accepted)
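For example, to cap the raw space '/tmp' may consume and then verify it (a usage sketch; the 10t limit is only an illustrative value):
hdfs dfsadmin -setSpaceQuota 10t /tmp   # limit /tmp to 10 TB of raw (replication-included) space
hdfs dfs -count -q -h /tmp              # show the quotas, remaining quotas, and current usage
hdfs dfsadmin -clrSpaceQuota /tmp       # remove the space quota again if needed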
Reference Link
Helpful article
Hope this helps your scenario.
You can also manage resources from Cloudera Manager on the processing side via YARN resource pools: you can limit the maximum cores and memory allocated to each user or service running on your cluster.

Ambari metrics diskspacecheck issue

I am installing a new Hadoop cluster (5 nodes in total) using the Ambari dashboard. Deployment fails with disk space warnings and error messages such as "'/' needs at least 2GB of disk space for mount", even though I have allocated 50GB of disk to each node. Googling for a solution, I found that I need to set diskspacecheck=0 in the /etc/yum.conf file, as suggested in the link below (point 3.6):
http://docs.hortonworks.com/HDPDocuments/Ambari-2.1.0.0/bk_ambari_troubleshooting/content/_resolving_ambari_installer_problems.html
But I am using an Ubuntu image on the nodes, so there is no yum.conf, and I couldn't find any file with a "diskspacecheck" parameter. Can anybody tell me how to solve this issue and successfully deploy my cluster?
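For reference, on RPM-based hosts the workaround from the linked Hortonworks guide is a one-line change to yum's configuration (a sketch only; it does not apply to Ubuntu/apt installs like this one):
sudo sed -i '/^\[main\]/a diskspacecheck=0' /etc/yum.conf   # append diskspacecheck=0 under the [main] section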

Continuously shows Capacity Used 90%

I have two questions.
How do I mount the directory for the Ambari disk usage check?
I started running the TeraGen program and it does not get beyond 10% of the map tasks. Ambari continuously shows me the message: Capacity Used: [90.69%, 27.7 GB], Capacity Total: [30.5 GB], path=/usr/hdp. I restarted the cluster and restarted Ambari, but to no avail.
What is the way around this?
Well,
after some trial and error I found the solution for this.
You can change the location of the log and local directories to a disk with more space.
Remove the old log files from the Ambari server.
Documented here.
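A quick way to see what is actually filling the reported path before relocating anything (a sketch; the log locations below are typical HDP/Ambari defaults and may differ on your cluster):
df -h /usr/hdp                                          # confirm how full the mount really is
du -sh /var/log/ambari-* /var/log/hadoop* 2>/dev/null   # size of the usual Ambari and Hadoop log directories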

Region server going down frequently after system start

I am running HBase on HDP on an Amazon machine.
When I reboot my system and start all the HBase services, they start fine.
But after some time my region server goes down.
The latest error I am getting from its log file is:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /apps/hbase/data/usertable/dd5a251551619e0109349a0dce855e1b/recovered.edits/0000000000000001172.temp could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1657)
Now I am not able to start it.
Any suggestions as to why this is happening?
Thanks in advance.
Make sure your datanodes are up and running. Also, set "dfs.data.dir" to a permanent location if you haven't done so yet; it defaults to a directory under "/tmp", which gets emptied on each restart. Also make sure that your datanodes can talk to the namenode, that there are no network-related issues, and that the datanode machines have enough free space left.
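A couple of quick checks along those lines (a sketch; on very old Hadoop 1.x installs the first command is hadoop dfsadmin -report instead):
hdfs dfsadmin -report   # lists live and dead datanodes with their capacity and remaining space
jps                     # run on each datanode host to confirm the DataNode process is actually up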
