How to find the slowest datanodes? - hadoop

I installed an HDFS cluster that has 15 datanodes. Sometimes the write performance of the entire HDFS cluster is slow.
How do I find the slowest datanode, i.e. the node that may be causing this problem?

The most common cause of a slow datanode is a bad disk. Disk I/O error (EIO) timeouts typically default to somewhere between 30 and 90 seconds, so any activity on that disk will take a long time.
You can check this by looking at dfs.datanode.data.dir in hdfs-site.xml on every datanode and verifying that each of the directories listed actually works.
For example:
run ls on the directory
cd into the directory
create a file under the directory
write into a file under the directory
read contents of a file under the directory
If any of these activities don't work or take a long time, then that's your problem.
You can also run dmesg on each host and look for disk errors.
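If it helps, here is a rough scripted version of the checks above to run on each datanode. It is a sketch only: it assumes dfs.datanode.data.dir contains plain comma-separated local paths (no [DISK]-style storage tags) and uses a hypothetical disk_probe file name.
for d in $(hdfs getconf -confKey dfs.datanode.data.dir | sed 's|file://||g' | tr ',' ' '); do
  echo "== $d =="
  time ls "$d" > /dev/null                                              # directory listing
  time dd if=/dev/zero of="$d/disk_probe" bs=1M count=10 oflag=direct   # small write test
  time dd if="$d/disk_probe" of=/dev/null bs=1M iflag=direct            # small read test
  rm -f "$d/disk_probe"
done
dmesg | grep -iE 'i/o error|sd[a-z].*error'                             # kernel-level disk errors
Any directory where these steps hang or take much longer than on the other datanodes points at the bad disk.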
Additional Information
HDFS DataNode Scanners and Disk Checker Explained
HDFS-7430 - Rewrite the BlockScanner to use O(1) memory and use multiple threads
https://superuser.com/questions/171195/how-to-check-the-health-of-a-hard-drive

Related

How do I know what is safe to delete in the /mnt/yarn/usercache and /var/log/hadoop-yarn/containers directories?

I have an EMR cluster running on AWS. I look in YARN and I see that 4 of my workers have this "unhealthy status" due to
1/2 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
So I ssh into the worker nodes and run df, and sure enough /mnt/yarn is at 99% disk space used. Also, a lot of the stderr and stdout files are taking up a lot of space in the /var/log/hadoop-yarn/containers directory. My question is: what is safe to delete and what is not? I feel like I have been going down a rabbit hole and am still nowhere near figuring out how to free up disk space on my worker nodes after reading for hours. I have been reading about the /mnt/yarn/usercache directory, and it seems like the contents of that directory are "local resources" used to run my Spark application. But /mnt/yarn/usercache/hadoop/filecache and /mnt/yarn/usercache/hadoop/appcache are taking up 3% and 96% of the disk space, respectively, in /mnt/yarn.
You probably need to clear this folder on HDFS: /var/log/hadoop-yarn/apps/hadoop/logs/
Try hdfs dfs -ls /var/log/hadoop-yarn/apps/hadoop/logs to view it.
Another location to check is /mnt/var/log/hadoop-yarn/containers on the executors.
There should be folders there with names like "application_someId". These folders contain logs for finished and running Spark jobs.
Yes, you can delete the container files in /mnt/var/log/hadoop-yarn/containers (as well as the log files in there). I had a very similar problem.
I deleted the files, stopped and restarted Spark on EMR, and my unhealthy nodes came back to a healthy state.
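For what it's worth, a minimal cleanup sketch along those lines. The application ID and paths below are made-up examples; confirm the applications are actually finished (yarn application -list) before deleting anything.
# Aggregated application logs kept on HDFS -- find the biggest, then remove old ones.
hdfs dfs -du /var/log/hadoop-yarn/apps/hadoop/logs | sort -n | tail -20
hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/hadoop/logs/application_1234567890123_0042
# Local container logs and the appcache on each worker node.
du -sh /mnt/var/log/hadoop-yarn/containers/application_* | sort -h | tail -20
du -sh /mnt/yarn/usercache/hadoop/appcache/application_* | sort -h | tail -20
rm -rf /mnt/var/log/hadoop-yarn/containers/application_1234567890123_0042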

How to successfully complete a namenode restart with 5TB worth of edit files to process

I have a namenode that had to be brought down for an emergency that has not had an FSImage taken for 9 months and has about 5TB worth of edit files to process in its next restart. The secondary namenode has not been running (or had any checkpoint operations performed) since about 9 months ago, thus the 9 month old FSImage.
There are about 7.8 million inodes in the HDFS cluster. The machine has about 260GB of total memory.
We've tried a few different combinations of Java heap size, GC algorithms, etc., but have not been able to find a combination that allows the restart to complete without eventually slowing down to a crawl due to full GCs (FGCs).
I have 2 questions:
1. Has anyone found a namenode configuration that allows this large an edit file backlog to complete successfully?
2. An alternate approach I've considered is restarting the namenode with only a manageable subset of the edit files present. Once the namenode comes up and creates a new FSImage, bring it down, copy the next subset of edit files over, and then restart it. Repeat until it's processed the entire set of edit files. Would this approach work? Is it safe to do, in terms of the overall stability of the system and the file system?
We were able to get through the 5TB backlog of edits files using a version of what I suggested in my question (2) on the original post. Here is the process we went through:
Solution:
1. Make sure that the namenode is "isolated" from the datanodes. This can be done by either shutting down the datanodes or just removing them from the slaves list while the namenode is offline. This keeps the namenode from communicating with the datanodes before the entire backlog of edits files is processed.
2. Move the entire set of edits files to a location outside of what is configured in the dfs.namenode.name.dir property of the namenode's hdfs-site.xml file.
3. Move (or copy, if you would like to maintain a backup) the next subset of edits files to be processed to the dfs.namenode.name.dir location. If you are not familiar with the naming convention for the FSImage and edits files, take a look at the example below. It will hopefully clarify what is meant by "next subset of edits files".
4. Update the file seen_txid to contain the value of the last transaction represented by the last edits file from the subset you copied over in step 3. So if the last edits file is edits_0000000000000000011-0000000000000000020, you would want to update the value of seen_txid to 20. This essentially fools the namenode into thinking this subset is the entire set of edits files.
5. Start up the namenode. If you take a look at the Startup Progress tab of the HDFS Web UI, you will see that the namenode starts with the latest present FSImage, processes through the edits files present, creates a new FSImage file, and then goes into safemode while it waits for the datanodes to come online.
6. Bring down the namenode.
7. There will be an edits_inprogress_######## file created as a placeholder by the namenode. Unless this is the final set of edits files to process, delete this file.
8. Repeat steps 3-7 until you've worked through the entire backlog of edits files.
9. Bring up the datanodes. The namenode should get out of safemode once it has been able to confirm the location of a sufficient number of data blocks.
10. Set up a secondary namenode, or high availability for your cluster, so that the FSImage will periodically get created from now on.
Example:
Let's say we have FSImage fsimage_0000000000000000010 and a bunch of edits files:
edits_0000000000000000011-0000000000000000020
edits_0000000000000000021-0000000000000000030
edits_0000000000000000031-0000000000000000040
edits_0000000000000000041-0000000000000000050
edits_0000000000000000051-0000000000000000060
...
edits_0000000000000000091-0000000000000000100
Following the steps outlined above:
1. All datanodes are brought offline.
2. All edits files are copied from dfs.namenode.name.dir to another location, e.g. /tmp/backup.
3. Let's process 2 files at a time, so copy edits_0000000000000000011-0000000000000000020 and edits_0000000000000000021-0000000000000000030 over to the dfs.namenode.name.dir location.
4. Update seen_txid to contain a value of 30, since this is the last transaction we will be processing during this run.
5. Start up the namenode, and confirm through the HDFS Web UI's Startup Progress tab that it correctly used fsimage_0000000000000000010 as a starting point and then processed edits_0000000000000000011-0000000000000000020 and edits_0000000000000000021-0000000000000000030. It then created a new FSImage file fsimage_0000000000000000030 and entered safemode, waiting for the datanodes to come up.
6. Bring down the namenode.
7. Delete the placeholder file edits_inprogress_######## since this is not the final set of edits files to be processed.
8. Proceed with the next run and repeat until all edits files have been processed.
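For reference, a rough shell sketch of one iteration of steps 3-7. The paths, the batch of edits files, and the daemon commands are illustrative; on older Hadoop 2.x releases the equivalent of hdfs --daemon start namenode is hadoop-daemon.sh start namenode.
NN_DIR=/data/dfs/nn/current            # the current/ directory under dfs.namenode.name.dir (example path)
BACKUP=/tmp/backup                     # where the full set of edits files was moved in step 2
# Step 3: copy the next subset of edits files into place.
cp "$BACKUP"/edits_0000000000000000011-0000000000000000020 "$NN_DIR"/
cp "$BACKUP"/edits_0000000000000000021-0000000000000000030 "$NN_DIR"/
# Step 4: point seen_txid at the last transaction of this batch.
echo 30 > "$NN_DIR"/seen_txid
# Step 5: start the namenode and watch the Startup Progress tab until the new fsimage appears.
hdfs --daemon start namenode
# Step 6: once fsimage_0000000000000000030 exists, stop the namenode again.
hdfs --daemon stop namenode
# Step 7: remove the placeholder, unless this was the final batch.
rm -f "$NN_DIR"/edits_inprogress_*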
If your Hadoop cluster is HA-enabled, the standby NameNode should have taken care of this; in a non-HA setup, your secondary NameNode should have.
Check the logs of these NameNode processes to see why they are unable to merge, or why they are failing.
The parameters below control how often edits are checkpointed; with them set correctly, this many edits files should never have accumulated.
dfs.namenode.checkpoint.period
dfs.namenode.checkpoint.txns
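You can check what your cluster currently uses for these with hdfs getconf, for example:
hdfs getconf -confKey dfs.namenode.checkpoint.period   # seconds between checkpoints, default 3600
hdfs getconf -confKey dfs.namenode.checkpoint.txns     # uncheckpointed transactions that force one, default 1000000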
Another way is to manually perform the merge, but this would be a temporary fix:
hdfs dfsadmin -safemode enter
hdfs dfsadmin -rollEdits
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave
Running the commands above should merge the edits and save the namespace.

Hadoop HDFS file recovery elapsed time (start time and end time)

I need to measure the speed of recovery for files with different file sizes stored with different storage policies (replication and erasure codes) in HDFS.
My question is: Is there a way to see the start time and end time (or simply the elapsed time in seconds) for a file recovery process in HDFS? For a specific file?
I mean the start time from when the system detects a node failure (and starts the recovery process) until HDFS recovers the data (and possibly reallocates blocks to other nodes) and makes the file "stable" again.
Maybe I can look into some metadata or log files for the particular file to see some timestamps? Or is there a file where I can see all the activity of an HDFS file?
I would really appreciate some terminal commands to get this info.
Thank you so much in advance!

What is meant by 'local file system'?

I'm currently reading about hadoop and I came across this which has puzzled me (please bear in mind that I am a complete novice when it comes to hadoop) -
Use the Hadoop get command to copy a file from HDFS to your local file
system:
$ hadoop hdfs dfs -get file_name /user/login_user_name
What is a local file system? I understand that HDFS partitions a file into different blocks throughout the cluster (but I know there's more to it than that). My understanding of the above command is that I can copy a file from the cluster to my personal (i.e. local) computer? Or is that completely wrong? I'm just not entirely sure what is meant by a local file system.
Local FS means your Linux FS or Windows FS, i.e. whatever is not part of the DFS.
Your understanding is correct: using -get you are copying a file from HDFS to the local FS. Note that you cannot use both hadoop and hdfs together; the command should be like one of the following:
hdfs dfs -get file_name local_path or hadoop fs -get file_name local_path
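For example (the paths are just placeholders):
hdfs dfs -ls /user/login_user_name/file_name                          # the file as it exists on HDFS
hdfs dfs -get /user/login_user_name/file_name /home/login_user_name/  # copy it to the local FS
ls -l /home/login_user_name/file_name                                 # now visible with ordinary Linux tools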
Just as you can divide a file system into different drives, you can create the Hadoop file system as a separate file system on top of the Linux file system.
Your local file system is the file system on which you have installed Hadoop; your machine acts as "local" in this case when copying a file from your machine to Hadoop.
You might want to look at: HDFS vs LFS.
THINK of a cluster node (server) as having to fulfill 2 needs:
the need to store its own operating system, application and user data-related files; and
the need to store its portion of sharded or "distributed" cluster data files.
In each cluster data node, then, there need to be 2 independent file systems:
the LOCAL ("non-distributed") file system:
stores the OS and all OS-related ancillary ("helper") files;
stores the binary files which form the applications which run on the server;
stores additional data files, but these exist as simple files which are NOT sharded/replicated/distributed in the server's "cluster data" disks;
typically comprised of numerous partitions - entire formatted portions of a single disk or multiple disks;
typically also running LVM in order to ensure "expandability" of these partitions containing critical OS-related code, which cannot be permitted to fill up or the server will suffer catastrophic (unrecoverable) failure.
AND
the DISTRIBUTED file system:
stores only the sharded, replicated portions of what are actually massive data files "distributed" across all the other data drives of all the other data nodes in the cluster
typically comprised of at least 3 identical disks, all "raw" - unformatted, with NO RAID of any kind and NO LVM of any kind, because the cluster software (installed on the "local" file system) is actually responsible for its OWN replication and fault-tolerance, so that RAID and LVM would actually be REDUNDANT, and therefore cause unwanted LATENCY in the entire cluster performance.
LOCAL <==> OS and application and data and user-related files specific or "local" to the specific server's operation itself;
DISTRIBUTED <==> sharded/replicated data; capable of being concurrently processed by all the resources in all the servers in the cluster.
A file can START in a server's LOCAL file system where it is a single little "mortal" file - unsharded, unreplicated, undistributed; if you were to delete this one copy, the file is gone gone GONE...
... but if you first MOVE that file to the cluster's DISTRIBUTED file system, where it becomes sharded, replicated and distributed across at least 3 different drives likely on 3 different servers which all participate in the cluster, so that now if a copy of this file on one of these drives were to be DELETED, the cluster itself would still contain 2 MORE copies of the same file (or shard); and WHERE in the local system your little mortal file could only be processed by the one server and its resources (CPUs + RAM)...
... once that file is moved to the CLUSTER, now it's sharded into myriad smaller pieces across at least 3 different servers (and quite possibly many many more), and that file can have its little shards all processed concurrently by ALL the resources (CPUs & RAM) of ALL the servers participating in the cluster.
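To make that concrete, here is a small sketch (file and directory names invented for illustration) of moving a local file into the distributed file system and then asking HDFS how it was split up and replicated:
hdfs dfs -put /home/me/bigfile.csv /user/me/bigfile.csv        # local -> distributed
hdfs fsck /user/me/bigfile.csv -files -blocks -locations       # show its blocks, replicas and which datanodes hold them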
And that is the difference between the LOCAL file system and the DISTRIBUTED file system operating on each server, and that is the secret to the power of cluster computing :-) !
Hope this offers a clearer picture of the difference between these two often-confusing concepts!
-Mark from North Aurora

How to make Hadoop MapReduce process multiple files in a single run?

I run a Hadoop MapReduce program by executing the command $ hadoop jar my.jar DriverClass input1.txt hdfsDirectory. How can I make MapReduce process multiple files (input1.txt & input2.txt) in a single run?
Like this:
hadoop jar my.jar DriverClass hdfsInputDir hdfsOutputDir
where
hdfsInputDir is the path on HDFS where your input files are stored (i.e., the parent directory of input1.txt and input2.txt)
hdfsOutputDir is the path on HDFS where the output will be stored (it should not exist before running this command).
Note that your input should be copied to HDFS before running this command.
To copy it to HDFS, you can run:
hadoop fs -copyFromLocal localPath hdfsInputDir
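Putting the two together, a minimal end-to-end sketch (the directory names are examples):
hdfs dfs -mkdir -p /user/me/input                       # input directory on HDFS
hdfs dfs -copyFromLocal input1.txt input2.txt /user/me/input
hadoop jar my.jar DriverClass /user/me/input /user/me/output
hdfs dfs -cat /user/me/output/part-* | head             # inspect the job output (file names depend on the job)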
This is the small files problem: a separate mapper will run for every file.
A small file is one which is significantly smaller than the HDFS block size (default 64MB). If you’re storing small files, then you probably have lots of them (otherwise you wouldn’t turn to Hadoop), and the problem is that HDFS can’t handle lots of files.
Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies 150 bytes, as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible.
Solution
HAR files
Hadoop Archives (HAR files) were introduced to HDFS in 0.18.0 to alleviate the problem of lots of files putting pressure on the namenode’s memory. HAR files work by building a layered filesystem on top of HDFS. A HAR file is created using the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files. To a client using the HAR filesystem nothing has changed: all of the original files are visible and accessible (albeit using a har:// URL). However, the number of files in HDFS has been reduced.
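For example, a sketch of packing a directory of small files into a HAR (the paths are illustrative):
hadoop archive -archiveName files.har -p /user/me small-files /user/me/archives   # runs a MapReduce job
hdfs dfs -ls har:///user/me/archives/files.har/small-files                        # originals visible through the har:// scheme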
Sequence Files
The usual response to questions about “the small files problem” is: use a SequenceFile. The idea here is that you use the filename as the key and the file contents as the value. This works very well in practice. Taking, for example, 10,000 100KB files, you can write a program to put them into a single SequenceFile, and then you can process them in a streaming fashion (directly or using MapReduce) operating on the SequenceFile. There are a couple of bonuses too. SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently. They support compression as well, unlike HARs. Block compression is the best option in most cases, since it compresses blocks of several records (rather than per record).
