Hadoop HDFS file recovery elapsed time (start time and end time)

I need to measure the recovery speed for files of different sizes stored with different storage policies (replication and erasure coding) in HDFS.
My question is: is there a way to see the start time and end time (or simply the elapsed time in seconds) of the recovery process for a specific file in HDFS?
By start time I mean the moment the system detects the node failure (and starts the recovery process), and by end time the moment HDFS has recovered the data (possibly reallocating blocks to other nodes) and made the file "stable" again.
Maybe I can look into some metadata or log files to find timestamps for the particular file? Or is there a file that records all the activity on an HDFS file?
I would really appreciate some terminal commands to get this information.
Thank you so much in advance!
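One way to bracket the recovery window is to find the file's block IDs with `hdfs fsck /path/to/file -files -blocks -locations` and then grep the NameNode log for those block IDs, taking the elapsed time between the first "under-replicated" message and the last recovery message. The sketch below is self-contained: the log excerpt, file name, and message text are illustrative only, not real NameNode output — on a real cluster you would grep your actual NameNode log under the Hadoop log directory.

```shell
# Hypothetical NameNode log excerpt -- message text is illustrative only.
# On a real cluster, grep your NameNode log for the block ID reported by:
#   hdfs fsck /path/to/file -files -blocks -locations
cat > namenode_excerpt.log <<'EOF'
2023-05-01 12:00:05,123 INFO ... blk_1073741825 marked under-replicated
2023-05-01 12:03:05,456 INFO ... blk_1073741825 recovery finished
EOF

# Elapsed seconds between the first and last matching line: the Hadoop log
# timestamp is everything before the comma (the part after it is milliseconds).
start=$(head -n1 namenode_excerpt.log | cut -d',' -f1)
end=$(tail -n1 namenode_excerpt.log | cut -d',' -f1)
echo "recovery took $(( $(date -d "$end" +%s) - $(date -d "$start" +%s) ))s"

rm -f namenode_excerpt.log
```

Polling `hdfs fsck` on the file until it stops reporting under-replicated or missing blocks gives a coarser but script-friendly end marker.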

Related

Why do files need time to synchronize after writing in writeType THROUGH in Alluxio?

When I write a file to a directory mounted by alluxio-fuse using writeType THROUGH, I find that it takes 2-3 minutes for the file to synchronize. Why do files need time to synchronize?
The mount directory is shown below. Write time: 15:40; after sync: 15:43.
When writing to Alluxio with the THROUGH writeType, Alluxio first updates its metadata to show the file with zero bytes (like the 9.txt in your first image). Once the file has been successfully written to the Alluxio under file system, Alluxio updates its metadata to show the actual size of the file (9.txt then shows its actual size of 209715200 bytes).
The 2 to 3 minutes is the time Alluxio takes to write the data to the under file system.

How to find the slowest datanodes?

I installed an HDFS cluster with 15 datanodes. Sometimes the write performance of the entire cluster is slow.
How do I find the slowest datanode, i.e. which node is causing the problem?
The most common cause of a slow datanode is a bad disk. Disk timeout defaults (before an EIO error is returned) range from 30 to 90 seconds, so any activity on that disk will take a long time.
You can check this by looking at dfs.datanode.data.dir in hdfs-site.xml on every datanode and verifying that each of the directories listed actually works.
For example:
run ls on the directory
cd into the directory
create a file under the directory
write into a file under the directory
read contents of a file under the directory
If any of these activities don't work or take a long time, then that's your problem.
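The steps above can be sketched as a small script that exercises a directory and times each operation. `DIR` would normally be one of the paths from dfs.datanode.data.dir (which you can print with `hdfs getconf -confKey dfs.datanode.data.dir`); a temp directory is used here only so the sketch is self-contained.

```shell
# Exercise a datanode data directory the way the steps above describe,
# timing each operation. A slow disk shows up as a long `real` time.
DIR=$(mktemp -d)          # substitute a path from dfs.datanode.data.dir

time ls "$DIR" > /dev/null              # 1. list the directory
cd "$DIR"                               # 2. cd into it
time touch probe.txt                    # 3. create a file
time sh -c 'echo disk-ok > probe.txt'   # 4. write into it
time cat probe.txt                      # 5. read it back

cd / && rm -rf "$DIR"                   # clean up the probe
```

On a healthy disk each step finishes in milliseconds; if one hangs for tens of seconds, that directory's disk is the likely culprit, and `dmesg` on that host should show the corresponding I/O errors.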
You can also run dmesg on each host and look for disk errors.
Additional Information
HDFS DataNode Scanners and Disk Checker Explained
HDFS-7430 - Rewrite the BlockScanner to use O(1) memory and use multiple threads
https://superuser.com/questions/171195/how-to-check-the-health-of-a-hard-drive

When runtime modifications are written to Edits log file in Name Node, is the Edits Log file getting updated on RAM or Local Disk

The answer is both: first on disk, then in RAM.
To start with, the edits log is a logical entity; in reality it can be many files (called segments) with names like "edits_xxxxxxxxxxx", each containing a series of records (transactions) for namespace operations in HDFS such as appending a file, deleting a file, etc.
The edits file/segment is updated first (on disk), and only then is the NameNode's in-memory metadata updated. This in-memory data is then served to clients that request it.
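The write-ahead ordering can be sketched as below (bash). The file name, operation records, and the `namespace` map are all illustrative stand-ins, not the NameNode's real on-disk layout: the point is only that the durable append happens before the in-memory update.

```shell
# Sketch of write-ahead ordering: persist the transaction to the on-disk
# edits file first, then apply it to the in-memory namespace map.
EDITS=$(mktemp)
declare -A namespace            # stands in for the NN's in-memory metadata

apply_txn() {                   # $1 = op record, $2 = path, $3 = type
  echo "$1" >> "$EDITS"         # step 1: durable append (the NN also syncs)
  namespace["$2"]="$3"          # step 2: only now update in-memory state
}

apply_txn "OP_MKDIR /data" "/data" "dir"
apply_txn "OP_ADD /data/a.txt" "/data/a.txt" "file"

cat "$EDITS"                    # the two op records, in order
echo "in-memory entries: ${#namespace[@]}"
rm -f "$EDITS"
```

Because the disk write comes first, a crash between the two steps loses nothing: on restart the NameNode replays the edits segments to rebuild the in-memory state.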
Courtesy: Hadoop - The definitive guide.

How to detect memory leak or other system-wide problems caused by a batch processing job?

First the question:
How do I monitor the complete flow of the batch file, including influences to the surrounding system, all the way from initiating the cmd.exe to tearing it down after the script has completed?
Then the reason:
I've recently set up a batch script job to extract data, pack the data and transfer the compressed file to an external storage. It's a massive job executing for about 10 hours, and using lots of disk space.
It's the only change to the system (Windows Server 2008 SE SP2, 32-bit with 4 GB RAM) in a very long time, and just a few minutes after running the full script for the second time, the server crashed hard and had to be power-cycled.
I've been going through the script's log file, and everything worked flawlessly - at least down to the last line in the script that outputs the last line in the log file. Then I checked the system's event logs and found nothing to indicate an imminent problem... But I still very much suspect this script to be the triggering cause! Perhaps some sort of memory leak or memory fragmentation could be involved?
Finally the overall script operation and numbers:
The data is extracted from database files (Subversion) on one disk and temporarily stored on another disk (in about 150,000 files of varying size, taking up about 250 GB of space), then combined into about 1,000 files which are compressed with 7zip down to about 22 GB and transferred to external storage. Finally, all temporary files are removed, with the exception of a log file.
The initial batch script calls several other batch scripts in the course of processing the data; during the task described above, other scripts or commands are called about 20,000 times. The scripts total about 600 lines, with several for loops involved.

WAL files in HBase

In HBase, before data is written into the memstore it is first written into the WAL. But when I checked on my system,
the WAL files are not updated immediately after each Put operation; it takes a long time for them to update. Is there any parameter I need to set?
(WAL has been enabled)
Do you know how long it takes to update the WAL files? Are you sure the time is spent in the write, or has the WAL already been moved to the old logs by the time you check it? If WAL is enabled, all entries must go to the WAL first and are then written to the particular region as the cluster is configured.
I know that WAL files are moved to .oldlogs fairly quickly, i.e. within 60 seconds, as defined in hbase-site.xml through the hbase.master.logcleaner.ttl setting.
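For reference, that setting lives in hbase-site.xml and takes a value in milliseconds; a fragment matching the 60-second figure mentioned above would look like this (60000 ms is an illustrative value, not the shipped default):

```xml
<property>
  <name>hbase.master.logcleaner.ttl</name>
  <value>60000</value> <!-- time-to-live for old WAL files, in milliseconds -->
</property>
```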
In standalone mode, writing to the WAL takes a long time, whereas in pseudo-distributed mode it works fine.
