Is there any way to find out the raw HDFS space consumption of a directory? As far as I know,
hdfs dfs -du -s /dir
shows the size of /dir without accounting for the replication of the files inside it.
Run the command hadoop fsck /dir and look for the parameter Average block replication. Multiply this number by the result you get from hdfs dfs -du -s /dir.
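The arithmetic above can be scripted. The sketch below works on canned sample text standing in for real command output, since the parsing only depends on the output format, not on a live cluster; the "Average block replication" line and the byte count are fabricated for illustration (on a real cluster you would capture the output of `hadoop fsck /dir` and `hdfs dfs -du -s /dir` instead):

```shell
# Sample stand-ins for live command output (values are made up):
fsck_out=" Average block replication:     3.0"   # line from `hadoop fsck /dir`
du_bytes=1073741824                              # result of `hdfs dfs -du -s /dir` (1 GiB logical)

# Extract the average replication factor (last field of the matching line).
repl=$(printf '%s\n' "$fsck_out" | awk '/Average block replication/ {print $NF}')

# Raw consumption ≈ logical size × average replication.
raw=$(awk -v s="$du_bytes" -v r="$repl" 'BEGIN {printf "%.0f", s * r}')
echo "estimated raw bytes: $raw"
```

Note that this is an estimate: the average replication is taken over blocks, not bytes, so directories mixing files with different replication factors will be off slightly.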
I have created a directory and set a quota in HDFS using the following commands:
hdfs dfs -mkdir /user/hdadmin/directorio_prueba
hdfs dfsadmin -setQuota 4 /user/hdadmin/directorio_prueba
Then I put some files in it:
hdfs dfs -put /opt/bd/ejemplo1.txt /user/hdadmin/directorio_prueba
hdfs dfs -put /opt/bd/ejemplo2.txt /user/hdadmin/directorio_prueba
hdfs dfs -put /opt/bd/ejemplo3.txt /user/hdadmin/directorio_prueba
But when I tried to put a fourth file, HDFS refused, saying "The NameSpace quota (directories and files) of directory /user/hdadmin/directorio_prueba is exceeded: quota=4 file count=5". I only have 3 files, but apparently there are already 4 items (directories and files) in the directory. I have also used the following command to gather more information:
hdfs dfs -count -q -h -v /user/hdadmin/directorio_prueba
So there is a hidden directory there. What is this directory? Maybe "." or ".."?
You can view hidden files directly using the command
hdfs dfs -ls /user/hdfs
Please read show hidden hdfs files
Reading the official documentation for the HDFS name quotas (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html) I found this: 'A quota of one forces a directory to remain empty. (Yes, a directory counts against its own quota!)'.
So there is no ".." or "." directory; it is the directory itself that counts toward the quota. The "." entry is simply never shown explicitly, which is why the hdfs dfs -ls /user/hdadmin/directorio_prueba command did not list any hidden directory.
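The accounting can be sketched with plain shell arithmetic, no cluster needed. The numbers mirror the question's example (quota of 4, three ejemplo*.txt files already present):

```shell
quota=4        # value passed to `hdfs dfsadmin -setQuota`
dir_itself=1   # the directory counts against its own namespace quota
files=3        # ejemplo1..3.txt already in the directory

used=$((dir_itself + files))
room=$((quota - used))
echo "namespace entries used: $used, room left: $room"
# used already equals the quota, so a fourth `-put` would push the
# count to 5 and fail with the quota-exceeded error from the question.
```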
How to find full path for HDFS storage in my system?
e.g. I have /user/cloudera/ folder on hdfs storage, but what is path to the "/user/cloudera"? Are there any specific commands?
hdfs dfs -ls and hdfs dfs -ls -R return only the directory listing, not the path.
My question is different, because there you don't end up with the HDFS path.
If you are an HDFS admin, you can run:
hdfs fsck /user/cloudera -files -blocks -locations
References:
HDFS Commands Guide: fsck
hdfs file actual block paths
My Hadoop cluster shows it has less than 20% disk space left. I am using this command to see disk space:
hdfs dfsadmin -report
However, I don't know which directories/files take up the most space. Is there a way to find out?
Use the following command:
hdfs dfs -du /
It displays the sizes of the files and directories contained in the given directory, or the length of the file if it is just a file.
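To find the biggest consumers, pipe the du output through sort. The sketch below runs that pipeline on canned du-style output (size in bytes, then path; the paths and sizes are made up), since the sorting step depends only on the text format, not on a cluster:

```shell
# Fabricated stand-in for `hdfs dfs -du /` output:
du_out='1024 /tmp
941621248 /user
52428800 /var/log'

# Numeric reverse sort on the size column, largest first.
biggest=$(printf '%s\n' "$du_out" | sort -n -r | head -n 1 | awk '{print $2}')
echo "largest top-level dir: $biggest"
```

On a live cluster, `hdfs dfs -du / | sort -n -r | head` gives the same ranking directly; descend into the largest directory and repeat to drill down.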
I have set the replication factor for my file as follows:
hadoop fs -D dfs.replication=5 -copyFromLocal file.txt /user/xxxx
When a NameNode restarts, it makes sure under-replicated blocks are replicated.
Hence the replication info for the file must be stored somewhere (possibly in the NameNode). How can I get that information?
Try the command hadoop fs -stat %r /path/to/file; it should print the replication factor.
You can run the following command to get the replication factor:
hadoop fs -ls /user/xxxx
The second column of the output is the replication factor of each file; for a directory it shows -.
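Extracting that column can be scripted. The sketch below parses two canned sample lines imitating `hadoop fs -ls` output (the user, group, sizes, and paths are fabricated); the second field is the replication factor:

```shell
# Fabricated stand-in for `hadoop fs -ls /user/xxxx` output:
ls_out='-rw-r--r--   3 hdadmin supergroup 1048576 2020-01-01 12:00 /user/xxxx/file.txt
drwxr-xr-x   - hdadmin supergroup       0 2020-01-01 12:00 /user/xxxx/dir'

# Field 2 is the replication factor ("-" for directories).
repl_file=$(printf '%s\n' "$ls_out" | awk 'NR==1 {print $2}')
repl_dir=$(printf '%s\n' "$ls_out" | awk 'NR==2 {print $2}')
echo "file: $repl_file, dir: $repl_dir"
```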
Apart from Alexey Shestakov's answer, which works perfectly and does exactly what you ask, other ways, mostly found here, include:
hadoop dfs -ls /parent/path
which shows the replication factors of all the /parent/path contents in the second column.
Through Java, you can get this information by using:
FileStatus.getReplication()
You can also see the replication factors of files by using:
hadoop fsck /filename -files -blocks -racks
Finally, from the web UI of the namenode, I believe that this information is also available (didn't check that).
We can use the following commands to check the replication of a file:
hdfs dfs -ls /user/cloudera/input.txt
or
hdfs dfs -stat %r /user/cloudera/input.txt
In case you need to check the replication factor of an HDFS directory,
hdfs fsck /tmp/data
shows the average replication factor of the /tmp/data HDFS folder.
We have, perhaps unsurprisingly given how fascinating big data is to the business, a disk space issue we'd like to monitor on our Hadoop clusters.
I have a cron job running and it is doing just what I want except that I'd like one of the output lines to show the overall space used. In other words, in bash, the very last line of a "du /" command shows the total usage for all the subfolders on the entire disk. I'd like that behavior.
Currently when I run "hadoop dfs -du /", however, I get only the subdirectory info and not the overall total.
What's the best way to get this?
Thank you so much to all you super Stack Overflow people :).
I just didn't understand the docs correctly! Here is the answer for getting the total space used:
$ hadoop dfs -dus /
hdfs://MYSERVER.com:MYPORT/ 999
$ array=(`hadoop dfs -dus /`)
$ echo $array
hdfs://MYURL:MYPORT/
$ echo ${array[1]} ${array[0]}
999 hdfs://MYURL:MYPORT/
Reference; File System Shell Guide
http://hadoop.apache.org/docs/r1.2.1/file_system_shell.html#du
//edit; Also corrected the order of reporting to match the original.
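The same word-splitting trick works on any one-line "path size" output. The sketch below reuses the sample values from the answer above (the URL and the 999-byte total are placeholders, not real cluster data); note that `-dus` is deprecated in later Hadoop releases in favour of `hdfs dfs -du -s`:

```shell
# Placeholder stand-in for `hadoop dfs -dus /` output (path, then total bytes):
dus_out='hdfs://MYSERVER.com:MYPORT/ 999'

# Word-split into a bash array: element 0 is the path, element 1 the size.
array=($dus_out)
echo "${array[1]} ${array[0]}"
```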
hadoop fs -du -s -h /path
This will give you the summary.
For the whole cluster you can try :
hdfs dfsadmin -report
You may need to run this as the HDFS user.
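If you only want the cluster-wide usage percentage out of the report, you can grep it out. The sketch below parses a canned sample imitating typical `hdfs dfsadmin -report` header lines (the capacity figures are fabricated):

```shell
# Fabricated stand-in for the header of `hdfs dfsadmin -report`:
report='Configured Capacity: 1000000000 (953.67 MB)
DFS Used: 820000000 (782.01 MB)
DFS Used%: 82.00%'

# Pull the value after "DFS Used%: ".
used_pct=$(printf '%s\n' "$report" | awk -F': ' '/^DFS Used%/ {print $2}')
echo "cluster used: $used_pct"
```

On a live cluster, `hdfs dfsadmin -report | awk -F': ' '/^DFS Used%/ {print $2}'` does the same in one line.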