My Hadoop cluster shows it has less than 20% disk space left. I am using this command to see disk space:
hdfs dfsadmin -report
However, I don't know which directories/files take up the most space. Is there a way to find out?
Use the following command:
hdfs dfs -du /
It displays the sizes of the files and directories contained in the given directory, or the length of the file if the path is just a file.
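As a side note that is not part of the original answer: to see at a glance which top-level entries are the biggest, the byte counts from -du can be sorted, for example:
hdfs dfs -du / | sort -rn | head -10    # largest entries first; sizes are in bytes (hdfs dfs -du -h / gives human-readable sizes instead)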
I have a directory which contains multiple folders, with N files in each folder; a single file can be around 15 GB. I don't know the best way to copy/move the files from local to HDFS.
There are many (traditional) ways to do this, such as:
hdfs dfs -put /path/to/localdir/ hdfs://path/to/hdfsdir
hdfs dfs -copyFromLocal /path/to/localdir/ hdfs://path/to/hdfsdir
hdfs dfs -moveFromLocal /path/to/localdir/ hdfs://path/to/hdfsdir
hadoop distcp file:///path/to/localdir/ hdfs://namenode:port/path/to/hdfsdir
Options 1 and 2 are the same in your case; there will be no difference in copy time.
Option 3 might take a little more time, as it copies the data to HDFS (same as -put) and then deletes the file from the local filesystem.
Option 4 is a tricky one. It is designed for large inter/intra-cluster copying, but you can use the same command for local files as well by giving the local path with the "file://" prefix. This is not an optimal use of distcp: the tool is designed to work in parallel (using MapReduce), and since the source is on a local filesystem it cannot exploit that strength. (You could try creating a mount on the cluster nodes, which might improve distcp's performance.)
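One more idea that is not in the answer above: if the local directory holds several independent sub-folders, the copies can be run in parallel from a plain shell loop. A rough sketch, reusing the placeholder paths from the commands above:
# start one -put per local sub-folder in the background, then wait for all of them to finish
for d in /path/to/localdir/*/ ; do
    hdfs dfs -put "$d" /path/to/hdfsdir/ &
done
wait
Too many parallel streams can saturate the network or the local disks, so the degree of parallelism is worth tuning.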
Is there any way to find out the raw HDFS space consumed by a directory? As far as I know,
hdfs dfs -du -s /dir
shows the size of /dir without taking the replication of its files into account.
Run the command hadoop fsck /dir and look for the parameter Average block replication. Multiply this number by the result you get from hdfs dfs -du -s /dir.
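A rough shell sketch of that calculation (not part of the original answer; it assumes the fsck summary contains an "Average block replication" line and that bc is installed):
size=$(hdfs dfs -du -s /dir | awk '{print $1}')                                   # logical size in bytes
repl=$(hadoop fsck /dir | grep 'Average block replication' | awk '{print $NF}')   # e.g. 3.0
echo "$size * $repl" | bc                                                         # approximate raw bytes on disk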
I have set the replication factor for my file as follows:
hadoop fs -D dfs.replication=5 -copyFromLocal file.txt /user/xxxx
When a NameNode restarts, it makes sure under-replicated blocks are replicated.
Hence the replication info for the file must be stored somewhere (possibly in the NameNode). How can I get that information?
Try the command hadoop fs -stat %r /path/to/file; it should print the replication factor.
You can run the following command to get the replication factor:
hadoop fs -ls /user/xxxx
The second column in the output shows the replication factor for a file; for a directory it shows -.
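If you only want the replication factor and the path, the relevant columns can be extracted, for example (this assumes the standard -ls column layout and paths without spaces):
hadoop fs -ls /user/xxxx | awk 'NR > 1 {print $2, $NF}'   # skip the "Found N items" header; print replication factor and path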
Apart from Alexey Shestakov's answer, which works perfectly and does exactly what you ask, other ways include:
hadoop dfs -ls /parent/path
which shows the replication factors of all the /parent/path contents in the second column.
Through Java, you can get this information by using:
FileStatus.getReplication()
You can also see the replication factors of files by using:
hadoop fsck /filename -files -blocks -racks
Finally, I believe this information is also available from the NameNode web UI (I haven't checked that).
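As an aside not covered above, the fsck output itself can be filtered for the per-block replication; the exact field name varies between Hadoop versions (repl= vs. Live_repl=), so a case-insensitive grep catches both:
hadoop fsck /filename -files -blocks -racks | grep -i 'repl='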
We can use the following commands to check the replication factor of a file:
hdfs dfs -ls /user/cloudera/input.txt
or
hdfs dfs -stat %r /user/cloudera/input.txt
If you need to check the replication factor of an HDFS directory,
hdfs fsck /tmp/data
shows the average replication factor of the /tmp/data HDFS folder.
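The fsck report is fairly verbose, so it can help to filter for just that line:
hdfs fsck /tmp/data | grep -i 'Average block replication'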
We have, perhaps unsurprisingly given how fascinating big data is to the business, a disk space issue we'd like to monitor on our Hadoop clusters.
I have a cron job running and it is doing just what I want, except that I'd like one of the output lines to show the overall space used. In other words, in bash the very last line of a "du /" command shows the total usage for all the subfolders on the entire disk; I'd like that behavior.
Currently, however, when I run "hadoop dfs -du /" I get only the per-subdirectory figures and not the overall total.
What's the best way to get this?
Thank you so much to all you super Stack Overflow people :).
I just didn't understand the docs correctly! Here is the answer to get the total space used:
$ hadoop dfs -dus /
hdfs://MYURL:MYPORT/ 999
$ array=(`hadoop dfs -dus /`)
$ echo $array
hdfs://MYURL:MYPORT/
$ echo ${array[1]} ${array[0]}
999 hdfs://MYURL:MYPORT/
Reference: File System Shell Guide
http://hadoop.apache.org/docs/r1.2.1/file_system_shell.html#du
Edit: also corrected the order of reporting to match the original.
hadoop fs -du -s -h /path
This will give you the summary.
For the whole cluster you can try :
hdfs dfsadmin -report
You may need to run this as the HDFS user.
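For a cron job, the cluster-wide usage percentage can be pulled out of that report with a simple grep (the first "DFS Used%" line is the cluster summary; later ones are per DataNode):
hdfs dfsadmin -report | grep 'DFS Used%' | head -1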
I have 300,000+ files in an HDFS data directory.
When I run hadoop fs -ls on it, I get an out-of-memory error saying the GC overhead limit has been exceeded. The cluster nodes have 256 GB of RAM each. How do I fix it?
You can make more memory available to the hdfs command by setting HADOOP_CLIENT_OPTS:
HADOOP_CLIENT_OPTS="-Xmx4g" hdfs dfs -ls /
Found here: http://lecluster.delaurent.com/hdfs-ls-and-out-of-memory-gc-overhead-limit/
This fixed the problem for me; I had over 400k files in one directory and needed to delete most but not all of them.
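To make the setting permanent rather than per-invocation, the same variable can be exported from hadoop-env.sh (the location of that file depends on your distribution):
# in hadoop-env.sh
export HADOOP_CLIENT_OPTS="-Xmx4g $HADOOP_CLIENT_OPTS"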
First of all, what are you trying to achieve when you know you have 300,000+ files in one directory? If you want to concatenate them, it is better to arrange them into sub-directories first. One option is to write a Python script that splits the files into multiple directories and then works through them; a rough shell sketch of the same idea follows.
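For illustration only: the /data/landing path and the bucket-by-first-character scheme below are made-up examples, and the loop issues one HDFS call per file, so it is slow but simple.
# list everything once (with a bigger client heap), then move each file into a bucket sub-directory
HADOOP_CLIENT_OPTS="-Xmx4g" hdfs dfs -ls /data/landing | awk 'NR > 1 {print $NF}' | while read -r f; do
    bucket=$(basename "$f" | cut -c1)              # bucket = first character of the file name
    hdfs dfs -mkdir -p "/data/landing/$bucket"
    hdfs dfs -mv "$f" "/data/landing/$bucket/"
done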