My Hadoop cluster shows it has less than 20% disk space left. I am using this command to see disk space:
hdfs dfsadmin -report
However, I don't know which directories/files take up the most space. Is there a way to find out?
Use the following command:
hdfs dfs -du /
It displays the sizes of the files and directories contained in the given directory, or the length of the file if the path is just a file.
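As a side note that is not part of the original answer: to see at a glance which top-level entries are the biggest, the byte counts from -du can be sorted, for example:
hdfs dfs -du / | sort -rn | head -10    # largest entries first; sizes are in bytes (hdfs dfs -du -h / gives human-readable sizes instead)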
I have a directory which contains multiple folders, with N files in each folder; a single file can be around 15 GB. I don't know the best way to copy/move the files from local to HDFS.
There are many (traditional) ways to do this, such as:
hdfs dfs -put /path/to/localdir/ hdfs://path/to/hdfsdir
hdfs dfs -copyFromLocal /path/to/localdir/ hdfs://path/to/hdfsdir
hdfs dfs -moveFromLocal /path/to/localdir/ hdfs://path/to/hdfsdir
hadoop distcp file:///path/to/localdir/ hdfs://namenode:port/path/to/hdfsdir
Options 1 and 2 are the same in your case; there will be no difference in copy time.
Option 3 might take a little more time, as it copies the data to HDFS (same as -put) and then deletes the file from the local filesystem.
Option 4 is a tricky one. It is designed for large inter/intra-cluster copying, but you can use the same command for local files as well by giving the local path with the "file://" prefix. This is not an optimal use of distcp: the tool is designed to work in parallel (using MapReduce), and since the source is on a local filesystem it cannot exploit that strength. (You could try creating a mount on the cluster nodes, which might improve distcp's performance.)
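One more idea that is not in the answer above: if the local directory holds several independent sub-folders, the copies can be run in parallel from a plain shell loop. A rough sketch, reusing the placeholder paths from the commands above:
# start one -put per local sub-folder in the background, then wait for all of them to finish
for d in /path/to/localdir/*/ ; do
    hdfs dfs -put "$d" /path/to/hdfsdir/ &
done
wait
Too many parallel streams can saturate the network or the local disks, so the degree of parallelism is worth tuning.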
Is there any way to find out the raw HDFS space consumed by a directory? As far as I know,
hdfs dfs -du -s /dir
shows the size of /dir without taking the replication of its files into account.
Run the command hadoop fsck /dir and look for the parameter Average block replication. Multiply this number by the result you get from hdfs dfs -du -s /dir.
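A rough shell sketch of that calculation (not part of the original answer; it assumes the fsck summary contains an "Average block replication" line and that bc is installed):
size=$(hdfs dfs -du -s /dir | awk '{print $1}')                                   # logical size in bytes
repl=$(hadoop fsck /dir | grep 'Average block replication' | awk '{print $NF}')   # e.g. 3.0
echo "$size * $repl" | bc                                                         # approximate raw bytes on disk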
I have set the replication factor for my file as follows:
hadoop fs -D dfs.replication=5 -copyFromLocal file.txt /user/xxxx
When a NameNode restarts, it makes sure under-replicated blocks are replicated.
Hence the replication info for the file must be stored somewhere (possibly in the NameNode). How can I get that information?
Try the command hadoop fs -stat %r /path/to/file; it should print the replication factor.
You can run the following command to get the replication factor:
hadoop fs -ls /user/xxxx
The second column in the output shows the replication factor for a file; for a directory it shows -.
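If you only want the replication factor and the path, the relevant columns can be extracted, for example (this assumes the standard -ls column layout and paths without spaces):
hadoop fs -ls /user/xxxx | awk 'NR > 1 {print $2, $NF}'   # skip the "Found N items" header; print replication factor and path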
Apart from Alexey Shestakov's answer, which works perfectly and does exactly what you ask, other ways include:
hadoop dfs -ls /parent/path
which shows the replication factors of all the /parent/path contents in the second column.
Through Java, you can get this information by using:
FileStatus.getReplication()
You can also see the replication factors of files by using:
hadoop fsck /filename -files -blocks -racks
Finally, I believe this information is also available from the NameNode web UI (I haven't checked that).
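As an aside not covered above, the fsck output itself can be filtered for the per-block replication; the exact field name varies between Hadoop versions (repl= vs. Live_repl=), so a case-insensitive grep catches both:
hadoop fsck /filename -files -blocks -racks | grep -i 'repl='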
We can use the following commands to check the replication factor of a file:
hdfs dfs -ls /user/cloudera/input.txt
or
hdfs dfs -stat %r /user/cloudera/input.txt
If you need to check the replication factor of an HDFS directory,
hdfs fsck /tmp/data
shows the average replication factor of the /tmp/data HDFS folder.
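The fsck report is fairly verbose, so it can help to filter for just that line:
hdfs fsck /tmp/data | grep -i 'Average block replication'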
We have, perhaps unsurprisingly given how fascinating big data is to the business, a disk space issue we'd like to monitor on our Hadoop clusters.
I have a cron job running and it is doing just what I want, except that I'd like one of the output lines to show the overall space used. In other words, in bash the very last line of a "du /" command shows the total usage for all the subfolders on the entire disk; I'd like that behavior.
Currently, however, when I run "hadoop dfs -du /" I get only the per-subdirectory figures and not the overall total.
What's the best way to get this?
Thank you so much to all you super Stack Overflow people :).
I just didn't understand the docs correctly! Here is the answer to get the total space used:
$ hadoop dfs -dus /
hdfs://MYURL:MYPORT/ 999
$ array=(`hadoop dfs -dus /`)
$ echo $array
hdfs://MYURL:MYPORT/
$ echo ${array[1]} ${array[0]}
999 hdfs://MYURL:MYPORT/
Reference: File System Shell Guide
http://hadoop.apache.org/docs/r1.2.1/file_system_shell.html#du
Edit: also corrected the order of reporting to match the original.
hadoop fs -du -s -h /path
This will give you the summary.
For the whole cluster you can try :
hdfs dfsadmin -report
You may need to run this as the HDFS user.
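For a cron job, the cluster-wide usage percentage can be pulled out of that report with a simple grep (the first "DFS Used%" line is the cluster summary; later ones are per DataNode):
hdfs dfsadmin -report | grep 'DFS Used%' | head -1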
I have 300,000+ files in an HDFS data directory.
When I run hadoop fs -ls on it, I get an out-of-memory error saying the GC overhead limit has been exceeded. The cluster nodes have 256 GB of RAM each. How do I fix it?
You can make more memory available to the hdfs command by setting HADOOP_CLIENT_OPTS:
HADOOP_CLIENT_OPTS="-Xmx4g" hdfs dfs -ls /
Found here: http://lecluster.delaurent.com/hdfs-ls-and-out-of-memory-gc-overhead-limit/
This fixed the problem for me; I had over 400k files in one directory and needed to delete most but not all of them.
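To make the setting permanent rather than per-invocation, the same variable can be exported from hadoop-env.sh (the location of that file depends on your distribution):
# in hadoop-env.sh
export HADOOP_CLIENT_OPTS="-Xmx4g $HADOOP_CLIENT_OPTS"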
First of all, what are you trying to achieve when you know you have 300,000+ files in one directory? If you want to concatenate them, it is better to arrange them into sub-directories first. One option is to write a Python script that splits the files into multiple directories and then works through them; a rough shell sketch of the same idea follows.
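For illustration only: the /data/landing path and the bucket-by-first-character scheme below are made-up examples, and the loop issues one HDFS call per file, so it is slow but simple.
# list everything once (with a bigger client heap), then move each file into a bucket sub-directory
HADOOP_CLIENT_OPTS="-Xmx4g" hdfs dfs -ls /data/landing | awk 'NR > 1 {print $NF}' | while read -r f; do
    bucket=$(basename "$f" | cut -c1)              # bucket = first character of the file name
    hdfs dfs -mkdir -p "/data/landing/$bucket"
    hdfs dfs -mv "$f" "/data/landing/$bucket/"
done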