Is there a way, or any command, I can use to find out the disk space of each datanode, or the total cluster disk space?
I tried the command
hdfs dfs -du -h /
but it seems that I do not have permission to run it on many directories, so I cannot get the actual disk space.
From the UI:
http://namenode:50070/dfshealth.html#tab-datanode
---> this gives you all the details about each datanode.
From the command line:
To get the disk space of each datanode:
sudo -u hdfs hdfs dfsadmin -report
---> this gives you the details of the entire HDFS and of each individual datanode, OR
sudo -u hdfs hdfs dfs -du -h /
---> this gives you the total disk usage of each folder under the root (/) directory.
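If you just want a quick per-datanode summary out of that report, a rough one-liner (assuming the usual report layout, which prints Name:, DFS Used%, and DFS Remaining: lines for every node) would be:
sudo -u hdfs hdfs dfsadmin -report | grep -E 'Name:|DFS Used%|DFS Remaining:'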
You can view the information about all datanodes and their disk usage in the Datanodes tab of the namenode UI.
The total cluster disk space can be seen in the Summary section of the main page:
http://namenode-ip:50070
If your Hadoop cluster is configured with simple security, you can run the commands below to get the usage of the datanodes.
export HADOOP_USER_NAME=hdfs
* The above command gives you admin privileges under simple security. If a different user is the HDFS admin, replace hdfs with that admin user.
hadoop dfsadmin -report
An alternative is to log in to each datanode and run the following Unix command to get that server's disk utilization:
df -h
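If passwordless SSH to the datanodes is set up, a rough sketch to collect df -h from all of them in one go (datanodes.txt here is a hypothetical file listing one datanode hostname per line):
for host in $(cat datanodes.txt); do echo "== $host =="; ssh "$host" df -h; done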
Hadoop 3.2.0:
hduser@hadoop-node1:~$ hdfs dfs -df
Filesystem Size Used Available Use%
hdfs://hadoop-node1:54310 3000457228288 461352007680 821808787456 15%
hduser@hadoop-node1:~$
For human-readable numbers, use:
hduser@hadoop-node1:~$ hdfs dfs -df -h
Filesystem Size Used Available Use%
hdfs://hadoop-node1:54310 2.7 T 429.7 G 765.4 G 15%
hduser@hadoop-node1:~$
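Note that in Hadoop 3.x the namenode web UI default port changed from 50070 to 9870, so the datanode tab mentioned earlier would typically be at:
http://namenode:9870/dfshealth.html#tab-datanode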
Related
I am running a Spark SQL job on a small dataset (25 GB) and I always end up filling up the disk and eventually crashing my executors.
$ hdfs dfs -df -h
Filesystem Size Used Available Use%
hdfs://x.x.x.x:8020 138.9 G 3.9 G 14.2 G 3%
$ hdfs dfs -du -h
2.5 G .sparkStaging
0 archive
477.3 M checkpoint
When this happens, I have to leave safe mode with hdfs dfsadmin -safemode leave.
Looking at the Spark job itself, it is obviously a problem of shuffling or caching dataframes. However, any idea why df reports such irregular Used/Available sizes? And why du does not list the files?
I understand that it's related to the "non DFS usage" I can see in the namenode overview. But why does Spark use up so much "hidden" space, to the point that it makes my job crash?
Configured Capacity: 74587291648 (69.46 GB)
DFS Used: 476200960 (454.14 MB)
Non DFS Used: 67648394610 (63.00 GB)
DFS Remaining: 6462696078 (6.02 GB)
I just installed HDFS and launched the service, and there is already more than 800 MB of used space.
What does it represent?
$ hdfs dfs -df -h
Filesystem Size Used Available Use%
hdfs://quickstart.cloudera:8020 54.5 G 823.7 M 43.4 G 1%
I am looking for a command that shows, in human-readable form, the space left on a Hadoop cluster. I found a command on this forum:
hdfs dfsadmin -report
I heard that there is another command in Hortonworks that gives a more human-readable output, and that command is hdfs dfsadmin -report.
That command doesn't seem to work on Cloudera.
Is there an equivalent command in Cloudera?
Thanks much
It shouldn't matter whether you're using Cloudera or Hortonworks. If you're using an older version of Hadoop, the command might be hadoop dfsadmin -report.
Other options you have are:
hadoop fs -df -h
$ hadoop fs -df -h
Filesystem Size Used Available Use%
hdfs://<IP>:8020 21.8 T 244.2 G 21.6 T 1%
Shows the capacity, free and used space of the filesystem. If the filesystem has multiple partitions, and no path to a particular partition is specified, then the status of the root partitions will be shown.
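You can also point -df at a specific path (the path below is only an example); it reports the stats of the filesystem that holds that path:
hadoop fs -df -h /user/hive/warehouse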
hadoop fs -du -h /
$ hadoop fs -du -h /
772 /home
437.3 M /mnt
0 /tmp
229.2 G /user
9.3 G /var
Shows the amount of space, in bytes, used by the files that match the specified file pattern.
Our current HDFS cluster has a replication factor of 1. But to improve performance and reliability (against node failure), we want to increase the replication factor of the Hive intermediate files (hive.exec.scratchdir) alone to 5. Is it possible to implement that?
Regards,
Selva
See if -setrep helps you.
setrep
Usage:
hadoop fs -setrep [-R] [-w] <numReplicas> <path>
Changes the replication factor of a file. If path is a directory then the command recursively changes the replication factor of all files under the directory tree rooted at path.
Options:
The -w flag requests that the command wait for the replication to complete. This can potentially take a very long time.
The -R flag is accepted for backwards compatibility. It has no effect.
Example:
hadoop fs -setrep -w 3 /user/hadoop/dir1
hadoop fs -setrep -R -w 100 /path/to/hive/warehouse
Reference: -setrep
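Also keep in mind that setrep only affects files that already exist. New scratch files that Hive writes afterwards follow the client-side dfs.replication setting, so (as a hedged suggestion) you may additionally want to raise it in the Hive session, for example:
SET dfs.replication=5;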
Is there a way to find out how much space is consumed in HDFS?
I used
hdfs dfs -df
but it does not seem reliable, because after deleting a huge amount of data with
hdfs dfs -rm -r -skipTrash
the previous command does not show the change immediately, but only after several minutes (and I need up-to-date disk usage info).
To see the space consumed by a particular folder try:
hadoop fs -du -s /folder/path
And if you want to see the usage, space consumed, space available, etc. of the whole HDFS:
hadoop dfsadmin -report
Note: hadoop dfsadmin is deprecated; use hdfs dfsadmin instead.
Folder-wise:
sudo -u hdfs hdfs dfs -du -h /
Cluster-wise:
sudo -u hdfs hdfs dfsadmin -report
hadoop fs -count -q /path/to/directory
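If I remember the layout correctly, -count -q prints the columns QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE and PATHNAME, and on reasonably recent Hadoop versions you can add -h for human-readable sizes:
hadoop fs -count -q -h /path/to/directory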