Find out actual disk usage in HDFS - hadoop

Is there a way to find out how much space is consumed in HDFS?
I used
hdfs dfs -df
but it seems to be not relevant cause after deleting huge amount of data with
hdfs dfs -rm -r -skipTrash
the previous comand displays changes not at once but after several minutes (I need up-to-date disk usage info).

To see the space consumed by a particular folder try:
hadoop fs -du -s /folder/path
And if you want to see the usage, space consumed, space available, etc. of the whole HDFS:
hadoop dfsadmin -report

hadoop cli is deprecated. Use hdfs instead.
Folder wise :
sudo -u hdfs hdfs dfs -du -h /
Cluster wise :
sudo -u hdfs hdfs dfsadmin -report

hadoop fs -count -q /path/to/directory

Related

Know the disk space of data nodes in hadoop?

Is there a way or any command using which I can come to know the disk space of each datanode or the total cluster disk space?
I tried the command
dfs -du -h /
but it seems that I do not have permission to execute it for many directories and hence cannot get the actual disk space.
From UI:
http://namenode:50070/dfshealth.html#tab-datanode
---> which will give you all the details about datanode.
From command line:
To get disk space of each datanode:
sudo -u hdfs hdfs dfsadmin -report
---> which will give you the details of entire HDFS and the individual datanodes OR
sudo -u hdfs hdfs dfs -du -h /
---> which will give you the total disk usage of each folder under root / directory
You view the information about all datanodes and their disk usage in the namenode UI's Datanodes tab.
Total cluster disk space can be seen in the summary part of the main page.
http://namenode-ip:50070
If you are using Hadoop cluster configured as simple security, you can execute the below command to get the usage of data nodes.
export HADOOP_USER_NAME=hdfs ;
* Above command can be used to get admin privilege in simple security, If you are using any other user for hdfs admin, replace hdfs with the respective hdfs admin user.
hadoop dfsadmin -report
Alternate option is to login to respective datanode and execute the below unix command to get disk utilization of that server.
df -h
Hadoop 3.2.0:
hduser#hadoop-node1:~$ hdfs dfs -df
Filesystem Size Used Available Use%
hdfs://hadoop-node1:54310 3000457228288 461352007680 821808787456 15%
hduser#hadoop-node1:~$
For human-readable numbers, use:
hduser#hadoop-node1:~$ hdfs dfs -df -h
Filesystem Size Used Available Use%
hdfs://hadoop-node1:54310 2.7 T 429.7 G 765.4 G 15%
hduser#hadoop-node1:~$

Remaining space on cloudera hadoop cluster in human readable format

I am looking for a command that shows the human readable form of the space left on hadoop cluster. I found a command on this forum and the output is in the image.
hdfs dfsadmin -report
[output of dfsadmin command][1]
I heard that there is another command in hortonworks that gives a more human readable output. And that command is hdfs dfsadmin -report
That command doesn't seem to work on cloudera.
Is there any equivalent command in cloudera?
Thanks much
It shouldn't matter whether you're using Cloudera or Hortonworks. If you're using an older version of hadoop the command might be hadoop dfsadmin -report.
Other options you have are:
hadoop fs -df -h
$ hadoop fs -df -h
Filesystem Size Used Available Use%
hdfs://<IP>:8020 21.8 T 244.2 G 21.6 T 1%
Shows the capacity, free and used space of the filesystem. If the filesystem has
multiple partitions, and no path to a particular partition is specified, then
the status of the root partitions will be shown.
hadoop fs -du -h /
$ hadoop fs -du -h /
772 /home
437.3 M /mnt
0 /tmp
229.2 G /user
9.3 G /var
Shows the amount of space, in bytes, used by the files that match the specified file pattern.

How to change replication factor while running copyFromLocal command?

I'm not asking how to set replication factor in hadoop for a folder/file. I know following command works flawlessly for existing files & folders.
hadoop fs -setrep -R -w 3 <folder-path>
I'm asking, how do I set the replication factor, other than default (which is 4 in my scenario), while copying data from local. I'm running following command,
hadoop fs -copyFromLocal <src> <dest>
When I run above commands, it copies the data from src to dest path with replication factor as 4. But I want to make replication factor as 1 while copying data but not after copying is complete. Bascially I want something like this,
hadoop fs -setrep -R 1 -copyFromLocal <src> <dest>
I tried it, but it didn't work. So, can it be done? or I've first copy data with replication factor 4 and then run setrep command?
According to this post and this post (both asking different questions), this command seems to work:
hadoop fs -D dfs.replication=1 -copyFromLocal <src> <dest>
The -D option means "Use value for given property."

I am getting errors while copying files from local to hdfs

I am getting error while copying files from local file system to hdfs,
will you please help me regarding this,
I am using this command :
hadoopd fs -put text.txt file
put and copyFromLocal command helps you to copy data from your local system to HDFS,provided you have the permission to do so.
hadoop fs -put /path/to/textfile /path/to/hdfs
OR
hadoop dfs -put /path/to/textfile /path/to/hdfs
Comming to your error:
You typed the above command as
hadoopd fs
use
hadoop dfs -put /text.txt /file
hadoop dfs -put /path/to/local/file /path/to/hdfs/file
You can use following command
hadoop fs -copyFromLocal text.txt <path_to_hdfs_directory_where_you_want_to_keep_text.txt>
Without knowing the specific error you are getting, it's difficult to answer. The other responders posted the proper syntax. However, it is not uncommon to see permission issues when attempting to copy files to HDFS.
By default the user and group are typically "hdfs" and "supergroup". Your user account likely doesn't belong to "supergroup" and will get permission denied errors. Try running the command as:
sudo -u hdfs hadoop fs -put /path/to/local/file /path/to/hdfs/file
or
sudo -u hdfs hadoop dfs -put /path/to/local/file /path/to/hdfs/file
You can get around having to do this by changing the ownership and permission of the destination directory on HDFS to be more permissive.
"DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hduser/myfile could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock". From this I thinrk your data node is not running/properly. Check that in cluster UI.Then try
hadoop dfs -put /path/file /hdfs/file (hadoop YARN)
hadoop fs -copyFromLocal /path/file /hdfs/file (hadoop1.x)

hadoop fs -put command

I have constructed a single-node Hadoop environment on CentOS using the Cloudera CDH repository. When I want to copy a local file to HDFS, I used the command:
sudo -u hdfs hadoop fs -put /root/MyHadoop/file1.txt /
But,the result depressed me:
put: '/root/MyHadoop/file1.txt': No such file or directory
I'm sure this file does exist.
Please help me,Thanks!
As user hdfs, do you have access rights to /root/ (in your local hdd)?. Usually you don't.
You must copy file1.txt to a place where local hdfs user has read rights before trying to copy it to HDFS.
Try:
cp /root/MyHadoop/file1.txt /tmp
chown hdfs:hdfs /tmp/file1.txt
# older versions of Hadoop
sudo -u hdfs hadoop fs -put /tmp/file1.txt /
# newer versions of Hadoop
sudo -u hdfs hdfs dfs -put /tmp/file1.txt /
--- edit:
Take a look at the cleaner roman-nikitchenko's answer bellow.
I had the same situation and here is my solution:
HADOOP_USER_NAME=hdfs hdfs fs -put /root/MyHadoop/file1.txt /
Advantages:
You don't need sudo.
You don't need actually appropriate local user 'hdfs' at all.
You don't need to copy anything or change permissions because of previous points.
try to create a dir in the HDFS by usig: $ hadoop fs -mkdir your_dir
and then put it into it $ hadoop fs -put /root/MyHadoop/file1.txt your_dir
Here is a command for writing df directly to hdfs file system in python script:
df.write.save('path', format='parquet', mode='append')
mode can be append | overwrite
If you want to put in in hdfs using shell use this command:
hdfs dfs -put /local_file_path_location /hadoop_file_path_location
You can then check on localhost:50070 UI for verification

Resources