How do you retrieve the replication factor info for HDFS files? - hadoop

I have set the replication factor for my file as follows:
hadoop fs -D dfs.replication=5 -copyFromLocal file.txt /user/xxxx
When a NameNode restarts, it makes sure under-replicated blocks are replicated.
Hence the replication info for the file must be stored somewhere (presumably on the NameNode). How can I get that information?

Try the command hadoop fs -stat %r /path/to/file; it should print the replication factor.
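For example, assuming the file from the question ended up at /user/xxxx/file.txt (a hypothetical path), it should print the factor that was set:
hadoop fs -stat %r /user/xxxx/file.txt
5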

You can run the following command to get the replication factor:
hadoop fs -ls /user/xxxx
The second column of the output shows the replication factor of each file; for directories it shows -, as in the example below.
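Illustrative output (names and sizes are made up); here the file has replication factor 5 and the directory shows - in that column:
-rw-r--r--   5 xxxx supergroup       1366 2016-01-01 12:00 /user/xxxx/file.txt
drwxr-xr-x   - xxxx supergroup          0 2016-01-01 12:00 /user/xxxx/somedir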

Apart from Alexey Shestakov's answer, which works perfectly and does exactly what you ask, other ways, mostly found here, include:
hadoop dfs -ls /parent/path
which shows the replication factors of all the /parent/path contents in the second column.
Through Java, you can get this information by using:
FileStatus.getReplication()
You can also see the replication factors of files by using:
hadoop fsck /filename -files -blocks -racks
Finally, I believe this information is also available from the NameNode web UI (I haven't checked that).

We can use the following commands to check the replication factor of a file:
hdfs dfs -ls /user/cloudera/input.txt
or
hdfs dfs -stat %r /user/cloudera/input.txt

In case you need to check the replication factor of an HDFS directory,
hdfs fsck /tmp/data
shows the average replication factor of the /tmp/data HDFS folder.
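For illustration, the fsck summary includes a line like the following (the exact wording can vary between Hadoop versions):
 Average block replication:     2.0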

Related

Why doesn't MapReduce get launched when using the hadoop fs -put command?

Please excuse me for this basic question.
But I wonder why a MapReduce job doesn't get launched when we try to load a file whose size is larger than the block size.
Somewhere I learnt that MapReduce takes care of loading datasets from the local file system (LFS) into HDFS. Then why am I not able to see MapReduce logs on the console when I run the hadoop fs -put command?
Thanks in advance.
You're thinking of hadoop distcp, which will spawn a MapReduce job.
https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html
DistCp Version 2 (distributed copy) is a tool used for large inter/intra cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
hadoop fs -put and hdfs dfs -put are implemented entirely by the HDFS client and don't require MapReduce.
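For reference, a typical DistCp invocation looks like the following (the hostnames, ports, and paths are placeholders); unlike -put, running it launches a MapReduce job:
hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo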

Find out directory size considering replication in HDFS

Is there any way to find out the raw HDFS space consumed by a directory? As far as I know,
hdfs dfs -du -s /dir
shows the size of /dir without taking the replication of the files inside into account.
Run the command hadoop fsck /dir and look for the parameter Average block replication. Multiply this number by the result you get from hdfs dfs -du -s /dir.
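A minimal sketch of that calculation, assuming the fsck summary contains a line like "Average block replication: 2.7" and using /dir as a placeholder for your directory:
logical=$(hdfs dfs -du -s /dir | awk '{print $1}')                                # logical size in bytes
factor=$(hdfs fsck /dir | grep 'Average block replication' | awk '{print $NF}')   # average replication factor
awk -v l="$logical" -v f="$factor" 'BEGIN { printf "raw bytes used: %.0f\n", l * f }'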

How to find out which application consumes the most space on hadoop?

My Hadoop cluster shows it has less than 20% disk space left. I am using this command to see disk space:
hdfs dfsadmin -report
However, I don't know which directories/files take up the most space. Is there a way to find out?
Use the following command:
hdfs dfs -du /
It displays the sizes of the files and directories contained in the given directory, or the length of the file if the path is just a file.
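If you want the biggest consumers listed first, one way (a sketch, assuming the first column of the du output is the size in bytes) is to sort the output numerically:
hdfs dfs -du / | sort -n -r | head -n 10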

How to see entire root hdfs disk usage? (hadoop dfs -du / gets subfolders)

Perhaps unsurprisingly, given how fascinating big data is to the business, we have a disk space issue we'd like to monitor on our Hadoop clusters.
I have a cron job running and it is doing just what I want except that I'd like one of the output lines to show the overall space used. In other words, in bash, the very last line of a "du /" command shows the total usage for all the subfolders on the entire disk. I'd like that behavior.
Currently when I run "hadoop dfs -du /", however, I get only the subdirectory info and not the overall total.
What's the best way to get this?
thank you so much to all you Super Stack Overflow people :).
I just didn't understand the docs correctly! Here is the answer to get the total space used:
$ hadoop dfs -dus /
hdfs://MYSERVER.com:MYPORT/ 999
$ array=(`hadoop dfs -dus /`)
$ echo $array
hdfs://MYURL:MYPORT/
$ echo ${array[1]} ${array[0]}
999 hdfs://MYURL:MYPORT/
Reference; File System Shell Guide
http://hadoop.apache.org/docs/r1.2.1/file_system_shell.html#du
// edit: also corrected the order of reporting (size first, then path) to match plain du output.
hadoop fs -du -s -h /path
This will give you the summary.
For the whole cluster you can try:
hdfs dfsadmin -report
You may need to run this as the HDFS user.

Where are my files (directories) stored when I use hadoop fs -mkdir?

I'm totally new to Hadoop and just finished installing it, which took me 2 days...
I'm now trying out the hadoop dfs commands, but I just can't understand them; although I've been browsing for days, I couldn't find the answer to what I want to know.
All the examples show what the result is supposed to be, without explaining the real structure behind it, so I would be happy if someone could assist me in understanding HDFS.
I've created a directory on the HDFS.
bin/hadoop fs -mkdir input
OK, I shall check on it with the ls command.
bin/hadoop fs -ls
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2012-07-30 11:08 input
OK, no problem, everything seems perfect. BUT where is the HDFS data actually stored?
I thought it would be stored in my datanode directory (/home/hadoop/datastore), which was defined in core-site.xml under hadoop.tmp.dir, but it is not there..
Then I tried viewing it through the web UI and found that "input" was created under "/user/hadoop/" (/user/hadoop/input).
My questions are
(1) What is the datanode directory (hadoop.tmp.dir) used for, since it doesn't store everything I created through the dfs commands?
(2) Everything created with the dfs commands goes under /user/XXX/; how can I change that default?
(3) I can't see anything when I try to access it with a normal Linux command (ls /user/hadoop). Does /user/hadoop only exist logically?
I'm sorry if my questions are stupid...
just a newbie struggling to understand Hadoop better.
Thank you in advance.
HDFS is not a POSIX file system, and you have to use the Hadoop API to read and view it. That's the reason you have to run hadoop fs -ls: you are using the Hadoop API to list files. Data in HDFS is stored in blocks spread across the datanodes, and the metadata about the file system is stored on the NameNode. The files you see in the directory "/home/hadoop/datastore" are the blocks stored on an individual datanode.
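If you want to check where the blocks physically live and which namespace the shell commands talk to, the following sketch prints the relevant configuration values (recent Hadoop versions; older 1.x releases used dfs.data.dir and fs.default.name instead):
hdfs getconf -confKey dfs.datanode.data.dir    # local directories where datanodes keep the block files
hdfs getconf -confKey fs.defaultFS             # the HDFS namespace that paths like /user/hadoop live in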
I think you should explore the file system further in a tutorial, e.g. the Yahoo (YDN) tutorial on HDFS.
