Hadoop fs lookup for block size? - hadoop

In Hadoop fs how to lookup the block size for a particular file?
I was primarily interested in a command line, something like:
hadoop fs ... hdfs://fs1.data/...
But it looks like that does not exist. Is there a Java solution?

The fsck commands in the other answers list the blocks and allow you to see the number of blocks. However, to see the actual block size in bytes with no extra cruft do:
hadoop fs -stat %o /filename
Default block size is:
hdfs getconf -confKey dfs.blocksize
Details about units
The units for the block size are not documented in the hadoop fs -stat command, however, looking at the source line and the docs for the method it calls we can see it uses bytes and cannot report block sizes over about 9 exabytes.
The units for the hdfs getconf command may not be bytes. It returns whatever string is being used for dfs.blocksize in the configuration file. (This is seen in the source for the final function and its indirect caller)

Seems hadoop fs doesn't have options to do this.
But hadoop fsck could.
You can try this
$HADOOP_HOME/bin/hadoop fsck /path/to/file -files -blocks

I think it should be doable with:
hadoop fsck /filename -blocks
but I get Connection refused

Try to code below
path=hdfs://a/b/c
size=`hdfs dfs -count ${path} | awk '{print $3}'`
echo $size

For displaying the actual block size of the existing file within HDFS I used:
[pety#master1 ~]$ hdfs dfs -stat %o /tmp/testfile_64
67108864

Related

How can we set the block size in hadoop specific to each file?

for example if my input fie has 500MB i want this to split 250MB each, if my input file is 600MB block size should be 300MB
If you are loading files into hdfs you can put with dfs.blocksize oprtion, you can calculate parameter in a shell depending on size.
hdfs dfs -D dfs.blocksize=268435456 -put myfile /some/hdfs/location
If you already have files in HDFS and want to change it's block size, you need to rewrite it.
(1) move file to tmp location:
hdfs dfs -mv /some/hdfs/location/myfile /tmp
(2) Copy it back with -D dfs.blocksize=268435456
hdfs dfs -D dfs.blocksize=268435456 -cp /tmp/myfile /some/hdfs/location

“hdfs dfs -du” vs “hdfs dfs -count”, differences on expecting same results

Why hdfs dfs -du -s and hdfs dfs -count -v (supposeed same bytes at CONTENT_SIZE field) are (near but) not the same values?
Example
# at user1#borderNode1
hdfs dfs -count -v "hdfs://XYZ/apps/hive/warehouse/p_xx_db.db"
# DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
# 9087 1610048 141186781009632 hdfs://XYZ/apps/hive/warehouse/p_xx_db.db
hdfs dfs -du -s "hdfs://XYZ/apps/hive/warehouse/p_xx_db.db"
#141186781010380 hdfs://XYZ/apps/hive/warehouse/p_xx_db.db
The value 141186781009632 is not 141186781010380.
The difference 141186781010380-141186781009632=748 is less tham the blocksize (134217728 in the example)... so, perhaps, one is exact and other not, but I not see this kind of documentation on Hadoop.
PS: no clues here neither at guide,
hdfs dfs -count: "Count the number of ... bytes under the directory... output column CONTENT_SIZE".
dfs -du: "Displays sizes files... contained in the given directory".
Guide say only that both are number of bytes contained under the directory.

How to know the exact block size of a file on a Hadoop node?

I have a 1 GB file that I've put on HDFS. So, it would be broken into blocks and sent to different nodes in the cluster.
Is there any command to identify the exact size of the block of the file on a particular node?
Thanks.
You should use hdfs fsck command:
hdfs fsck /tmp/test.txt -files -blocks
This command will print information about all the blocks of which file consists:
/tmp/test.tar.gz 151937000 bytes, 2 block(s): OK
0. BP-739546456-192.168.20.1-1455713910789:blk_1073742021_1197 len=134217728 Live_repl=3
1. BP-739546456-192.168.20.1-1455713910789:blk_1073742022_1198 len=17719272 Live_repl=3
As you can see here is shown (len field in every row) actual used capacities of blocks.
Also there are many another useful features of hdfs fsck which you can see at the official Hadoop documentation page.
You can try:
hdfs getconf -confKey dfs.blocksize
I do not have reputation to comment.
Have a look at documentation page to set various properties, which covers
dfs.blocksize
Apart from configuration change, you can view actual size of file with
hadoop fs -ls fileNameWithPath
e.g.
hadoop fs -ls /user/edureka
output:
-rwxrwxrwx 1 edureka supergroup 391355 2014-09-30 12:29 /user/edureka/cust

Checksum verification in Hadoop

Do we need to verify checksum after we move files to Hadoop (HDFS) from a Linux server through a Webhdfs ?
I would like to make sure the files on the HDFS have no corruption after they are copied. But is checking checksum necessary?
I read client does checksum before data is written to HDFS
Can somebody help me to understand how can I make sure that source file on Linux system is same as ingested file on Hdfs using webhdfs.
If your goal is to compare two files residing on HDFS, I would not use "hdfs dfs -checksum URI" as in my case it generates different checksums for files with identical content.
In the below example I am comparing two files with the same content in different locations:
Old-school md5sum method returns the same checksum:
$ hdfs dfs -cat /project1/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a -
$ hdfs dfs -cat /project2/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a -
However, checksum generated on the HDFS is different for files with the same content:
$ hdfs dfs -checksum /project1/file.txt
0000020000000000000000003e50be59553b2ddaf401c575f8df6914
$ hdfs dfs -checksum /project2/file.txt
0000020000000000000000001952d653ccba138f0c4cd4209fbf8e2e
A bit puzzling as I would expect identical checksum to be generated against the identical content.
Checksum for a file can be calculated using hadoop fs command.
Usage: hadoop fs -checksum URI
Returns the checksum information of a file.
Example:
hadoop fs -checksum hdfs://nn1.example.com/file1
hadoop fs -checksum file:///path/in/linux/file1
Refer : Hadoop documentation for more details
So if you want to comapre file1 in both linux and hdfs you can use above utility.
I wrote a library with which you can calculate the checksum of local file, just the way hadoop does it on hdfs files.
So, you can compare the checksum to cross check.
https://github.com/srch07/HDFSChecksumForLocalfile
If you are doing this check via API
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
Option 1: for the value b9fdea463b1ce46fabc2958fc5f7644a
val md5:String = MD5Hash.digest(FileSystem.get(hadoopConfiguration).open(new Path("/project1/file.txt"))).toString
Option 2: for the value 3e50be59553b2ddaf401c575f8df6914
val md5:String = FileSystem.get(hadoopConfiguration).getFileChecksum(new Path("/project1/file.txt"))).toString.split(":")(0)
It does crc check. For each and everyfile it create .crc to make sure there is no corruption.

Checking filesize and its distribution in HDFS

Is it possible to know filesize in blocks and its distribution over DataNodes in Hadoop?
Currently I am using:
frolo#A11:~/hadoop> $HADOOP_HOME/bin/hadoop dfs -stat "%b %o %r %n" /user/frolo/input/rmat-*
318339 67108864 1 rmat-10.0
392835957 67108864 1 rmat-20.0
Which does not show actual number of blocks created after uploading file to HDFS. And I dont know any way how to find out its distribution.
Thanks,
Alex
The %r in your stat command shows the replication factor of the queried file. If this is 1, it means there will only be only a single replica across the cluster for blocks belonging to this file. The hadoop fs -ls output also shows this value for listed files as one of its numeric columns, as replication factor is a per file FS attribute.
If you are looking to find where the blocks reside instead, you are looking for hdfs fsck (or hadoop fsck if using a dated release) instead. The below, for example, will let you see the list of block IDs and their respective set of resident locations, for any file:
hdfs fsck /user/frolo/input/rmat-10.0 -files -blocks -locations

Resources