If I change the size of HDFS block does the size of splits change too? - hadoop

If I change the HDFS block size when uploading a file:
hadoop fs -D dfs.block.size=134217728 -put local_name remote_location
Does the size of splits (mapreduce.input.fileinputformat.split.minsize) change too?

-D dfs.block.size has no bearing on the split minsize property. It only alters the block size of the file being written by that command.
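Although the minsize property itself is untouched, the effective split size can still follow the block size. As far as I know, FileInputFormat computes it as max(minSize, min(maxSize, blockSize)). A minimal sketch of that rule, where split_size is a hypothetical helper name and all values are in bytes:

```shell
# Sketch of the split-size rule used by FileInputFormat (assumed here):
#   splitSize = max(minSize, min(maxSize, blockSize))
# split_size is a hypothetical helper; arguments are minSize maxSize blockSize.
split_size() {
    min_size=$1
    max_size=$2
    block_size=$3
    # min(maxSize, blockSize)
    if [ "$max_size" -lt "$block_size" ]; then
        size=$max_size
    else
        size=$block_size
    fi
    # max(minSize, ...)
    if [ "$min_size" -gt "$size" ]; then
        size=$min_size
    fi
    echo "$size"
}

# With a tiny minsize and an effectively unbounded maxsize (the usual
# defaults), the split size equals the block size of the file.
split_size 1 9223372036854775807 134217728
```

So with default settings a file written with a 128 MB block size gets 128 MB splits, unless minsize or maxsize is set to override it.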

Related

How can we set the block size in hadoop specific to each file?

For example, if my input file is 500 MB, I want it split into 250 MB blocks; if my input file is 600 MB, the block size should be 300 MB.
If you are loading files into HDFS, you can put them with the dfs.blocksize option, and you can calculate the value in a shell script depending on the file size.
hdfs dfs -D dfs.blocksize=268435456 -put myfile /some/hdfs/location
If you already have a file in HDFS and want to change its block size, you need to rewrite it.
(1) Move the file to a tmp location:
hdfs dfs -mv /some/hdfs/location/myfile /tmp
(2) Copy it back with -D dfs.blocksize=268435456:
hdfs dfs -D dfs.blocksize=268435456 -cp /tmp/myfile /some/hdfs/location
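The "calculate the parameter in a shell" part could look roughly like this. It is only a sketch: half_file_block_size is a hypothetical helper, and it assumes the block size must be a multiple of 512 (the default bytes-per-checksum) and at least 1048576 (the default NameNode minimum block size). Check both against your cluster configuration.

```shell
# half_file_block_size is a hypothetical helper: given a file size in bytes,
# print a block size that stores the file in two blocks.
# Assumptions: multiple of 512 bytes, minimum of 1048576 bytes.
half_file_block_size() {
    file_bytes=$1
    half=$(( (file_bytes + 1) / 2 ))        # ceil(size / 2)
    block=$(( (half + 511) / 512 * 512 ))   # round up to a multiple of 512
    if [ "$block" -lt 1048576 ]; then
        block=1048576
    fi
    echo "$block"
}

# Possible usage (not executed here; myfile and the target path come from
# the answer above):
#   size=$(stat -c %s myfile)   # GNU stat; BSD/macOS would use: stat -f %z myfile
#   hdfs dfs -D dfs.blocksize="$(half_file_block_size "$size")" -put myfile /some/hdfs/location
```

For a 500 MB file (524288000 bytes) this gives 262144000, i.e. 250 MB per block.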

Hadoop: does the returned file size include the replication factor?

I have a file stored on HDFS and I need to get its size. I used the following line at the command prompt to get the file size:
hadoop fs -du -s train.csv | awk '{s+=$1} END {printf s}'
I know that Hadoop stores duplicates of files, as determined by the replication factor. So when I run the line above, is the returned size the file size times the replication factor, or just the file size?
From Hadoop documentation:
The du returns three columns with the following format:
size disk_space_consumed_with_all_replicas full_path_name
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
As you can see, the first column is the size of the file, while the second column is the space consumed including all replicas.
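To see both columns side by side, something like the following could be used. This is a sketch that assumes the three-column du format described above (older releases may print only two columns); du_replication is a hypothetical helper name and the sample numbers are made up.

```shell
# du_replication is a hypothetical helper: reads three-column
# `hadoop fs -du -s` output (size, space with replicas, path) on stdin and
# prints the size, the space with replicas, and the ratio between them.
du_replication() {
    awk '{ printf "%s %s %.1f\n", $1, $2, ($1 > 0 ? $2 / $1 : 0) }'
}

# Sample line with made-up numbers (a 1000-byte file stored three times).
printf '1000 3000 train.csv\n' | du_replication

# Possible real usage (not executed here):
#   hadoop fs -du -s train.csv | du_replication
```

The ratio in the last column should match the replication factor of the file.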

How do I determine the size of my HBase tables? Is there any command to do so?

I have multiple tables in HBase that I would like to copy onto my file system. Some tables exceed 100 GB. However, I only have 55 GB of free space left in my local file system. Therefore, I would like to know the size of my HBase tables so that I can export only the small ones. Any suggestions are appreciated.
Thanks,
gautham
Try:
hdfs dfs -du -h /hbase/data/default/ (or /hbase/, depending on the HBase version you use)
This shows how much space the files of your tables use.
Hope that helps.
For 0.98+, try hadoop fs -du -s -h $hbase_root_dir/data/data/$schema_name/ (or /hbase/ for 0.94).
You can find hbase_root_dir in the hbase-site.xml file of your cluster.
With -s, the above command gives a summary of the disk space used under that directory; omit -s to list each table separately.
Use du:
Usage: hdfs dfs -du [-s] [-h] URI [URI …]
Displays sizes of files and directories contained in the given directory, or the length of a file in case it's just a file.
Options:
The -s option will result in an aggregate summary of file lengths being displayed, rather than the individual files.
The -h option will format file sizes in a "human-readable" fashion (e.g. 64.0m instead of 67108864)
Example:
hdfs dfs -du -h /hbase/data/default
output for me:
1.2 M /hbase/data/default/kylin_metadata
14.0 K /hbase/data/default/kylin_metadata_acl
636 /hbase/data/default/kylin_metadata_user
5.6 K /hbase/data/default/test
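Since the goal in the question is to export only tables that fit into the free local space, the du output can be filtered by size. A rough sketch, assuming du is run without -h so the first column is a plain byte count and the last column is the path; tables_under is a hypothetical helper name and the sample sizes are made up:

```shell
# tables_under is a hypothetical helper: reads `hdfs dfs -du <dir>` output
# (without -h) on stdin and prints the paths whose size in bytes is at most
# the limit given as the first argument.
tables_under() {
    awk -v limit="$1" '$1 + 0 <= limit + 0 { print $NF }'
}

# Sample input with made-up byte counts, limit of 1000 bytes.
printf '%s\n' \
    '1258291 /hbase/data/default/kylin_metadata' \
    '636 /hbase/data/default/kylin_metadata_user' |
    tables_under 1000

# Possible real usage with a 55 GB limit (not executed here):
#   hdfs dfs -du /hbase/data/default | tables_under 59055800320
```

Using $NF for the path keeps the filter working whether du prints two or three columns.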

Checking filesize and its distribution in HDFS

Is it possible to know filesize in blocks and its distribution over DataNodes in Hadoop?
Currently I am using:
frolo@A11:~/hadoop> $HADOOP_HOME/bin/hadoop dfs -stat "%b %o %r %n" /user/frolo/input/rmat-*
318339 67108864 1 rmat-10.0
392835957 67108864 1 rmat-20.0
This does not show the actual number of blocks created after uploading the file to HDFS, and I don't know of any way to find out its distribution.
Thanks,
Alex
The %r in your stat command shows the replication factor of the queried file. If this is 1, there is only a single replica across the cluster for each block of this file. The hadoop fs -ls output also shows this value as one of its numeric columns, since the replication factor is a per-file FS attribute.
If you want to find where the blocks reside, use hdfs fsck (or hadoop fsck on an older release). The command below, for example, lists the block IDs of a file and the locations where each block resides:
hdfs fsck /user/frolo/input/rmat-10.0 -files -blocks -locations
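The number of blocks can also be estimated from the stat output in the question, since %b is the file size and %o the block size, both in bytes. A small sketch (block_count is a hypothetical helper name):

```shell
# block_count is a hypothetical helper: ceil(file_bytes / block_bytes).
# An empty file yields 0.
block_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# The two files from the question (%b and %o columns of the stat output).
block_count 318339 67108864      # rmat-10.0
block_count 392835957 67108864   # rmat-20.0
```

This prints 1 and 6 for the two files. fsck remains the authoritative source, since it reports the blocks that actually exist.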

Hadoop fs lookup for block size?

In Hadoop fs, how do I look up the block size for a particular file?
I was primarily interested in a command line, something like:
hadoop fs ... hdfs://fs1.data/...
But it looks like that does not exist. Is there a Java solution?
The fsck commands in the other answers list the blocks and allow you to see the number of blocks. However, to see the actual block size in bytes with no extra cruft do:
hadoop fs -stat %o /filename
Default block size is:
hdfs getconf -confKey dfs.blocksize
Details about units
The units for the block size are not documented for the hadoop fs -stat command; however, looking at the source line and the docs for the method it calls, we can see it uses bytes and cannot report block sizes over about 9 exabytes.
The units for the hdfs getconf command may not be bytes. It returns whatever string is being used for dfs.blocksize in the configuration file. (This is seen in the source for the final function and its indirect caller)
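If the configured value carries a suffix, it can be normalized to bytes before comparing it with the stat output. A sketch, assuming the binary suffixes that dfs.blocksize accepts (k, m, g, t here, case-insensitive; larger suffixes are left out); blocksize_to_bytes is a hypothetical helper name:

```shell
# blocksize_to_bytes is a hypothetical helper: converts a dfs.blocksize
# string such as 128m or 134217728 into a plain byte count.
# Assumption: suffixes are binary multiples (k = 1024, m = 1024^2, ...).
blocksize_to_bytes() {
    value=$1
    case "$value" in
        *[kK]) mult=1024 ;;
        *[mM]) mult=$((1024 * 1024)) ;;
        *[gG]) mult=$((1024 * 1024 * 1024)) ;;
        *[tT]) mult=$((1024 * 1024 * 1024 * 1024)) ;;
        *) echo "$value"; return 0 ;;   # no suffix: already bytes
    esac
    # Strip the one-character suffix and scale.
    echo $(( ${value%?} * mult ))
}

# Possible usage (not executed here):
#   blocksize_to_bytes "$(hdfs getconf -confKey dfs.blocksize)"
```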
It seems hadoop fs doesn't have an option to do this,
but hadoop fsck does.
You can try this:
$HADOOP_HOME/bin/hadoop fsck /path/to/file -files -blocks
I think it should be doable with:
hadoop fsck /filename -blocks
but I get Connection refused
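Try the code below:
path=hdfs://a/b/c
# The third column of -count is CONTENT_SIZE (total bytes under the path), not the block size
size=$(hdfs dfs -count "${path}" | awk '{print $3}')
echo "$size"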
To display the actual block size of an existing file within HDFS, I used:
[pety@master1 ~]$ hdfs dfs -stat %o /tmp/testfile_64
67108864
