Hadoop: does the returned file size include the replication factor?

I have a file stored on HDFS and I need to get its size. I used the following line at the command prompt to get the file size:
hadoop fs -du -s train.csv | awk '{s+=$1} END {print s}'
I know that Hadoop stores replicas of files according to the replication factor. So when I run the line above, is the returned size the file size times the replication factor, or just the file size?

From Hadoop documentation:
The du returns three columns with the following format:
size disk_space_consumed_with_all_replicas full_path_name
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
As you can see, the first column is the size of the file, while the second column is the space consumed including all replicas.
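For illustration, assuming a replication factor of 3 and a Hadoop release whose du prints both columns, a hypothetical run could look like this (the path and numbers are made up; the second number is three times the first):
hadoop fs -du /user/me/train.csv
145674240  437022720  /user/me/train.csv
The first number is the logical file size, which is what you want.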

Related

How do you set the row group size of files in hdfs?

I am running some experiments on block size (dfs.block.size) and row group size (parquet.block.size) in hdfs.
I have a large set of data in hdfs, and I want to replicate the data with various block sizes and row group sizes for testing. I'm able to copy the data with a different block size using:
hdfs dfs -D dfs.block.size=67108864 -D parquet.block.size=67108864 -cp /new_sample_parquet /new_sample_parquet_64M
But only the dfs.block.size gets changed. I am verifying with hdfs dfs -stat for the block size, and parquet-tools meta for the row group size. In fact, if I replace parquet.block.size with blah.blah.blah it has the same effect. I even went into spark-shell and set the parquet.block.size property manually using
sc.hadoopConfiguration.setInt("parquet.block.size", 67108864).
I am using Hadoop 3.1.0. I got the property name parquet.block.size from here.
Here are the first 10 rows of the output of my attempt:
row group 1: RC:4140100 TS:150147503 OFFSET:4
row group 2: RC:3520100 TS:158294646 OFFSET:59176084
row group 3: RC:880100 TS:80122359 OFFSET:119985867
row group 4: RC:583579 TS:197303521 OFFSET:149394540
row group 5: RC:585594 TS:194850776 OFFSET:213638039
row group 6: RC:2620100 TS:130170698 OFFSET:277223867
row group 7: RC:2750100 TS:136761819 OFFSET:332088066
row group 8: RC:1790100 TS:86766854 OFFSET:389772650
row group 9: RC:2620100 TS:125876377 OFFSET:428147454
row group 10: RC:1700100 TS:83791047 OFFSET:483600973
As you can see, the TS (total size) is way larger than 64 MB (67108864 bytes).
My current theory:
I am doing this in spark-shell:
sc.hadoopConfiguration.setInt("parquet.block.size", 67108864)
val a = spark.read.parquet("my_sample_data")
a.rdd.getNumPartitions // 1034
val s = a.coalesce(27)
s.write.format("parquet").mode("Overwrite").options(Map("dfs.block.size" -> "67108864")).save("/my_new_sample_data")
So perhaps it's because my input data already has 1034 partitions. I'm really not sure. My data has about 118 columns per row.
The parquet.block.size property only affects Parquet writers. The hdfs dfs -cp command, on the other hand, copies files regardless of their contents, so the parquet.block.size property is ignored by hdfs dfs -cp.
Imagine that you have an application that takes screenshots in either JPG or PNG format, depending on a config file. You make a copy of those screenshots with the cp command. Naturally, even if you change the desired image format in the config file, the cp command will always create output files in the image format of the original files, regardless of the config file. The config file is only used by the screenshot application and not by cp. This is how the parquet.block.size property works as well.
What you can do to change the block size is to rewrite the file. You mentioned that you have spark-shell. Use that to rewrite the Parquet file by issuing
sc.hadoopConfiguration.setInt("parquet.block.size", 67108864)
var df = spark.read.parquet("/path/to/input.parquet")
df.write.parquet("/path/to/output")
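Alternatively, the same property can be passed per write through the DataFrameWriter options instead of the global Hadoop configuration; a sketch with placeholder paths:
var df = spark.read.parquet("/path/to/input.parquet")
df.write.option("parquet.block.size", "67108864").parquet("/path/to/output")
This is the same mechanism as the df.options(Map(...)) call shown in the transcript below.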
Update: Since you mentioned in the comments below that it does not work for you, I ran an experiment and am posting the session transcript below:
$ spark-shell
scala> sc.hadoopConfiguration.setInt("parquet.block.size", 200000)
scala> var df = spark.read.parquet("/tmp/infile.parquet")
df: org.apache.spark.sql.DataFrame = [field0000: binary, field0001: binary ... 78 more fields]
scala> df.write.parquet("/tmp/200K")
scala> df.write.format("parquet").mode("Overwrite").options(Map("parquet.block.size" -> "300000")).save("/tmp/300K")
scala> :quit
$ hadoop fs -copyToLocal /tmp/{200K,300K} /tmp
$ parquet-tools meta /tmp/infile.parquet | grep "row group" | head -n 3
row group 1: RC:4291 TS:5004800 OFFSET:4
row group 2: RC:3854 TS:4499360 OFFSET:5004804
row group 3: RC:4293 TS:5004640 OFFSET:10000000
$ parquet-tools meta /tmp/200K/part-00000-* | grep "row group" | head -n 3
row group 1: RC:169 TS:202080 OFFSET:4
row group 2: RC:168 TS:201760 OFFSET:190164
row group 3: RC:169 TS:203680 OFFSET:380324
$ parquet-tools meta /tmp/300K/part-00000-* | grep "row group" | head -n 3
row group 1: RC:254 TS:302720 OFFSET:4
row group 2: RC:255 TS:303280 OFFSET:284004
row group 3: RC:263 TS:303200 OFFSET:568884
By looking at the TS values you can see that the input file had a row group size of 4.5-5M and the output files have row group sizes of 200K and 300K, respectively. This shows that the value set using sc.hadoopConfiguration becomes the "default", while the other method you mentioned in a comment below involving df.options overrides this default.
Update 2: Now that you have posted your output, I can see what is going on. In your case, compression is taking place, increasing the amount of data that will fit in row groups. The row group size applies to the compressed data, but TS shows the size of uncompressed data. However, you can deduce the size of row groups by subtracting their starting offsets. For example, the compressed size of your first row group is 59176084 - 4 = 59176080 bytes or less (since padding can take place as well). I copied your results into /tmp/rowgroups.dat on my computer and calculated your row group sizes by issuing the following command:
$ cat /tmp/rowgroups.dat | sed 's/.*OFFSET://' | numinterval
59176080
60809783
29408673
64243499
63585828
54864199
57684584
38374804
55453519
(The numinterval command is in the num-utils package on Ubuntu.) As you can see, all of your row groups are smaller than the row group size you specified. (The reason why they are not exactly the specified size is PARQUET-1337.)
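If numinterval is not available, the same offset differences can be computed with a plain awk one-liner over the same file (a sketch):
cat /tmp/rowgroups.dat | sed 's/.*OFFSET://' | awk 'NR>1 {print $1-prev} {prev=$1}'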

Hadoop: Using Pig to add text at the end of every line of an HDFS file

We have files with raw logs in HDFS; each individual log entry is one line, as the logs are line-separated.
Our requirement is to add a text (' 12345', for example) at the end of every log line in these files, using Pig, a Hadoop command, or any other MapReduce-based tool.
Please advise
Thanks
AJ
Load the files so that each log entry is loaded into one field, i.e. line:chararray, and use CONCAT to add the text to each line. Store the result into a new log file. If you want individual output files, you will have to parameterize the script to load each file and store it into its own new file instead of using a wildcard load (see the parameterized sketch after the example below).
Log = LOAD '/path/wildcard/*.log' USING TextLoader() AS (line:chararray);
Log_Text = FOREACH Log GENERATE CONCAT(line, 'Your Text') AS newline;
STORE Log_Text INTO '/path/NewLog.log';
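To parameterize the script per file as mentioned above, Pig's parameter substitution can be used. A sketch, where the parameter names and the script file name (append_text.pig) are placeholders:
Log = LOAD '$INPUT' USING TextLoader() AS (line:chararray);
Log_Text = FOREACH Log GENERATE CONCAT(line, ' 12345') AS newline;
STORE Log_Text INTO '$OUTPUT';
It could then be run once per file, e.g.
pig -param INPUT=/path/file1.log -param OUTPUT=/path/file1_new.log append_text.pig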
If your files aren't extremely large, you can do that with a single shell command.
hdfs dfs -cat /user/hdfs/logfile.log | sed 's/$/12345/g' |\
hdfs dfs -put - /user/hdfs/newlogfile.txt

How do I determine the size of my HBase tables? Is there any command to do so?

I have multiple tables in HBase that I would like to copy onto my local file system. Some tables exceed 100 GB. However, I only have 55 GB of free space left on my local file system. Therefore, I would like to know the size of my HBase tables so that I can export only the smaller tables. Any suggestions are appreciated.
Thanks,
gautham
try
hdfs dfs -du -h /hbase/data/default/ (or /hbase/ depending on hbase version you use)
This will show how much space is used by files of your tables.
Hope that will help.
For 0.98+, try hadoop fs -du -s -h $hbase_root_dir/data/data/$schema_name/ (or /hbase/ for 0.94).
You can find hbase_root_dir in the hbase-site.xml file of your cluster.
The above command will give you a summary of the disk space used by each table.
use du
Usage: hdfs dfs -du [-s] [-h] URI [URI …]
Displays sizes of files and directories contained in the given directory, or the length of a file in case it's just a file.
Options:
The -s option will result in an aggregate summary of file lengths being displayed, rather than the individual files.
The -h option will format file sizes in a "human-readable" fashion (e.g. 64.0m instead of 67108864).
Example:
hdfs dfs -du -h /hbase/data/default
output for me:
1.2 M /hbase/data/default/kylin_metadata
14.0 K /hbase/data/default/kylin_metadata_acl
636 /hbase/data/default/kylin_metadata_user
5.6 K /hbase/data/default/test

If I change the size of HDFS block does the size of splits change too?

If I change the size of HDFS block:
hadoop fs -D dfs.block.size=134217728 -put local_name remote_location
Does the size of splits (mapreduce.input.fileinputformat.split.minsize) change too?
-D dfs.block.size has no bearing on mapreduce.input.fileinputformat.split.minsize; it only alters the block size for that one file.
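If you also want to influence the split size, that is set on the MapReduce job itself rather than on the put. A sketch, assuming a hypothetical job jar and a driver that uses ToolRunner so the -D options are picked up:
hadoop jar my-job.jar MyDriver -D mapreduce.input.fileinputformat.split.minsize=134217728 /input /output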

Checking filesize and its distribution in HDFS

Is it possible to know a file's size in blocks and its distribution over the DataNodes in Hadoop?
Currently I am using:
frolo#A11:~/hadoop> $HADOOP_HOME/bin/hadoop dfs -stat "%b %o %r %n" /user/frolo/input/rmat-*
318339 67108864 1 rmat-10.0
392835957 67108864 1 rmat-20.0
This does not show the actual number of blocks created after uploading the file to HDFS, and I don't know of any way to find out their distribution.
Thanks,
Alex
The %r in your stat command shows the replication factor of the queried file. If this is 1, it means there will be only a single replica across the cluster for blocks belonging to this file. The hadoop fs -ls output also shows this value for listed files as one of its numeric columns, as the replication factor is a per-file FS attribute.
If you are looking to find where the blocks reside instead, you are looking for hdfs fsck (or hadoop fsck if using a dated release) instead. The below, for example, will let you see the list of block IDs and their respective set of resident locations, for any file:
hdfs fsck /user/frolo/input/rmat-10.0 -files -blocks -locations
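As a cross-check, the block count reported by fsck should match a simple calculation from the stat output above: rmat-20.0 at 392835957 bytes with a 67108864-byte block size needs ceil(392835957 / 67108864) = 6 blocks, while rmat-10.0 at 318339 bytes fits in a single block.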
