How do you set the row group size of files in hdfs? - hadoop

I am running some experiments on block size (dfs.block.size) and row group size (parquet.block.size) in hdfs.
I have a large set of data in hdfs, and I want to replicate the data with various block sizes and row group sizes for testing. I'm able to copy the data with a different block size using:
hdfs dfs -D dfs.block.size=67108864 -D parquet.block.size=67108864 -cp /new_sample_parquet /new_sample_parquet_64M
But only the dfs.block.size gets changed. I am verifying with hdfs dfs -stat for the block size, and parquet-tools meta for the row group size. In fact, if I replace parquet.block.size with blah.blah.blah it has the same effect. I even went into spark-shell and set the parquet.block.size property manually using
sc.hadoopConfiguration.setInt("parquet.block.size", 67108864).
I am using hadoop 3.1.0. I got the property name of parquet.block.size from here.
Here are the first 10 rows of the output of my attempt:
row group 1: RC:4140100 TS:150147503 OFFSET:4
row group 2: RC:3520100 TS:158294646 OFFSET:59176084
row group 3: RC:880100 TS:80122359 OFFSET:119985867
row group 4: RC:583579 TS:197303521 OFFSET:149394540
row group 5: RC:585594 TS:194850776 OFFSET:213638039
row group 6: RC:2620100 TS:130170698 OFFSET:277223867
row group 7: RC:2750100 TS:136761819 OFFSET:332088066
row group 8: RC:1790100 TS:86766854 OFFSET:389772650
row group 9: RC:2620100 TS:125876377 OFFSET:428147454
row group 10: RC:1700100 TS:83791047 OFFSET:483600973
As you can see, the TS (total size) is way larger than 64MB (67108864 bytes).
My current theory:
I am doing this in spark-shell:
sc.hadoopConfiguration.setInt("parquet.block.size", 67108864)
val a = spark.read.parquet("my_sample_data")
a.rdd.getNumPartitions // 1034
val s = a.coalesce(27)
s.write.format("parquet").mode("Overwrite").options(Map("dfs.block.size" -> "67108864")).save("/my_new_sample_data")
So perhaps it's because my input data already has 1034 partitions. I'm really not sure. My data has about 118 columns per row.

The parquet.block.size property only affects Parquet writers. The hdfs dfs -cp command, on the other hand, copies files byte-for-byte regardless of their contents, so it ignores the parquet.block.size property entirely.
Imagine that you have an application that takes screenshots in either JPG or PNG format, depending on a config file. You make a copy of those screenshots with the cp command. Naturally, even if you change the desired image format in the config file, the cp command will always create output files in the image format of the original files, regardless of the config file. The config file is only used by the screenshot application and not by cp. This is how the parquet.block.size property works as well.
What you can do to change the block size is to rewrite the file. You mentioned that you have spark-shell. Use that to rewrite the Parquet file by issuing
sc.hadoopConfiguration.setInt("parquet.block.size", 67108864)
var df = spark.read.parquet("/path/to/input.parquet")
df.write.parquet("/path/to/output")
Update: Since you mentioned in the comments below that it does not work for you, I ran an experiment and am posting the session transcript below:
$ spark-shell
scala> sc.hadoopConfiguration.setInt("parquet.block.size", 200000)
scala> var df = spark.read.parquet("/tmp/infile.parquet")
df: org.apache.spark.sql.DataFrame = [field0000: binary, field0001: binary ... 78 more fields]
scala> df.write.parquet("/tmp/200K")
scala> df.write.format("parquet").mode("Overwrite").options(Map("parquet.block.size" -> "300000")).save("/tmp/300K")
scala> :quit
$ hadoop fs -copyToLocal /tmp/{200K,300K} /tmp
$ parquet-tools meta /tmp/infile.parquet | grep "row group" | head -n 3
row group 1: RC:4291 TS:5004800 OFFSET:4
row group 2: RC:3854 TS:4499360 OFFSET:5004804
row group 3: RC:4293 TS:5004640 OFFSET:10000000
$ parquet-tools meta /tmp/200K/part-00000-* | grep "row group" | head -n 3
row group 1: RC:169 TS:202080 OFFSET:4
row group 2: RC:168 TS:201760 OFFSET:190164
row group 3: RC:169 TS:203680 OFFSET:380324
$ parquet-tools meta /tmp/300K/part-00000-* | grep "row group" | head -n 3
row group 1: RC:254 TS:302720 OFFSET:4
row group 2: RC:255 TS:303280 OFFSET:284004
row group 3: RC:263 TS:303200 OFFSET:568884
By looking at the TS values you can see that the input file had a row group size of 4.5-5M and the output files have row group sizes of 200K and 300K, respectively. This shows that the value set using sc.hadoopConfiguration becomes the "default", while the other method you mentioned in a comment below involving df.options overrides this default.
Update 2: Now that you have posted your output, I can see what is going on. In your case, compression is taking place, increasing the amount of data that will fit in row groups. The row group size applies to the compressed data, but TS shows the size of uncompressed data. However, you can deduce the size of row groups by subtracting their starting offsets. For example, the compressed size of your first row group is 59176084 - 4 = 59176080 bytes or less (since padding can take place as well). I copied your results into /tmp/rowgroups.dat on my computer and calculated your row group sizes by issuing the following command:
$ cat /tmp/rowgroups.dat | sed 's/.*OFFSET://' | numinterval
59176080
60809783
29408673
64243499
63585828
54864199
57684584
38374804
55453519
(The numinterval command is in the num-utils package on Ubuntu.) As you can see, all of your row groups are smaller than the row group size you specified. (The reason why they are not exactly the specified size is PARQUET-1337.)
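If numinterval is not available, the same offset arithmetic is easy to replicate. Below is a minimal Python sketch that extracts the OFFSET values from parquet-tools meta output and takes the differences between consecutive row groups (the input lines are the ones quoted in the question):

```python
# Rough equivalent of the numinterval step: each row group's on-disk
# (compressed) size is the gap between its OFFSET and the next one.
import re

def row_group_sizes(meta_lines):
    offsets = [int(re.search(r"OFFSET:(\d+)", line).group(1))
               for line in meta_lines if "OFFSET:" in line]
    return [b - a for a, b in zip(offsets, offsets[1:])]

meta = [
    "row group 1: RC:4140100 TS:150147503 OFFSET:4",
    "row group 2: RC:3520100 TS:158294646 OFFSET:59176084",
    "row group 3: RC:880100 TS:80122359 OFFSET:119985867",
]
print(row_group_sizes(meta))  # [59176080, 60809783]
```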

Related

Hadoop does the returned file size include the replication factor?

I have file stored on HDFS and I need to get its size. I used the following line at the command prompt to get the file size
hadoop fs -du -s train.csv | awk '{s+=$1} END {print s}'
I know that Hadoop stores duplicates of files decided by the replication factor. So when I run the line above, is the returned size the file size time the replication factor or just the file size?
From Hadoop documentation:
The du returns three columns with the following format:
size disk_space_consumed_with_all_replicas full_path_name
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
As you can see, the first column is the size of the file, while the second column is the space consumed including all replicas.
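To illustrate with a hypothetical du line (the numbers below are made up for the example): with a replication factor of 3, the second column is exactly three times the first, so dividing the two recovers the replication factor.

```python
# Split a hypothetical `hadoop fs -du` output line into its three columns.
# Column 1 is the plain file size; column 2 already includes all replicas.
line = "1934871  5804613  /user/me/train.csv"  # hypothetical output
size, consumed, path = line.split()
print(int(size))                   # 1934871 -- actual file size in bytes
print(int(consumed) // int(size))  # 3       -- inferred replication factor
```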

Hadoop : Using Pig to add text at the end of every line of a hdfs file

We have files in HDFS containing raw logs, one log entry per line, as these logs are line-separated.
Our requirement is to add a text (e.g. ' 12345') at the end of every log entry in these files ... using Pig, a Hadoop command, or any other MapReduce-based tool.
Please advise
Thanks
AJ
Load the files so that each log entry goes into a single field, i.e. line:chararray, and use CONCAT to add the text to each line. Store the result into a new log file. If you want individual output files, you will have to parameterize the script to load each file and store it into its own new file instead of using a wildcard load.
Log = LOAD '/path/wildcard/*.log' USING TextLoader() AS (line:chararray);
Log_Text = FOREACH Log GENERATE CONCAT(line, 'Your Text') AS newline;
STORE Log_Text INTO '/path/NewLog.log';
If your files aren't extremely large, you can do that with a single shell command.
hdfs dfs -cat /user/hdfs/logfile.log | sed 's/$/12345/g' |\
hdfs dfs -put - /user/hdfs/newlogfile.txt

Data retention in Hadoop HDFS

We have a Hadoop cluster with over 100TB data in HDFS. I want to delete data older than 13 weeks in certain Hive tables.
Are there any tools or way I can achieve this?
Thank you
To delete data older than a certain time frame, you have a few options.
First, if the Hive table is partitioned by date, you could simply DROP the partitions within Hive and remove their underlying directories.
Second option would be to run an INSERT to a new table, filtering out the old data using a datestamp (if available). This is likely not a good option since you have 100TB of data.
A third option would be to recursively list the data directories for your Hive tables: hadoop fs -lsr /path/to/hive/table. This will output a list of the files and their modification dates. You can take this output, extract the date, and compare it against the time frame you want to keep. If a file is older than you want to keep, run hadoop fs -rm <file> on it.
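That third option is easy to sketch in Python. The snippet below assumes the usual ls output layout (permissions, replication, owner, group, size, date, time, path); field positions can differ between Hadoop versions, so treat this as a starting point rather than a finished tool:

```python
# Sketch: parse captured `hadoop fs -lsr` output, keep files whose
# modification time is older than the cutoff, and emit rm commands.
from datetime import datetime, timedelta

def rm_commands(ls_lines, cutoff):
    cmds = []
    for line in ls_lines:
        fields = line.split()
        if len(fields) < 8 or fields[0].startswith("d"):
            continue  # skip directories and malformed lines
        mtime = datetime.strptime(fields[5] + " " + fields[6], "%Y-%m-%d %H:%M")
        if mtime < cutoff:
            cmds.append("hadoop fs -rm " + fields[7])
    return cmds

cutoff = datetime.now() - timedelta(weeks=13)
ls_output = ["-rw-r--r-- 3 hive hadoop 1024 2015-01-01 12:00 /warehouse/t/part-0"]
print(rm_commands(ls_output, cutoff))  # ['hadoop fs -rm /warehouse/t/part-0']
```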
A fourth option would be to grab a copy of the FSImage: curl --silent "http://<active namenode>:50070/getimage?getimage=1&txid=latest" -o hdfs.image Next, turn it into a text file: hadoop oiv -i hdfs.image -o hdfs.txt. The text file will contain a text representation of HDFS, the same as what hadoop fs -ls ... would return.

Hive - Possible to get total size of file parts in a directory?

Background:
I have some gzip files in a HDFS directory. These files are named in the format yyyy-mm-dd-000001.gz, yyyy-mm-dd-000002.gz and so on.
Aim:
I want to build a hive script which produces a table with the columns: Column 1 - date (yyyy-mm-dd), Column 2 - total file size.
To be specific, I would like to sum up the sizes of all of the gzip files for a particular date. The sum will be the value in Column 2 and the date in Column 1.
Is this possible? Are there any in-built functions or UDFs that could help me with my use case?
Thanks in advance!
A MapReduce job for this doesn't seem efficient since you don't actually have to load any data. Plus, doing this seems kind of awkward in Hive.
Can you write a bash script or python script or something like that to parse the output of hadoop fs -ls? I'd imagine something like this:
$ hadoop fs -ls mydir/*gz | python datecount.py | hadoop fs -put - counts.txt
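The datecount.py above doesn't exist yet; here is a rough sketch of what it could look like, assuming the filename pattern from the question and the usual ls column layout (size in the 5th field, path in the 8th):

```python
# Hypothetical datecount.py body: group `hadoop fs -ls` lines by the date
# embedded in filenames like yyyy-mm-dd-000001.gz and sum sizes per date.
import re
from collections import defaultdict

def totals_by_date(ls_lines):
    totals = defaultdict(int)
    for line in ls_lines:
        fields = line.split()
        if len(fields) < 8:
            continue  # skip headers and malformed lines
        m = re.search(r"(\d{4}-\d{2}-\d{2})-\d+\.gz$", fields[7])
        if m:
            totals[m.group(1)] += int(fields[4])
    return dict(totals)

ls_output = [
    "-rw-r--r-- 3 me hadoop 100 2016-05-01 10:00 mydir/2016-04-30-000001.gz",
    "-rw-r--r-- 3 me hadoop 250 2016-05-01 10:00 mydir/2016-04-30-000002.gz",
]
print(totals_by_date(ls_output))  # {'2016-04-30': 350}
```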

Fastest access of a file using Hadoop

I need the fastest access to a single file, several copies of which are stored on many systems using Hadoop. I also need to find the ping time for each file, in sorted order.
How should I approach learning hadoop to accomplish this task?
Please help fast. I have very little time.
If you need faster access to a file, just increase the replication factor for that file using the setrep command. This might not increase the file throughput proportionally, because of your current hardware limitations.
The ls command does not give the access time for directories and files; it shows only the modification time. Use the Offline Image Viewer to dump the contents of HDFS fsimage files to a human-readable format. Below is the command using the Indented option.
bin/hdfs oiv -i fsimagedemo -p Indented -o fsimage.txt
A sample o/p from the fsimage.txt, look for the ACCESS_TIME column.
INODE
INODE_PATH = /user/praveensripati/input/sample.txt
REPLICATION = 1
MODIFICATION_TIME = 2011-10-03 12:53
ACCESS_TIME = 2011-10-03 16:26
BLOCK_SIZE = 67108864
BLOCKS [NUM_BLOCKS = 1]
BLOCK
BLOCK_ID = -5226219854944388285
NUM_BYTES = 529
GENERATION_STAMP = 1005
NS_QUOTA = -1
DS_QUOTA = -1
PERMISSIONS
USER_NAME = praveensripati
GROUP_NAME = supergroup
PERMISSION_STRING = rw-r--r--
To get the ping time in a sorted manner, you need to write a shell script or some other program to extract the INODE_PATH and ACCESS_TIME from each INODE section and then sort them based on ACCESS_TIME. You can also use Pig as shown here.
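As a minimal sketch of that extraction step (assuming the Indented dump format shown above, with ACCESS_TIME following INODE_PATH in each INODE section):

```python
# Pair each INODE_PATH with the ACCESS_TIME that follows it in the Indented
# oiv dump, then sort by access time, most recent first.
import re

def paths_by_access_time(dump_lines):
    records, path = [], None
    for line in dump_lines:
        m = re.match(r"\s*INODE_PATH = (.+)", line)
        if m:
            path = m.group(1)
            continue
        m = re.match(r"\s*ACCESS_TIME = (.+)", line)
        if m and path is not None:
            records.append((m.group(1), path))
            path = None
    return sorted(records, reverse=True)

dump = [
    "INODE_PATH = /user/praveensripati/input/sample.txt",
    "MODIFICATION_TIME = 2011-10-03 12:53",
    "ACCESS_TIME = 2011-10-03 16:26",
    "INODE_PATH = /user/praveensripati/input/other.txt",
    "ACCESS_TIME = 2011-11-15 09:00",
]
print(paths_by_access_time(dump))
```

Sorting the timestamp strings lexically works here because the dump uses a fixed yyyy-mm-dd hh:mm format.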
How should I approach learning hadoop to accomplish this task? Please help fast. I have very little time.
Learning Hadoop in a day or two is not possible. Here are some videos and articles to start with.