How to merge small blocks of a file in HDFS? - hadoop

We plan to append data to our files in NEW_BLOCK mode. This gives us more flexibility as to DN status.
Now after running the process for days, we find our 2mb file has too many blocks.
Is there a way to merge the blocks of a file - say bring down the 100 blocks for a file to 4.

Related

Writing a file larger than block size in hdfs

If I am trying to write a file of 200MB into HDFS where HDFS block size is 128MB. What happens if the write fails after writing 150MB out of 200MB. Will I be able to read data from the portion of data written? What if I try to write the same file again? Will that be a duplicate? What happens to the 150MB of data written earlier to failure?
HDFS default Block Size is 128MB, if it fails while writing (it will show the status in Hadoop Administration UI, with file extension copying.)
Only 150MB data will be copied.
yeah you can read only portion of data(150MB).
Once you reinstate the copying it will continue from previous point(if both the paths are same and file name is same).
For every piece of data you can find the replication based on your replication factor.
Previous written data will be available in HDFS.

Input Format to save image files (jpeg,png) in HDFS

I want to save image files (like jpeg, png etc) on HDFS (Hadoop File System). I tried two ways :
Saved the image files as it is (i.e in the same format) into HDFS using put command. The full command was : hadoop fs -put /home/a.jpeg /user/hadoop/. It was successfully placed.
Converted these image files into Hadoop's Sequence File format & then saved in HDFS using put command.
I want to know which format should be used to save in HDFS.
And what are the pros of using Sequence File format. One of the advantage that I know is that it is splittable. Is there any other ?
images are very small in size compare to block size of HDFS storage. The problem with small files is the impact on processing performance, This is why you should use Sequence Files, HAR, HBase or merging solutions. see these two threads more info.
effective way to store image files
How many files is too many on a modern HDP cluster?
Processing a 1Mb file has an overhead to it. So processing 128 1Mb
files will cost you 128 times more "administrative" overhead, versus
processing 1 128Mb file. In plain text, that 1Mb file may contain 1000
records. The 128 Mb file might contain 128000 records.

What is the best place to store multiple small files in hadoop

I will be having multiple small text files around size of 10KB, got confused where to store those files in HBase or in HDFS. what will be the optimized storage?
Because to store in HBase I need to parse it first then save it against some row key.
In HDFS I can directly create a path and save that file at that location.
But till now whatever I read, it says you should not have multiple small files instead create less big files.
But I can not merge those files, so I can't create big file out of small files.
Kindly suggest.
A large number of small files don´t fit very well with hadoop since each file is a hdfs block and each block require a one Mapper to be processed by default.
There are several options/strategies to minimize the impact of small files, all options require to process at least one time small files and "package" them in a better format. If you are planning to read these files several times, pre-process small files could make sense, but if you will use those files just one time then it doesn´t matter.
To process small files my sugesstion is to use CombineTextInputFormat (here an example): https://github.com/lalosam/HadoopInExamples/blob/master/src/main/java/rojosam/hadoop/CombinedInputWordCount/DriverCIPWC.java
CombineTextInputFormat use one Mapper to process several files but could require to transfer the files to a different DataNode to put files together in the DAtaNode where the map is running and could have a bad performance with speculative tasks but you can disable them if your cluster is enough stable.
Alternative to repackage small files are:
Create sequence files where each record contains one of the small files. With this option you will keep the original files.
Use IdentityMapper and IdentityReducer where the number of reducers are less than the number of files. This is the most easy approach but require that each line in the files be equals and independents (Not headers or metadata at the beginning of the files required to understand the rest of the file).
Create a external table in hive and then insert all the records for this table into a new table (INSERT INTO . . . SELECT FROM . . .). This approach have the same limitations than the option two and require to use Hive, the adventage is that you don´t require to write a MapReduce.
If you can not merge files like in option 2 or 3, my suggestion is to go with option 1
You could try using HAR archives: https://hadoop.apache.org/docs/r2.7.2/hadoop-archives/HadoopArchives.html
It's no problem with having many small different files. If for example you have a table in Hive with many very small files in hdfs, it's not optimal, better to merge these files into less big ones because when reading this table a lot of mappers will be created. If your files are completely different like 'apples' and 'employees' and can not be merged than just store them as is.

Hadoop Avro file size concern

I have a cronjob that that downloads zip files (200 bytes to 1MB) from a server on the internet every 5 minutes. If I import the zip files into HDFS as is, I encounter the infamous Hadoop small file size issue. In order to avoid the build up of small files in HDFS, process of the the text data in the zip files and convert them into avro files and wait every 6 hours to add my avro file into HDFS. Using this method, I have managed to get avro files imported into HDFS with a file size larger than 64MB. The files sizes range from 50MB to 400MB. What I'm concerned about is that what happens if I start building file sizes that start getting into the 500KB avro file size range or larger. Will this cause issues with Hadoop? How does everyone else handle this situation?
Assuming that you have some Hadoop post-aggregation step and that you're using some splittable compression type (sequence, snappy, none at all), you shouldn't face any issues from Hadoop's end.
If you would like your avro file sizes to be smaller, the easiest way to do this would be to make your aggregation window configurable and lower it when needed (6 hours => 3 hours?). Another way you might be able to ensure more uniformity in file sizes would be to keep a running count of lines seen from downloaded files and then combine upload after a certain line threshold has been reached.

Concatenate thousands of files using EMR

I currently have a process which reads files from AWS S3 and concatenates them using EMR.
The input files have the following format: 1 header row and 1 data row.
Fields are comma-separated and wrapped in double-quotes.
Example:
"header-field1","header-field2","header-field3",...
"data-field1","data-field2","data-field3",...
The files vary in size between 90 and 200 bytes.
The output file has the following format:
"header-field1","header-field2","header-field3",...
"file1-data-field1","file1-data-field2","file1-data-field3",...
"file2-data-field1","file2-data-field2","file2-data-field3",...
"file3-data-field1","file3-data-field2","file3-data-field3",...
....
My current approach uses a default mapper and a single reducer to concatenate all the data rows and prepend 1 header row at the top of the final output file.
Because I want to have a single header row in the output final, I was forced to use only 1 single reducer in my EMR job. This I feel, drastically increases run-time.
Early tests ran great with tens of files.
However, I am trying to scale this application to run for thousands of files with the final goal of concatenating 1 million.
My current process for 1000 files is still running after 30+ minutes, which is too long.
Do you have any suggestions on where I can improve my application to dramatically improve overall performance?
thank you.

Resources