Hadoop MapReduce with compressed/encrypted files (files of large size)

I have an HDFS cluster which stores large CSV files in a compressed/encrypted form, as selected by the end user.
For compression and encryption, I have created a wrapper input stream which feeds data to HDFS in compressed/encrypted form. The compression format used is GZ and the encryption format is AES256.
A 4.4GB CSV file is compressed to 40MB on HDFS.
Now I have a MapReduce job (Java) which processes multiple compressed files together. The MR job uses FileInputFormat.
When splits are calculated, the 4.4GB compressed file (40MB) is allocated only 1 mapper, with split start 0 and split length equal to 40MB.
How do I process such a compressed file of larger size? One option I found was to implement a custom RecordReader and use the wrapper input stream to read uncompressed data and process it.
Since I don't have the actual length of the file, I don't know how much data to read from the input stream.
If I read up to the end of the InputStream, then how do I handle the case when 2 mappers are allocated to the same file, as explained below?
If the compressed file size is larger than 64MB, then 2 mappers will be allocated to the same file.
How do I handle this scenario?
Hadoop Version - 2.7.1

The compression format should be chosen keeping in mind whether the file will be processed by MapReduce, because if the compression format is splittable, then MapReduce works normally.
However, if it is not splittable (in your case gzip is not splittable, and MapReduce will know it), then the entire file is processed in one mapper. This serves the purpose, but has data locality issues, since a single mapper performs the whole job and has to fetch the data from other blocks.
From the Hadoop Definitive Guide:
"For large files, you should not use a compression format that does not support splitting on the whole file, because you lose locality and make MapReduce applications very inefficient."
You can refer to the compression section of the Hadoop I/O chapter for more information.
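If you keep the non-splittable GZ + AES256 wrapper, one workable pattern is to make the input format refuse to split these files, so a second mapper is never assigned to the same file, and let the custom RecordReader you already mention read through the wrapper stream until it hits end-of-file. A minimal sketch (DecryptingRecordReader is a hypothetical name standing in for that custom reader):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Input format that never splits a file, so each compressed/encrypted file
// goes to exactly one mapper regardless of its on-disk size.
public class WholeFileTextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // The GZ + AES256 wrapper stream cannot be entered mid-file.
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        // Hypothetical custom reader: wraps the raw HDFS stream with the
        // decrypting/decompressing wrapper and emits lines until EOF.
        return new DecryptingRecordReader();
    }
}
The trade-off is exactly what the answer above describes: one mapper per file, with the loss of locality that comes with it.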

Related

How does Hadoop HDFS decide what data is put into each block?

I have been trying to dive into how Hadoop HDFS decides what data to put into one block and don't seem to find any solid answer. We know that Hadoop will automatically distribute data into blocks in HDFS across the cluster, but which data of each file is put together in a block? Is it just put in arbitrarily? And is this the same for Spark RDDs?
HDFS block behavior
I'll attempt to highlight, by way of example, the differences in block splits with reference to file size. In HDFS you have:
Splittable FileA size 1GB
dfs.block.size=67108864(~64MB)
MapRed job against this file:
16 splits and in turn 16 mappers.
Let's look at this scenario with a compressed (non-splittable) file:
Non-Splittable FileA.gzip size 1GB
dfs.block.size=67108864(~64MB)
MapRed job against this file:
16 Blocks will converge on 1 mapper.
It's best to proactively avoid this situation since it means that the tasktracker will have to fetch 16 blocks of data most of which will not be local to the tasktracker.
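To see the block-level picture for a concrete file, a quick sketch using the standard HDFS client API (the path is a placeholder) prints how many physical blocks back it; for the splittable 1GB FileA above with a 64MB block size you would see 16, which is also the expected number of mappers:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Print how many physical HDFS blocks a file occupies.
public class BlockCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]); // e.g. /data/FileA
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        System.out.println(file + ": " + status.getLen() + " bytes in " + blocks.length + " blocks");
    }
}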
Spark reading an HDFS splittable file:
sc.textFile doesn't commence any reading. It simply defines a driver-resident data structure which can be used for further processing.
It is not until an action is called on an RDD that Spark will build up a strategy to perform all the required transforms (including the read) and then return the result.
If there is an action called to run the sequence, and your next transformation after the read is to map, then Spark will need to read a small section of lines of the file (according to the partitioning strategy based on the number of cores) and then immediately start to map it until it needs to return a result to the driver, or shuffle before the next sequence of transformations.
If your partitioning strategy (defaultMinPartitions) seems to be swamping the workers because the java representation of your partition (an InputSplit in HDFS terms) is bigger than available executor memory, then you need to specify the number of partitions to read as the second parameter to textFile. You can calculate the ideal number of partitions by dividing your file size by your target partition size (allowing for memory growth). A simple check that the file can be read would be:
sc.textFile(file, numPartitions)
.count()
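As a rough illustration of the partition-count arithmetic described above, here is the same check written against the Java Spark API (the file size and target partition size are illustrative numbers, not recommendations):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionSizing {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("partition-sizing"));

        long fileSizeBytes = 10L * 1024 * 1024 * 1024;  // e.g. a 10GB input file
        long targetPartitionBytes = 128L * 1024 * 1024; // leave headroom for JVM object overhead
        int numPartitions = (int) ((fileSizeBytes + targetPartitionBytes - 1) / targetPartitionBytes);

        // Simple check that the file can be read with this partitioning.
        long lines = sc.textFile(args[0], numPartitions).count();
        System.out.println("lines = " + lines);
        sc.stop();
    }
}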

Compressed file vs uncompressed file in MapReduce: which one gives better performance?

I have a 10GB CSV file and I want to process it in Hadoop MapReduce.
I have a 15-node (DataNode) cluster and I want to maximize the throughput.
What compression format should I use? Or will a text file without compression always give me a better result than a compressed text file? Please explain the reason.
I used an uncompressed file and it gave me better results than Snappy. Why is that?
The problem with Snappy compression is that it is not splittable, so Hadoop can't divide the input file into chunks and run several mappers over the input. So most likely your 10GB file is processed by a single mapper (check it in the application history UI). Since Hadoop stores big files in separate blocks on different machines, some parts of this file are not even located on the mapper's machine and have to be transferred over the network. That seems to be the main reason why the Snappy-compressed file works slower than plain text.
To avoid the problem you can use bzip2 compression, or divide the file into chunks manually and compress each part with Snappy.
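If you control the job that produces these files, here is a small sketch of switching the output to a splittable codec (bzip2 here) so that a downstream job can split it across many mappers:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplittableOutputConfig {
    // Configure a job to write bzip2-compressed output, which later
    // MapReduce jobs can still split across mappers.
    public static Job newJob() throws Exception {
        Job job = Job.getInstance(new Configuration(), "write-splittable-output");
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
        return job;
    }
}
Note that bzip2 is comparatively CPU-heavy, so whether it beats uncompressed text still depends on where your cluster's bottleneck is.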

Why does MapReduce split a compressed file into input splits?

So from my understanding, when HDFS stores a bzip2-compressed 1GB file with a block size of 64MB, the file will be stored as 16 different blocks. If I want to run a MapReduce job on this compressed file, MapReduce tries to split the file again. Why doesn't MapReduce automatically use the 16 blocks in HDFS instead of splitting the file again?
I think I see where your confusion is coming from. I'll attempt to clear it up.
HDFS slices your file up into blocks. These are physical partitions of the file.
MapReduce creates logical splits on top of these blocks. These splits are defined based on a number of parameters, with block boundaries and locations being a huge factor. You can set your minimum split size to 128MB, in which case each split will likely be exactly two 64MB blocks.
All of this is unrelated to your bzip2 compression. If you had used gzip compression, each split would be an entire file, because gzip is not a splittable compression format.
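For reference, the 128MB minimum split size mentioned above is a one-line job setting (a sketch; the value is illustrative):
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    // Ask FileInputFormat for splits of at least 128MB, so each logical
    // split covers roughly two 64MB HDFS blocks.
    public static void configure(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
    }
}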

Advantages of SequenceFile over HDFS text file

What is the advantage of a Hadoop SequenceFile over an HDFS flat (text) file? In what way is a SequenceFile efficient?
Small files can be combined and written into a sequence file, but the same can be done for an HDFS text file too. I need to know the difference between the two approaches. I have been googling about this for a while; it would be helpful to get some clarity on this.
Sequence files are appropriate for situations in which you want to store keys and their corresponding values. With text files you can do that too, but you have to parse each line.
Sequence files can be compressed and still be splittable, which means better workload distribution. You can't split a compressed text file unless you use a splittable compression format.
Sequence files can be treated as binary files => more storage efficient. In a text file a double becomes a string of characters => large storage overhead.
Advantages of Hadoop sequence files (as per Siva's article on the hadooptutorial.info website):
More compact than text files
Provide support for compression at different levels (record or block)
Files can be split and processed in parallel
They can solve the large-number-of-small-files problem in Hadoop, whose main strength is processing large files with MapReduce jobs; a sequence file can be used as a container for a large number of small files
Temporary output of the mapper can be stored in sequence files
Disadvantages:
Sequence files are append only
Sequence files are also used as intermediate files between the map and reduce phases of MapReduce processing. They are compressible and fast to process: the mapper writes its output to them and the reducer reads from them.
There are APIs in Hadoop and Spark to read/write sequence files
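As a small illustration of the small-files point above, here is a sketch that packs local files into one block-compressed SequenceFile, using the file name as the key and the raw bytes as the value (paths are placeholders):
import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

// Pack many small local files into one block-compressed SequenceFile.
public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path(args[0]); // e.g. /data/small-files.seq
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(CompressionType.BLOCK));
        try {
            for (int i = 1; i < args.length; i++) {
                byte[] data = Files.readAllBytes(new File(args[i]).toPath());
                writer.append(new Text(args[i]), new BytesWritable(data));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}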

Processing Large Binary Files with Hadoop

I know there have been similar posts on here but I can't find one that really has a solid answer.
We have a Hadoop cluster loaded with binary files. These files can range anywhere in size from a few hundred KB to hundreds of MB.
We are currently processing these files using a custom record reader that reads the entire contents of the file into each map. From there we extract the appropriate metadata we want and serialize it into JSON.
The problem we are foreseeing is that we might eventually reach a size that our NameNode can't handle. There is only so much memory to go around, and having a NameNode with a couple of terabytes of memory seems ridiculous.
Is there a graceful way to process large binary files like this? Especially those which can't be split, because we don't know what order the reducer would put them back together in.
So, not an answer as such, but I have so many questions that a list of comments would be more difficult to convey, so here goes:
You say you read the entire contents into memory for each map; can you elaborate on the actual binary input format of these files?
Do they contain logical records, i.e. does a single input file represent a single record, or does it contain many records?
Are the files compressed (after the fact, or via some internal compression mechanism)?
How are you currently processing this file-at-once; what's your overall ETL logic to convert to JSON?
Do you actually need to read the entire file into memory before processing can begin, or can you process once you have a buffer of some size populated (DOM vs. SAX XML parsing, for example)?
My guess is that you can migrate some of your mapper logic to the record reader, and possibly even find a way to 'split' the file between multiple mappers. This would then allow you to address your scalability concerns.
To address some points in your question:
The NameNode only requires memory to store metadata about the files and blocks (names, blocks [size, length, locations]). Assuming you assign it a decent memory footprint (GBs), there is no reason you can't have a cluster that holds petabytes of data in HDFS (assuming you have enough physical storage).
The NameNode doesn't have anything to do with either storage or processing; you should concentrate on your DataNodes and TaskTrackers instead. Also, I am not sure whether you are trying to address the storage or the processing of your files here. If you are dealing with lots of binary files, it is worth having a look at Hadoop SequenceFile. A SequenceFile is a flat file consisting of binary key/value pairs, and hence is used extensively in MapReduce as an input/output format. For a detailed explanation you can visit this page -
http://wiki.apache.org/hadoop/SequenceFile
When you have large binary files, use the SequenceFile format as the input format and set the input split size accordingly. You can estimate the number of mappers from the total input size and the split size you set; Hadoop will take care of splitting the input data.
If your binary files are compressed in some non-splittable format, then Hadoop cannot do this splitting. That is why the binary format should be SequenceFile.
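A minimal sketch of that setup, assuming the payloads were packed into SequenceFiles as file-name/bytes pairs (the split size and key/value types are illustrative):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class BinaryJobConfig {
    // Read binary payloads packed into SequenceFiles; Hadoop handles the
    // splitting, and the max split size caps how much data each mapper sees.
    public static void configure(Job job, String inputDir) throws Exception {
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(inputDir));
        // Cap splits at ~256MB (illustrative) so mappers don't load huge payloads at once.
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);
    }
}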
