Does a Hadoop sequence file add dummy data to it - hadoop

In my case I am trying to pack all the given image files into a Hadoop sequence file to avoid the small-files problem. I first created the sequence file with the help of a mapper application. The key for each image file is its path, and the corresponding value is the byte array of the image file. While writing an image to the sequence file, the size of its byte array is, say, 14k. Whenever I try to read the value back from the sequence file (the output of the first mapper) with another mapper, the size of the read byte array drastically increases to, say, 500k. I don't know where the problem is. Please help me out with this.
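A minimal sketch of the write/read pattern described above, assuming Text keys, BytesWritable values and hypothetical paths. One thing worth noting is that BytesWritable.getBytes() returns the whole backing buffer, which can be considerably larger than the logical value, so readers should copy only getLength() bytes.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImageSequenceFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path seqPath = new Path("images.seq");          // hypothetical output path
        byte[] imageBytes = new byte[14 * 1024];        // stand-in for real image data

        // Write: key = image path, value = raw image bytes
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(seqPath),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            writer.append(new Text("/images/photo1.jpg"), new BytesWritable(imageBytes));
        } finally {
            IOUtils.closeStream(writer);
        }

        // Read: BytesWritable reuses an internal buffer that may be larger than the
        // value itself, so copy only getLength() bytes instead of taking getBytes() as-is.
        SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(seqPath));
        try {
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            while (reader.next(key, value)) {
                byte[] image = Arrays.copyOfRange(value.getBytes(), 0, value.getLength());
                System.out.println(key + " -> " + image.length + " bytes");
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}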

Related

Each run of the same Hadoop SequenceFile creation routine creates a file with a different CRC. Is that OK?

I have simple code which creates a Hadoop SequenceFile. Each time the code is run, it leaves two files in the working dir:
mySequenceFile.txt
.mySequenceFile.txt.crc
After each run the sizes of both files remain the same, but the contents of the .crc file differ!
Is this a bug or expected behaviour?
This is confusing, but expected behaviour.
According to the SequenceFile format, each sequence file has a sync block that is 16 bytes long. The sync block repeats after each record in block-compressed sequence files, and after some number of records (or one very long record) in uncompressed or record-compressed sequence files.
The thing is that the sync block is essentially a random value. It is written in the header, which is how the reader recognizes it. It stays the same within one sequence file, but it can be (and actually is) different from one sequence file to another.
So the files are logically the same but differ at the byte level. The CRC is a checksum over the bytes, so it differs between the two files too.
I haven't found any way to set this sync block manually. If someone finds a way, please write it here.
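A small sketch that illustrates the behaviour, assuming Text keys and values and hypothetical local file names. Writing the same single record twice produces files of equal length whose raw bytes still differ, because each writer generates its own random sync marker (and the per-file .crc side files differ accordingly).

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SyncMarkerDemo {

    // Write one identical record into the given file.
    static void writeOnce(Configuration conf, Path path) throws Exception {
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class));
        try {
            writer.append(new Text("key"), new Text("value"));
        } finally {
            IOUtils.closeStream(writer);
        }
    }

    // Read a file's raw bytes so the two outputs can be compared.
    static byte[] readAll(FileSystem fs, Path path) throws Exception {
        byte[] buf = new byte[(int) fs.getFileStatus(path).getLen()];
        FSDataInputStream in = fs.open(path);
        try {
            in.readFully(buf);
        } finally {
            in.close();
        }
        return buf;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");   // keep the demo on the local filesystem
        FileSystem fs = FileSystem.get(conf);
        Path a = new Path("a.seq");             // hypothetical file names
        Path b = new Path("b.seq");

        writeOnce(conf, a);
        writeOnce(conf, b);

        byte[] bytesA = readAll(fs, a);
        byte[] bytesB = readAll(fs, b);
        // Same logical content and same length, but the headers contain different
        // random sync markers, so the raw bytes (and the .crc side files) differ.
        System.out.println("same length? " + (bytesA.length == bytesB.length));
        System.out.println("same bytes?  " + Arrays.equals(bytesA, bytesB));
    }
}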

Is it possible to know the serial number of the block of input data on which the map function is currently working?

I am a novice in Hadoop, and I have the following questions:
(1) As I understand it, the original input file is split into several blocks and distributed over the network. Does a map function always execute on a block in its entirety? Could there be more than one map function executing on data in a single block?
(2) Is there any way to learn, from within the map function, which section of the original input text the mapper is currently working on? I would like to get something like a serial number for each block, starting from the first block of the input text.
(3) Is it possible to make the splits of the input text in such a way that each block has a predefined word count? If so, how?
Any help would be appreciated.
As I understand it, the original input file is split into several blocks and distributed over the network. Does a map function always execute on a block in its entirety? Could there be more than one map function executing on data in a single block?
No. A block (a split, to be precise) gets processed by only one mapper.
Is there any way to learn, from within the map function, which section of the original input text the mapper is currently working on? I would like to get something like a serial number for each block, starting from the first block of the input text.
You can get some valuable information, such as the file containing the split's data and the position in the file of the first byte to process, with the help of the FileSplit class. You might find it helpful.
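As a rough illustration, assuming the new mapreduce API (where the mapper's input split can be cast to org.apache.hadoop.mapreduce.lib.input.FileSplit):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitInfoMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context) {
        // The input split carries the file, byte offset and length being processed.
        FileSplit split = (FileSplit) context.getInputSplit();
        Path file = split.getPath();      // file containing this split's data
        long start = split.getStart();    // byte offset of the first byte to process
        long length = split.getLength();  // number of bytes in this split
        System.out.println(file + " [" + start + ", " + (start + length) + ")");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // map logic as usual ...
    }
}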
Is it possible to make the splits of the input text in such a way that each block has a predefined word count? If so, how?
You can do that by extending the FileInputFormat class. To begin with, you could do this:
In your getSplits() method, maintain a counter. As you read the file line by line, keep tokenizing the lines. For each token, increase the counter by 1. Once the counter reaches the desired value, emit the data read up to this point as one split, then reset the counter and start on the next split. A rough sketch of this follows.
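A sketch of that idea under some assumptions: the new mapreduce API, whitespace tokenization, a made-up WORDS_PER_SPLIT threshold, and splits always cut on line boundaries.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.LineReader;

// Hypothetical input format: every split covers roughly WORDS_PER_SPLIT words.
public class WordCountSplitInputFormat extends TextInputFormat {

    private static final int WORDS_PER_SPLIT = 10000;   // assumed value, tune as needed

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> splits = new ArrayList<>();
        for (FileStatus status : listStatus(job)) {
            Path path = status.getPath();
            FileSystem fs = path.getFileSystem(job.getConfiguration());
            FSDataInputStream in = fs.open(path);
            LineReader reader = new LineReader(in, job.getConfiguration());
            try {
                Text line = new Text();
                long splitStart = 0;   // byte offset where the current split begins
                long pos = 0;          // current byte offset in the file
                int words = 0;
                int bytesRead;
                while ((bytesRead = reader.readLine(line)) > 0) {
                    pos += bytesRead;
                    words += new StringTokenizer(line.toString()).countTokens();
                    if (words >= WORDS_PER_SPLIT) {
                        splits.add(new FileSplit(path, splitStart, pos - splitStart, new String[0]));
                        splitStart = pos;
                        words = 0;
                    }
                }
                if (pos > splitStart) {   // remainder of the file becomes the last split
                    splits.add(new FileSplit(path, splitStart, pos - splitStart, new String[0]));
                }
            } finally {
                reader.close();
            }
        }
        return splits;
    }
}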
HTH
If you define a small max split size you can actually have multiple mappers processing a single HDFS block (say a 32 MB max split for a 128 MB block size - you'll get 4 mappers working on the same HDFS block). With the standard input formats, you'll typically never see two or more mappers processing the same part of the block (the same records).
MapContext.getInputSplit() can usually be cast to a FileSplit, and then you have the Path, offset and length of the file/block being processed.
If your input files are true text files, then you can use the method suggested by Tariq, but note that this is highly inefficient for larger data sources, since the job client has to process each input file to discover the split locations (so you end up reading each file twice). If you really only want each mapper to process a set number of words, you could run a job to re-format the text files into sequence files (or another format), writing the records to disk with a fixed number of words per file (using MultipleOutputs to get a file per number of words, but this again is inefficient). Maybe if you shared the use case for why you want a fixed number of words, we could better understand your needs and come up with alternatives.

Custom input splits for streaming the data in MapReduce

I have a large data set that is ingested into HDFS as sequence files, with the key being the file metadata and value the entire file contents. I am using SequenceFileInputFormat and hence my splits are based on the sequence file sync points.
The issue I am facing is that when I ingest really large files, I basically load the entire file into memory in the Mapper/Reducer, because the value is the entire file content. I am looking for ways to stream the file contents while retaining the SequenceFile container. I even thought about writing custom splits, but I am not sure how I would retain the sequence file container.
Any ideas would be helpful.
The custom split approach is not suitable for this scenario, for the following two reasons.
1) The entire file is loaded onto the map node because the map function needs the entire file (the value is the entire content). If you split the file, the map function receives only a partial record (value) and it would fail.
2) The sequence file container is probably treating your file as a 'single record' file, so it would have at most one sync point, right after the header. So even if you retain the sequence file container's sync points, the whole file gets loaded onto the map node just as it is being loaded now.
I had concerns about losing the sequence file's sync points if I wrote a custom split. I was thinking of modifying the SequenceFile input format / record reader to return chunks of the file contents, as opposed to the entire file, but to return the same key for every chunk.
The chunking strategy would be similar to how file splits are calculated in MapReduce.
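A very rough sketch of that idea, assuming Text keys, BytesWritable values and a made-up CHUNK_SIZE. It wraps the stock SequenceFileRecordReader and hands each map() call a fixed-size chunk of the current value under the same key. Note that the delegate still materializes each full value in memory, so this only bounds what a single map() call sees, not the reader's footprint.

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader;

// Hypothetical input format: each map() call receives at most CHUNK_SIZE bytes of a value.
public class ChunkedSequenceFileInputFormat extends SequenceFileInputFormat<Text, BytesWritable> {

    private static final int CHUNK_SIZE = 4 * 1024 * 1024;   // assumed 4 MB chunks

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new ChunkedRecordReader();
    }

    static class ChunkedRecordReader extends RecordReader<Text, BytesWritable> {
        private final SequenceFileRecordReader<Text, BytesWritable> delegate =
                new SequenceFileRecordReader<Text, BytesWritable>();
        private Text key;
        private byte[] wholeValue;     // current record's full value
        private int offset;            // next chunk offset within wholeValue
        private final BytesWritable chunk = new BytesWritable();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            delegate.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            // Serve the next chunk of the current record, or advance to the next record.
            // (Zero-length values are skipped by this sketch.)
            while (wholeValue == null || offset >= wholeValue.length) {
                if (!delegate.nextKeyValue()) {
                    return false;
                }
                key = delegate.getCurrentKey();
                BytesWritable v = delegate.getCurrentValue();
                wholeValue = Arrays.copyOfRange(v.getBytes(), 0, v.getLength());
                offset = 0;
            }
            int end = Math.min(offset + CHUNK_SIZE, wholeValue.length);
            chunk.set(wholeValue, offset, end - offset);
            offset = end;
            return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return chunk; }
        @Override public float getProgress() throws IOException, InterruptedException { return delegate.getProgress(); }
        @Override public void close() throws IOException { delegate.close(); }
    }
}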

Get Line number in map method using FileInputFormat

I was wondering whether it is possible to get the line number in my map method?
My input file is just a single column of values like:
Apple
Orange
Banana
Is it possible to get Key: 1, Value: Apple, Key: 2, Value: Orange, ... in my map method?
Using CDH3/CDH4. Changing the input data so as to use KeyValueInputFormat is not an option.
Thanks ahead.
The default behaviour of InputFormats such as TextInputFormat is to give the byte offset of the record rather than the actual line number - this is mainly because the true line number cannot be determined when an input file is splittable and is being processed by two or more mappers.
You could create your own InputFormat (based upon TextInputFormat and the associated LineRecordReader) to produce line numbers rather than byte offsets, but you'd need to configure your input format to return false from the isSplitable method (meaning that a large input file would not be processed by multiple mappers). If you have small files, or files that are close in size to the HDFS block size, then this shouldn't be a problem. Also, non-splittable compression formats (gzip .gz, for example) mean the entire file will be processed by a single mapper anyway.
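A minimal sketch of that approach (non-splittable, so only suitable when one mapper can handle each file); it wraps the stock LineRecordReader and substitutes a 1-based line counter for the byte offset.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical input format: keys are 1-based line numbers instead of byte offsets.
public class LineNumberInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // One mapper per file, otherwise line numbers cannot be computed reliably.
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new LineNumberRecordReader();
    }

    static class LineNumberRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader delegate = new LineRecordReader();
        private final LongWritable lineNumber = new LongWritable(0);

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            delegate.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (!delegate.nextKeyValue()) {
                return false;
            }
            lineNumber.set(lineNumber.get() + 1);   // count lines instead of byte offsets
            return true;
        }

        @Override public LongWritable getCurrentKey() { return lineNumber; }
        @Override public Text getCurrentValue() { return delegate.getCurrentValue(); }
        @Override public float getProgress() throws IOException { return delegate.getProgress(); }
        @Override public void close() throws IOException { delegate.close(); }
    }
}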

How can I emit a list of values from a Mapper or Reducer?

I have a file that contains some geophysical data (seismic data). I am reading these files from the local file system and storing them as Hadoop sequence files in HDFS.
Now I want to write a MapReduce job that can read the values from these sequence files and store them in an HBase table. These files are not simply flat files. Instead they consist of many pieces, where each piece is a block of 240 bytes with several fields. Each field can be either a short or an integer. I am using the block number as the key and the 240-byte array (which contains all the fields) as the value of the sequence file. So each sequence file holds all the blocks as byte arrays, keyed by their block number.
My question is: while processing such a file, how can I read each 240-byte block, extract the individual fields, and emit all the fields in one shot once a 240-byte block is done? Suppose I have a file that has 1000 blocks. Then in my MapReduce program I have to read these 1000 blocks one at a time, extract each field (short or int), and emit all the fields as the result of one map call.
I need some help regarding this.
Just to make sure: you want to read each 240-byte block, and emit the block number as the key and the byte array as the value? I think you'll have to extend the default SequenceFileInputFormat. I'm not exactly sure how sequence files work, or what their structure is like (sorry), but I was trying to read the entire contents of a file to emit as an output value, and the way I did it was to extend FileInputFormat. Perhaps you can take a look at the source code for SequenceFileInputFormat and see if there is a way to make an InputSplit every 240 bytes (if your data is structured that way), or at some delimiter.
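If the 240-byte blocks are already stored one per SequenceFile record, the field extraction can also happen inside the mapper itself. A rough sketch, assuming IntWritable block numbers as keys, BytesWritable values, and a made-up field layout (the real offsets and types come from your 240-byte block definition):

import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: key = block number, value = one 240-byte block of fields.
public class BlockFieldMapper extends Mapper<IntWritable, BytesWritable, Text, IntWritable> {

    @Override
    protected void map(IntWritable blockNumber, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // Only the first getLength() bytes of the backing buffer belong to this record.
        ByteBuffer buf = ByteBuffer.wrap(value.getBytes(), 0, value.getLength());

        // Made-up layout: real field names/offsets depend on the 240-byte block definition.
        int fieldA = buf.getInt();     // 4-byte integer field
        short fieldB = buf.getShort(); // 2-byte short field
        short fieldC = buf.getShort(); // 2-byte short field
        // ... continue until all 240 bytes of the block are consumed.

        // Emit every field for this block "in one shot" as separate key/value pairs.
        String prefix = "block-" + blockNumber.get() + ":";
        context.write(new Text(prefix + "fieldA"), new IntWritable(fieldA));
        context.write(new Text(prefix + "fieldB"), new IntWritable(fieldB));
        context.write(new Text(prefix + "fieldC"), new IntWritable(fieldC));
    }
}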
Hope this helps!
