BZip2 Native Splitting on Amazon/EMR - hadoop

We have a question in specific regard to compressed input on an Amazon EMR Hadoop job.
According to AWS:
"Hadoop checks the file extension to detect compressed files. The compression types supported by Hadoop are: gzip, bzip2, and LZO. You do not need to take any additional action to extract files using these types of compression; Hadoop handles it for you."
q.v., http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HowtoProcessGzippedFiles.html
Which seems good--however, looking into BZip2, it appears that the "split" boundaries would be file-based:
.magic:16 = 'BZ' signature/magic number
.version:8 = 'h' for Bzip2 ('H'uffman coding), '0' for Bzip1 (deprecated)
.hundred_k_blocksize:8 = '1'..'9' block-size 100 kB-900 kB (uncompressed)
-->.compressed_magic:48 = 0x314159265359 (BCD (pi))
.crc:32 = checksum for this block
.randomised:1 = 0=>normal, 1=>randomised (deprecated)
.origPtr:24 = starting pointer into BWT for after untransform
.huffman_used_map:16 = bitmap, of ranges of 16 bytes, present/not present
.huffman_used_bitmaps:0..256 = bitmap, of symbols used, present/not present (multiples of 16)
.huffman_groups:3 = 2..6 number of different Huffman tables in use
.selectors_used:15 = number of times that the Huffman tables are swapped (each 50 symbols)
*.selector_list:1..6 = zero-terminated bit runs (0..62) of MTF'ed Huffman table (*selectors_used)
.start_huffman_length:5 = 0..20 starting bit length for Huffman deltas
*.delta_bit_length:1..40 = 0=>next symbol; 1=>alter length
.contents:2..∞ = Huffman encoded data stream until end of block
-->.eos_magic:48 = 0x177245385090 (BCD sqrt(pi))
.crc:32 = checksum for whole stream
.padding:0..7 = align to whole byte
With the statement: "Like gzip, bzip2 is only a data compressor. It is not an archiver like tar or ZIP; the program itself has no facilities for multiple files, encryption or archive-splitting, but, in the UNIX tradition, relies instead on separate external utilities such as tar and GnuPG for these tasks."
q.v., http://en.wikipedia.org/wiki/Bzip2
I interpret the combination of these two statements to mean that BZip2 is "split-able", but only on a per-file basis . . . .
This is relevant, because our job will be receiving a single ~800MiB file via S3--which (if my interpretation is true) would mean that EC2/Hadoop would assign ONE Mapper to the job (for ONE file), which would be sub-optimal, to say the least.
(That being the case, we would obviously need to find a way to partition the input into a set of 400 files before BZip2 is applied as a solution).
Does anyone know for certain if this is how AWS/EMR Hadoop jobs internally function?
Cheers!

Being splittable on file boundaries doesn't really mean anything since a .bz2 file doesn't have any concept of files.
A .bz2 stream consists of a 4-byte header, followed by zero or more compressed blocks.
Compressed blocks are the key here. A .bz2 file can be split on block boundaries. So the number of splits you can create will depend on the size of a compressed block.
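For reference, you can check at runtime whether the codec Hadoop picks for a file is one it knows how to split. A minimal sketch, assuming a Hadoop version recent enough that BZip2Codec implements SplittableCompressionCodec (the input path below is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittableCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Resolve the codec from the file extension, just as TextInputFormat does.
        CompressionCodec codec =
            new CompressionCodecFactory(conf).getCodec(new Path("/data/input/big-file.bz2"));
        // BZip2Codec implements SplittableCompressionCodec, so this should print true;
        // a .gz file would resolve to GzipCodec and print false.
        System.out.println("splittable: " + (codec instanceof SplittableCompressionCodec));
    }
}

GzipCodec does not implement that interface, which is why a .gz file of the same size would still end up in a single split.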
Edit (based on your comment):
A split boundary in Hadoop can often occur halfway through a record, whether or not the data is compressed. TextInputFormat will split on HDFS block boundaries. The trick is in the RecordReader.
Let's say we have a split boundary in the middle of the 10th record in a file. The mapper that reads the first split will read up to the end of the 10th record, even though that record ends outside of the mapper's allotted split. The second mapper then ignores the first partial record in its split, since it has already been read by the first mapper.
This only works if you can reliably find the end of a record if you are given an arbitrary byte offset into the record.
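To make the trick concrete, here is a stripped-down sketch of what a line-oriented RecordReader does. It is illustrative only (not the actual LineRecordReader source), but it shows both halves of the trick: skip the partial first record unless this is the first split, and read past the split end to finish the last record.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// Illustrative fragment only: error handling, key generation, close(), etc. omitted.
class SplitAwareLineReader {
    private LineReader in;
    private long start, end, pos;

    void initialize(FileSplit split, FileSystem fs) throws IOException {
        start = split.getStart();
        end = start + split.getLength();
        FSDataInputStream file = fs.open(split.getPath());
        file.seek(start);
        in = new LineReader(file);
        if (start != 0) {
            // Not the first split: throw away the (possibly partial) first line,
            // because the previous split's reader has already consumed it.
            start += in.readLine(new Text());
        }
        pos = start;
    }

    boolean nextRecord(Text value) throws IOException {
        // Keep reading while the current position is at or before the split end;
        // the last record is allowed to run past 'end' into the next split.
        if (pos <= end) {
            int bytesRead = in.readLine(value);
            pos += bytesRead;
            return bytesRead > 0;
        }
        return false;
    }
}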

From what I understand so far, bzip2 files are potentially splittable, and you can do it in Hadoop, but it is still not specifically supported in AWS EMR, at least not "right away". So if you just run a job with a large bzip2 file, you are going to get a single mapper in the first step. I have just recently tried it. Apparently that is also what happens with indexed LZO files, unless you do some black magic. I am not sure there is a corresponding black magic for splitting bzip2 files in EMR either.

Related

File compression formats and container file formats

It is generally said that any compression format like Gzip, when used along with a container file format like Avro or SequenceFile, will make the compression format splittable.
Does this mean that the blocks in the container format get compressed based on the preferred compression (like gzip), or something else? Can someone please explain this? Thanks!
Well, I think the question requires an update.
Update:
Do we have a straightforward approach to convert a large file in a non-splittable file compression format (like Gzip) into a splittable file (using a container file format such as Avro, Sequence or Parquet) to be processed by MapReduce?
Note: I do not mean to ask for workarounds such as uncompressing the file, and again compressing the data using a splittable compression format.
For Sequence files, if you specify BLOCK compression, each block will be compressed using the specified compression codec. Blocks allow Hadoop to split data at the block level while using compression (where the compression itself isn't splittable) and to skip whole blocks without needing to decompress them.
Most of this is described on the Hadoop wiki: https://wiki.apache.org/hadoop/SequenceFile
Block compressed key/value records - both keys and values are
collected in 'blocks' separately and compressed. The size of the
'block' is configurable.
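As an illustration, here is a minimal sketch of writing a block-compressed SequenceFile; the output path is a placeholder and GzipCodec is just one possible (non-splittable on its own) codec choice:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class BlockCompressedWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        GzipCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/data/out/blocks.seq")),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                // BLOCK compression: keys and values are buffered and each block is
                // compressed with the codec, so Hadoop can split on block boundaries
                // even though gzip by itself is not splittable.
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, codec))) {
            writer.append(new LongWritable(1), new Text("first record"));
            writer.append(new LongWritable(2), new Text("second record"));
        }
    }
}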
For Avro this is all very similar as well: https://avro.apache.org/docs/1.7.7/spec.html#Object+Container+Files
Objects are stored in blocks that may be compressed. Synchronization
markers are used between blocks to permit efficient splitting of files
for MapReduce processing.
Thus, each block's binary data can be efficiently extracted or skipped
without deserializing the contents.
The easiest (and usually fastest) way to convert data from one format into another is to let MapReduce do the work for you. In the example of:
GZip Text -> SequenceFile
You would have a map-only job that uses TextInputFormat for input and SequenceFileOutputFormat for output. This way you get a 1-to-1 conversion on the number of files (add a reduce step if this needs changing), and you do the conversion in parallel if there are lots of files to convert.
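For what it's worth, a minimal sketch of the driver for such a map-only conversion might look like this (paths come from the command line, the default identity Mapper is used since only the container format changes, and the output is block-compressed so it stays both compressed and splittable):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class GzipTextToSequenceFile {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "gzip-text-to-seqfile");
        job.setJarByClass(GzipTextToSequenceFile.class);

        // Gzipped text is read transparently; each .gz file becomes one map task.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        // Default (identity) mapper, no reducers: one output SequenceFile per input file.
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Block-compress the output so it stays compressed *and* splittable.
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}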
I am not sure this is exactly what you are asking about... but any file can be split at any point.
Why do I say this? Assuming you are using Linux or something similar:
On Linux it is reasonably easy to create a block device that is really backed by the concatenation of several files.
I mean:
You split a file into as many chunks as you want, each of whatever size you like; the chunks do not need to be of equal or even size, multiples of 512 bytes, etc.
You define a block device that concatenates all the chunks in the correct order
You create a symbolic link to that device
That way you can have a BIG file (more than 4GiB, even more than 16GiB) stored on one FAT32 partition (which has a limit of 4GiB-1 bytes per file)... and access it on the fly and transparently... at least for reading.
For read/write there is a trick (that is the complex part) that works:
Split the file (this time into chunks of N*512 bytes)
Define a parametrized device driver (so it knows how to allocate more chunks by creating more files)
On Linux I have used, in the past, some command-line tools that do all of this work; they let you create a virtual container, resizable on the fly, that uses files of an exact size (including the last one) and exposes it as a regular block device (where you can do a dd if=... of=... to fill it) with a virtual file associated with it.
That way you have:
Some not-so-big files of identical size
They hold the real data of the stream inside
They are created / deleted as needed (grow / shrink or truncate)
They are exposed as a regular file at some point
Accessing that file is like seeing the concatenation
Maybe that gives you an idea for another approach to the problem you are having:
Instead of tweaking the compression system, just put in a layer (a little bit more complex than a simple loop device) that does the split/join on the fly and transparently
Such tools exist; I do not remember the names, sorry! But I remember the read-only one (the dvd_double_layer.* files are on a FAT32 partition):
# cd /mnt/FAT32
# ls -lh dvd_double_layer.*
total #
-r--r--r-- 1 root root 3.5G 2017-04-20 13:10 dvd_double_layer.000
-r--r--r-- 1 root root 3.5G 2017-04-20 13:11 dvd_double_layer.001
-r--r--r-- 1 root root 0.2G 2017-04-20 13:12 dvd_double_layer.002
# affuse dvd_double_layer.000 /mnt/transparent_concatenated_on_the_fly
# cd /mnt/transparent_concatenated_on_the_fly
# ln -s dvd_double_layer.000.raw dvd_double_layer.iso
# ls -lh dvd_double_layer.*
total #
-r--r--r-- 1 root root 7.2G 2017-04-20 13:13 dvd_double_layer.000.raw
-r--r--r-- 1 root root 7.2G 2017-04-20 13:14 dvd_double_layer.iso
Hope this idea can help you.

Each run of the same Hadoop SequenceFile creation routine creates a file with different crc. Is it ok?

I have some simple code which creates a Hadoop SequenceFile. Each time the code is run, it leaves two files in the working dir:
mySequenceFile.txt
.mySequenceFile.txt.crc
After each run the sizes of both files remain the same. But the crc file contents become different!
Is this a bug or an expected behaviour?
This is confusing, but expected behaviour.
According to the SequenceFile format, each SequenceFile has a sync block that is 16 bytes long. The sync block repeats after each record in block-compressed SequenceFiles, and after some records or one very long record in uncompressed or record-compressed SequenceFiles.
The thing is that the sync block is essentially a random value. It is written in the header, which is how the reader recognizes it. It stays the same within one SequenceFile, but it can be (and actually is) different from one SequenceFile to another.
So the files are logically the same but binary different. The CRC is a binary checksum, so it differs between the two files too.
I haven't found any way to set this sync block manually. If someone finds a way, please write it here.
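If you want to see this for yourself, a small sketch like the following (local paths are placeholders) writes the same single record into two SequenceFiles and compares the raw bytes; the comparison should report that the bytes differ even though the logical contents are identical:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SyncMarkerDemo {
    static void writeOneRecord(Configuration conf, Path p) throws Exception {
        try (SequenceFile.Writer w = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(p),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            w.append(new Text("key"), new Text("value"));
        }
    }

    static byte[] readAll(FileSystem fs, Path p) throws Exception {
        byte[] buf = new byte[(int) fs.getFileStatus(p).getLen()];
        try (FSDataInputStream in = fs.open(p)) {
            in.readFully(0, buf);
        }
        return buf;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path a = new Path("sync-demo-a.seq");
        Path b = new Path("sync-demo-b.seq");
        writeOneRecord(conf, a);
        writeOneRecord(conf, b);
        // Same length and same logical record, but the randomly generated 16-byte
        // sync marker stored in each header makes the raw bytes differ.
        System.out.println("byte-identical? " + Arrays.equals(readAll(fs, a), readAll(fs, b)));
    }
}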

Do we need to create an index file (with lzop) if compression type is RECORD instead of block?

As I understand it, an index file is needed to make the output splittable. If mapred.output.compression.type=SequenceFile.CompressionType.RECORD, do we still need to create an index file?
Short answer:
RECORD and BLOCK compression.type properties apply to sequence files, not to simple text files (which can be independently compressed with lzo or gzip or bz2 ...)
More info:
LZO is a compression codec which gives faster compression and decompression than gzip, and also the capability to split. LZO allows this because it's composed of many smaller (~256K) blocks of compressed data, allowing jobs to be split along block boundaries, as opposed to gzip, where the whole file is compressed as a single stream that must be read from the beginning.
When you specify mapred.output.compression.codec as LzoCodec, hadoop will generate .lzo_deflate files. These contain the raw compressed data without any header, and cannot be decompressed with the lzop -d command. Hadoop can read these files in the map phase, but this makes your life hard.
When you specify LzopCodec as the compression codec, hadoop will generate .lzo files. These contain the header and can be decompressed using lzop -d.
However, neither .lzo nor .lzo_deflate files are splittable by default. This is where LzoIndexer comes into play. It generates an index file which tells you where the record boundaries are. This way, multiple map tasks can process the same file.
See this cloudera blog post and LzoIndexer for more info.
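To illustrate the LzopCodec vs. LzoCodec difference on the output side, this is roughly how you would configure a job (new mapreduce API) to emit lzop-compatible .lzo files; com.hadoop.compression.lzo.LzopCodec comes from the separately installed hadoop-lzo library, and you would still run its LzoIndexer over the output afterwards to make the files splittable:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import com.hadoop.compression.lzo.LzopCodec;   // from hadoop-lzo, not core Hadoop

public class LzoOutputConfig {
    // Call this while setting up the Job. With LzopCodec the output files get the
    // .lzo extension and a proper header, so `lzop -d` can decompress them;
    // with LzoCodec instead you would get headerless .lzo_deflate files.
    static void configureLzoOutput(Job job) {
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, LzopCodec.class);
    }
}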

How to use "typedbytes" or "rawbytes" in Hadoop Streaming?

I have a problem that would be solved by Hadoop Streaming in "typedbytes" or "rawbytes" mode, which allow one to analyze binary data in a language other than Java. (Without this, Streaming interprets some characters, usually \t and \n, as delimiters and complains about non-utf-8 characters. Converting all my binary data to Base64 would slow down the workflow, defeating the purpose.)
These binary modes were added by HADOOP-1722. On the command line that invokes a Hadoop Streaming job, "-io rawbytes" lets you define your data as a 32-bit integer size followed by raw data of that size, and "-io typedbytes" lets you define your data as a one-byte zero type code (which means raw bytes), followed by a 32-bit integer size, followed by raw data of that size. I have created files with these formats (with one or many records) and verified that they are in the right format by checking them against the output of typedbytes.py. I've also tried all conceivable variations (big-endian, little-endian, different byte offsets, etc.). I'm using Hadoop 0.20 from CDH4, which has the classes that implement the typedbytes handling, and it is entering those classes when the "-io" switch is set.
I copied the binary file to HDFS with "hadoop fs -copyFromLocal". When I try to use it as input to a map-reduce job, it fails with an OutOfMemoryError on the line where it tries to make a byte array of the length I specify (e.g. 3 bytes). It must be reading the number incorrectly and trying to allocate a huge block instead. Despite this, it does manage to get a record to the mapper (the previous record? not sure), which writes it to standard error so that I can see it. There are always too many bytes at the beginning of the record: for instance, if the file is "\x00\x00\x00\x00\x03hey", the mapper would see "\x04\x00\x00\x00\x00\x00\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\x00\x03hey" (reproducible bits, though no pattern that I can see).
From page 5 of this talk, I learned that there are "loadtb" and "dumptb" subcommands of streaming, which copy to/from HDFS and wrap/unwrap the typed bytes in a SequenceFile, in one step. When used with "-inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat", Hadoop correctly unpacks the SequenceFile, but then misinterprets the typedbytes contained within, in exactly the same way.
Moreover, I can find no documentation of this feature. On Feb 7 (I e-mailed it to myself), it was briefly mentioned in the streaming.html page on Apache, but this r0.21.0 webpage has since been taken down and the equivalent page for r1.1.1 has no mention of rawbytes or typedbytes.
So my question is: what is the correct way to use rawbytes or typedbytes in Hadoop Streaming? Has anyone ever gotten it to work? If so, could someone post a recipe? It seems like this would be a problem for anyone who wants to use binary data in Hadoop Streaming, which ought to be a fairly broad group.
P.S. I noticed that Dumbo, Hadoopy, and rmr all use this feature, but there ought to be a way to use it directly, without being mediated by a Python-based or R-based framework.
Okay, I've found a combination that works, but it's weird.
Prepare a valid typedbytes file in your local filesystem, following the documentation or by imitating typedbytes.py.
Use
hadoop jar path/to/streaming.jar loadtb path/on/HDFS.sequencefile < local/typedbytes.tb
to wrap the typedbytes in a SequenceFile and put it in HDFS, in one step.
Use
hadoop jar path/to/streaming.jar -inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat ...
to run a map-reduce job in which the mapper gets input from the SequenceFile. Note that -io typedbytes or -D stream.map.input=typedbytes should not be used: explicitly asking for typedbytes leads to the misinterpretation I described in my question. But fear not: Hadoop Streaming splits the input on its binary record boundaries and not on its '\n' characters. The data arrive in the mapper as "rawdata" separated by '\t' and '\n', like this (see the parsing sketch after the list):
32-bit signed integer, representing length (note: no type character)
block of raw binary with that length: this is the key
'\t' (tab character... why?)
32-bit signed integer, representing length
block of raw binary with that length: this is the value
'\n' (newline character... ?)
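Here is a rough sketch of a Java streaming mapper that parses that layout from stdin; it simply mirrors the byte layout listed above, and the big-endian 32-bit lengths are an assumption based on Hadoop's usual conventions:

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class RawBytesStdinMapper {
    public static void main(String[] args) throws IOException {
        DataInputStream in = new DataInputStream(System.in);
        try {
            while (true) {
                byte[] key = readChunk(in);    // length-prefixed key bytes
                in.readByte();                 // the '\t' separator
                byte[] value = readChunk(in);  // length-prefixed value bytes
                in.readByte();                 // the '\n' separator
                // ... process key/value here, then emit the mapper's output ...
                System.err.println("key bytes: " + key.length + ", value bytes: " + value.length);
            }
        } catch (EOFException done) {
            // end of input
        }
    }

    // Reads a 4-byte length (assumed big-endian) followed by that many raw bytes.
    static byte[] readChunk(DataInputStream in) throws IOException {
        int len = in.readInt();
        byte[] buf = new byte[len];
        in.readFully(buf);
        return buf;
    }
}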
If you want to additionally send raw data from mapper to reducer, add
-D stream.map.output=typedbytes -D stream.reduce.input=typedbytes
to your Hadoop command line and format the mapper's output and reducer's expected input as valid typedbytes. They also alternate for key-value pairs, but this time with type characters and without '\t' and '\n'. Hadoop Streaming correctly splits these pairs on their binary record boundaries and groups by keys.
The only documentation on stream.map.output and stream.reduce.input that I could find was in the HADOOP-1722 exchange, starting 6 Feb 09. (Earlier discussion considered a different way to parameterize the formats.)
This recipe does not provide strong typing for the input: the type characters are lost somewhere in the process of creating a SequenceFile and interpreting it with the -inputformat. It does, however, provide splitting at the binary record boundaries, rather than '\n', which is the really important thing, and strong typing between the mapper and the reducer.
We solved the binary data issue by hex-encoding the data at the split level when streaming data down to the Mapper. This utilizes and increases the parallel efficiency of your operation, instead of first transforming your data before processing on a node.
Apparently there is a patch for a JustBytes IO mode for streaming, that feeds a whole input file to the mapper command:
https://issues.apache.org/jira/browse/MAPREDUCE-5018

Hadoop: Mapping binary files

Typically the input file is capable of being partially read and processed by the Mapper function (as with text files). Is there anything that can be done to handle binaries (say, images or serialized objects) which would require all the blocks to be on the same host before the processing can start?
Stick your images into a SequenceFile; then you will be able to process them iteratively, using map-reduce.
To be a bit less cryptic: Hadoop does not natively know anything about text and not-text. It just has a class that knows how to open an input stream (HDFS handles stitching together blocks on different nodes to make them appear as one large file). On top of that, you have a RecordReader and an InputFormat that know how to determine where in that stream records start, where they end, and how to find the beginning of the next record if you are dropped somewhere in the middle of the file. TextInputFormat is just one implementation, which treats newlines as the record delimiter. There is also a special format called a SequenceFile that you can write arbitrary binary records into, and then get them back out. Use that.
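A minimal sketch of the "stick your images into a SequenceFile" idea, using the file name as the key and the raw bytes as a BytesWritable value (the directory and output path are placeholders; in practice you would more likely do this with a small MapReduce job or a loader script):

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImagesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        File[] images = new File("/local/images").listFiles();   // placeholder local directory
        if (images == null) {
            throw new IllegalStateException("input directory not found");
        }
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("images.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (File img : images) {
                byte[] bytes = Files.readAllBytes(img.toPath());
                // One record per image: the whole binary blob travels as a single value,
                // so a mapper always sees complete images, never fragments.
                writer.append(new Text(img.getName()), new BytesWritable(bytes));
            }
        }
    }
}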
