File compression formats and container file formats - hadoop

It is generally said that a compression format like Gzip, when used along with a container file format like Avro or SequenceFile, makes the compressed data splittable.
Does this mean that the blocks in the container format get compressed with the chosen codec (like gzip), or is something else going on? Can someone please explain this? Thanks!
Well, I think the question requires an update.
Update:
Do we have a straightforward approach to convert a large file in a non-splittable compression format (like Gzip) into a splittable file (using a container file format such as Avro, SequenceFile or Parquet) so that it can be processed by MapReduce?
Note: I do not mean to ask for workarounds such as uncompressing the file and then recompressing the data using a splittable compression format.

For SequenceFiles, if you specify BLOCK compression, each block is compressed with the specified compression codec. Blocks allow Hadoop to split the data at the block level even though the compression itself isn't splittable, and to skip whole blocks without needing to decompress them.
Most of this is described on the Hadoop wiki: https://wiki.apache.org/hadoop/SequenceFile
Block compressed key/value records - both keys and values are
collected in 'blocks' separately and compressed. The size of the
'block' is configurable.
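To make that concrete, here is a minimal sketch (the output path and key/value types are made up for the example) of writing a block-compressed SequenceFile with the gzip codec; each buffered block of records is compressed as a unit, and the sync markers written between blocks are what later allow splitting:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class BlockCompressedSeqFileWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/example.seq"); // hypothetical output path
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        // BLOCK compression: keys and values are buffered and compressed in blocks,
        // with a sync marker written between blocks.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, Text.class,
                SequenceFile.CompressionType.BLOCK, codec);
        try {
            for (int i = 0; i < 100000; i++) {
                writer.append(new Text("key-" + i), new Text("value-" + i));
            }
        } finally {
            writer.close();
        }
    }
}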
For Avro this is all very similar: https://avro.apache.org/docs/1.7.7/spec.html#Object+Container+Files
Objects are stored in blocks that may be compressed. Synchronization
markers are used between blocks to permit efficient splitting of files
for MapReduce processing.
Thus, each block's binary data can be efficiently extracted or skipped
without deserializing the contents.
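A comparable minimal sketch for Avro (the schema and file name are invented for the example); the DataFileWriter buffers records into blocks, compresses each block with the chosen codec, and writes a sync marker after each block:
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroBlockCompressionExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Line\","
              + " \"fields\":[{\"name\":\"text\",\"type\":\"string\"}]}");

        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        writer.setCodec(CodecFactory.deflateCodec(6)); // per-block deflate compression
        writer.create(schema, new File("lines.avro"));
        try {
            for (int i = 0; i < 100000; i++) {
                GenericRecord rec = new GenericData.Record(schema);
                rec.put("text", "line number " + i);
                writer.append(rec); // buffered and flushed block by block
            }
        } finally {
            writer.close();
        }
    }
}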
The easiest (and usually fastest) way to convert data from one format into another is to let MapReduce do the work for you. In the example of:
GZip Text -> SequenceFile
You would have a map-only job that uses TextInputFormat for input and SequenceFileOutputFormat for output. This way you get a 1-to-1 conversion on the number of files (add a reduce step if this needs changing), and the conversion runs in parallel if there are lots of files to convert.
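A driver for that conversion might look roughly like the sketch below (the class name and compression choices are illustrative). TextInputFormat decompresses each .gz input file in a single mapper, and the identity mapper's (offset, line) records are written out as block-compressed SequenceFiles:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class GzipTextToSequenceFile {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "gzip-text-to-seqfile");
        job.setJarByClass(GzipTextToSequenceFile.class);

        job.setMapperClass(Mapper.class);   // identity mapper: records pass straight through
        job.setNumReduceTasks(0);           // map-only: one output file per input file

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class); // byte offset produced by TextInputFormat
        job.setOutputValueClass(Text.class);       // the line itself

        // Block-compress the SequenceFile output with gzip.
        SequenceFileOutputFormat.setCompressOutput(job, true);
        SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);

        FileInputFormat.addInputPath(job, new Path(args[0]));        // directory of .gz text files
        SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}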

I do not know exactly what you are talking about... but any file can be split at any point.
Why do I say this? I am assuming you are using Linux or something similar.
On Linux it is reasonably easy to create a block device that is actually backed by the concatenation of several files.
I mean:
You split a file into as many chunks as you want, each of whatever size you like; they do not need to be of equal size, or multiples of 512 bytes, etc.
You define a block device that concatenates all files in the correct order
You define a symbolic link to such device
That way you can have a BIG file (more than 16GiB, more than 4GiB) stored on a FAT32 partition (which has a limit of 4GiB-1 bytes per file)... and access it on the fly and transparently... considering read-only access.
For read/write... there is a trick (that is the complex part) that works:
Split the file (this time into chunks of N*512 bytes)
Define a parameterized device driver (so it knows how to allocate more chunks by creating more files)
On Linux I have used in the past some command-line tools that do all the work; they let you create a virtual container, resizable on the fly, that uses files of an exact size (including the last one) and exposes it as a regular block device (where you can do a dd if=... of=... to fill it) with a virtual file associated with it.
That way you have:
Several not-so-big files of identical size
They hold the real data of the stream
They are created / deleted as needed (grow / shrink or truncate)
They are exposed as a regular file at some point
Accessing that file presents the concatenation on the fly
Maybe that gives you an idea for another approach to the problem you are having:
Instead of tweaking the compression system, just add a layer (a little more complex than a simple loop device) that does the split/join on the fly and transparently
Such tools exist; I do not remember the name, sorry! But I remember the one for read-only access (the dvd_double_layer.* files are on a FAT32 partition):
# cd /mnt/FAT32
# ls -lh dvd_double_layer.*
total #
-r--r--r-- 1 root root 3.5G 2017-04-20 13:10 dvd_double_layer.000
-r--r--r-- 1 root root 3.5G 2017-04-20 13:11 dvd_double_layer.001
-r--r--r-- 1 root root 0.2G 2017-04-20 13:12 dvd_double_layer.002
# affuse dvd_double_layer.000 /mnt/transparent_concatenated_on_the_fly
# cd /mnt/transparent_concatenated_on_the_fly
# ln -s dvd_double_layer.000.raw dvd_double_layer.iso
# ls -lh dvd_double_layer.*
total #
-r--r--r-- 1 root root 7.2G 2017-04-20 13:13 dvd_double_layer.000.raw
-r--r--r-- 1 root root 7.2G 2017-04-20 13:14 dvd_double_layer.iso
Hope this idea can help you.

Related

How to make Hadoop MapReduce process multiple files in a single run?

We run our Hadoop MapReduce program by executing the command $hadoop jar my.jar DriverClass input1.txt hdfsDirectory. How can we make MapReduce process multiple files (input1.txt & input2.txt) in a single run?
Like this:
hadoop jar my.jar DriverClass hdfsInputDir hdfsOutputDir
where
hdfsInputDir is the path on HDFS where your input files are stored (i.e., the parent directory of input1.txt and input2.txt)
hdfsOutputDir is the path on HDFS where the output will be stored (it should not exist before running this command).
Note that your input should be copied to HDFS before running this command.
To copy it to HDFS, you can run:
hadoop dfs -copyFromLocal localPath hdfsInputDir
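For reference, a typical driver (sketched here with the question's class name, but otherwise illustrative) just hands those two arguments to FileInputFormat/FileOutputFormat, and FileInputFormat picks up every file inside hdfsInputDir:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DriverClass {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(DriverClass.class);
        // ... set mapper, reducer and key/value classes here ...

        // Passing a directory means every file inside it (input1.txt, input2.txt, ...)
        // becomes part of the job's input.
        FileInputFormat.addInputPath(job, new Path(args[0]));    // hdfsInputDir
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // hdfsOutputDir (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}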
This is the small files problem: a separate mapper will run for every file.
A small file is one which is significantly smaller than the HDFS block size (default 64MB). If you’re storing small files, then you probably have lots of them (otherwise you wouldn’t turn to Hadoop), and the problem is that HDFS can’t handle lots of files.
Every file, directory and block in HDFS is represented as an object in the namenode's memory, each of which occupies about 150 bytes, as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory (10 million file objects plus 10 million block objects, at roughly 150 bytes each). Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible.
Solution
HAR files
Hadoop Archives (HAR files) were introduced to HDFS in 0.18.0 to alleviate the problem of lots of files putting pressure on the namenode’s memory. HAR files work by building a layered filesystem on top of HDFS. A HAR file is created using the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files. To a client using the HAR filesystem nothing has changed: all of the original files are visible and accessible (albeit using a har:// URL). However, the number of files in HDFS has been reduced.
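For example, a client can open a file that was packed into an archive through the har:// scheme just as it would any other path (the archive and file names below are hypothetical):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHar {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The original file is still addressable, just through a har:// URL
        // instead of its original hdfs:// path.
        Path inHar = new Path("har:///user/alice/logs.har/2017/01/01/app.log");
        FileSystem fs = inHar.getFileSystem(conf);
        FSDataInputStream in = fs.open(inHar);
        try {
            byte[] buf = new byte[4096];
            int n = in.read(buf);
            System.out.println("read " + n + " bytes through the HAR filesystem");
        } finally {
            in.close();
        }
    }
}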
Sequence Files
The usual response to questions about "the small files problem" is: use a SequenceFile. The idea here is that you use the filename as the key and the file contents as the value. This works very well in practice. Taking, say, 10,000 100KB files, you can write a program to put them into a single SequenceFile, and then you can process them in a streaming fashion (directly or using MapReduce) operating on the SequenceFile. There are a couple of bonuses too. SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently. They support compression as well, unlike HARs. Block compression is the best option in most cases, since it compresses blocks of several records (rather than per record).
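A minimal sketch of such a packing program (the local input directory and HDFS output path are made up) that uses each file name as the key and the raw file bytes as the value, with BLOCK compression:
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodec codec = ReflectionUtils.newInstance(DefaultCodec.class, conf);

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/data/packed.seq"),           // hypothetical output
                Text.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK, codec);
        try {
            for (File f : new File("/local/small-files").listFiles()) { // hypothetical input dir
                byte[] contents = Files.readAllBytes(f.toPath());
                // Key = file name, value = whole file contents, as described above.
                writer.append(new Text(f.getName()), new BytesWritable(contents));
            }
        } finally {
            writer.close();
        }
    }
}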

Do we need to create an index file (with lzop) if the compression type is RECORD instead of BLOCK?

As I understand it, an index file is needed to make the output splittable. If mapred.output.compression.type=SequenceFile.CompressionType.RECORD, do we still need to create an index file?
Short answer:
The RECORD and BLOCK values of the compression.type property apply to sequence files, not to plain text files (which can be independently compressed with lzo, gzip, bz2, ...).
More info:
LZO is a compression codec which gives better compression and decompression speed than gzip, and also the capability to split. LZO allows this because it is composed of many smaller (~256K) blocks of compressed data, allowing jobs to be split along block boundaries, as opposed to gzip, where the dictionary for the whole file is written at the top.
When you specify mapred.output.compression.codec as LzoCodec, Hadoop will generate .lzo_deflate files. These contain the raw compressed data without any header and cannot be decompressed with the lzop -d command. Hadoop can read these files in the map phase, but this makes your life hard.
When you specify LzopCodec as the compression.codec, Hadoop will generate .lzo files. These contain the header and can be decompressed using lzop -d.
However, neither .lzo nor .lzo_deflate files are splittable by default. This is where LzoIndexer comes into play. It generates an index file which tells you where the record boundary is. This way, multiple map tasks can process the same file.
See this cloudera blog post and LzoIndexer for more info.
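For completeness, requesting .lzo output looks roughly like the sketch below. It assumes the external hadoop-lzo library (which provides com.hadoop.compression.lzo.LzopCodec and the LzoIndexer tool) is installed on the cluster, and it uses the old-style mapred.* property names from the question:
import org.apache.hadoop.conf.Configuration;

public class LzoOutputConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Produce .lzo files (with headers, readable by `lzop -d`) rather than
        // headerless .lzo_deflate files.
        conf.setBoolean("mapred.output.compress", true);
        conf.set("mapred.output.compression.codec",
                 "com.hadoop.compression.lzo.LzopCodec");

        // ... build and submit the job with this Configuration as usual ...
        // Afterwards, run hadoop-lzo's LzoIndexer over the job output so that the
        // .lzo files get index companions and become splittable.
    }
}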

BZip2 Native Splitting on Amazon/EMR

We have a question specifically regarding compressed input to an Amazon EMR Hadoop job.
According to AWS:
"Hadoop checks the file extension to detect compressed files. The compression types supported by Hadoop are: gzip, bzip2, and LZO. You do not need to take any additional action to extract files using these types of compression; Hadoop handles it for you."
q.v., http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HowtoProcessGzippedFiles.html
Which seems good--however, looking into BZip2, it appears that the "split" boundaries would be file-based:
.magic:16 = 'BZ' signature/magic number
.version:8 = 'h' for Bzip2 ('H'uffman coding), '0' for Bzip1 (deprecated)
.hundred_k_blocksize:8 = '1'..'9' block-size 100 kB-900 kB (uncompressed)
**-->.compressed_magic:48 = 0x314159265359 (BCD (pi))**
.crc:32 = checksum for this block
.randomised:1 = 0=>normal, 1=>randomised (deprecated)
.origPtr:24 = starting pointer into BWT for after untransform
.huffman_used_map:16 = bitmap, of ranges of 16 bytes, present/not present
.huffman_used_bitmaps:0..256 = bitmap, of symbols used, present/not present (multiples of 16)
.huffman_groups:3 = 2..6 number of different Huffman tables in use
.selectors_used:15 = number of times that the Huffman tables are swapped (each 50 bytes)
*.selector_list:1..6 = zero-terminated bit runs (0..62) of MTF'ed Huffman table (*selectors_used)
.start_huffman_length:5 = 0..20 starting bit length for Huffman deltas
*.delta_bit_length:1..40 = 0=>next symbol; 1=>alter length
.contents:2..8 = Huffman encoded data stream until end of block
**-->.eos_magic:48 = 0x177245385090 (BCD sqrt(pi))**
.crc:32 = checksum for whole stream
.padding:0..7 = align to whole byte
With the statement: "Like gzip, bzip2 is only a data compressor. It is not an archiver like tar or ZIP; the program itself has no facilities for multiple files, encryption or archive-splitting, but, in the UNIX tradition, relies instead on separate external utilities such as tar and GnuPG for these tasks."
q.v., http://en.wikipedia.org/wiki/Bzip2
I interpret the combination of these two statements to mean that BZip2 is "splittable", but only on a per-file basis...
This is relevant, because our job will be receiving a single ~800MiB file via S3--which (if my interpretation is true) would mean that EC2/Hadoop would assign ONE Mapper to the job (for ONE file), which would be sub-optimal, to say the least.
(That being the case, we would obviously need to find a way to partition the input into a set of 400 files before BZip2 is applied as a solution.)
Does anyone know for certain if this is how AWS/EMR Hadoop jobs internally function?
Cheers!
Being splittable on file boundaries doesn't really mean anything since a .bz2 file doesn't have any concept of files.
A .bz2 stream consists of a 4-byte header, followed by zero or more compressed blocks
Compressed blocks are the key here. A .bz2 file can be split on block boundaries. So the number of splits you can create will depend on the size of a compressed block.
Edit (based on your comment):
A split boundary in Hadoop can often occur halfway through a record, whether or not the data is compressed. TextInputFormat splits on HDFS block boundaries. The trick is in the RecordReader.
Let's say we have a split boundary in the middle of the 10th record in a file. The mapper that reads the first split will read up to the end of the 10th record, even though that record ends outside of the mapper's allotted split. The second mapper then ignores the first partial record in its split, since it has already been read by the first mapper.
This only works if you can reliably find the end of a record if you are given an arbitrary byte offset into the record.
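Below is a simplified sketch of that convention for newline-delimited records. It mirrors what Hadoop's own LineRecordReader does, but the class name and details are illustrative rather than a drop-in replacement:
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class BoundaryAwareLineReader extends RecordReader<LongWritable, Text> {
    private long start, end, pos;
    private LineReader in;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext ctx) throws IOException {
        FileSplit split = (FileSplit) genericSplit;
        start = split.getStart();
        end = start + split.getLength();
        FileSystem fs = split.getPath().getFileSystem(ctx.getConfiguration());
        FSDataInputStream file = fs.open(split.getPath());
        if (start != 0) {
            // Not the first split: back up one byte and discard everything up to the
            // next newline. This skips the partial record already read by the previous
            // mapper, while a record that begins exactly at `start` is kept.
            file.seek(start - 1);
            in = new LineReader(file);
            start = start - 1 + in.readLine(new Text());
        } else {
            file.seek(start);
            in = new LineReader(file);
        }
        pos = start;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        // Emit only records that *begin* inside this split; the last record may end
        // past `end`, which is exactly the behaviour described above.
        if (pos >= end) {
            return false;
        }
        key.set(pos);
        int bytesRead = in.readLine(value);
        if (bytesRead == 0) {
            return false; // end of file
        }
        pos += bytesRead;
        return true;
    }

    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() {
        return end == start ? 0.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
    }

    @Override
    public void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }
}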
From what I have understood so far, bzip2 files are potentially splittable, and splitting them is possible in Hadoop, but it is still not specifically supported in AWS EMR, at least not out of the box. So if you just run a job with a large bzip2 file, you will get a single mapper in the first step; I recently tried it. Apparently that is also what happens with indexed LZO files, unless you do some black magic. I am not sure there is corresponding black magic for splitting bzip2 files in EMR.

Is there a way to store gzip's dictionary from a file?

I've been doing some research on compression-based text classification, and I'm trying to figure out a way of storing the dictionary built by the encoder (on a training file) so that it can later be used 'statically' on a test file. Is this at all possible using UNIX's gzip utility?
For example, I have been using two 'class' files, sport.txt and atheism.txt, so I want to run compression on both of these files and store the dictionaries they use. Next I want to take a test file (which is unlabelled and could be either atheism or sport) and, by using the prebuilt dictionaries on this test.txt, analyse how well it compresses under each dictionary/model.
Thanks
Deflate encoders, as in gzip and zlib, do not "build" a dictionary. They simply use the previous 32K bytes as a source for potential matches to the string of bytes starting at the current position. The last 32K bytes are called the "dictionary", but the name is perhaps misleading.
You can use zlib to experiment with preset dictionaries. See the deflateSetDictionary() and inflateSetDictionary() functions. In that case, zlib compression is primed with a "dictionary" of 32K bytes that effectively precede the first byte being compressed as a source for matches, but the dictionary itself is not compressed. The priming can only improve the compression of the first 32K bytes. After that, the preset dictionary is too far back to provide matches.
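The same preset-dictionary mechanism is exposed in Java through java.util.zip (which wraps zlib), so one quick way to experiment without writing C is a toy sketch like this; the strings are just placeholder data:
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictionaryDemo {
    public static void main(String[] args) throws Exception {
        byte[] dictionary = "the quick brown fox jumps over the lazy dog".getBytes(StandardCharsets.UTF_8);
        byte[] input = "the quick brown fox was quick and lazy".getBytes(StandardCharsets.UTF_8);

        // Compress with a preset dictionary: it primes the 32K history window.
        Deflater deflater = new Deflater();
        deflater.setDictionary(dictionary);
        deflater.setInput(input);
        deflater.finish();
        byte[] compressed = new byte[256];
        int compressedLen = deflater.deflate(compressed);
        deflater.end();
        System.out.println("compressed to " + compressedLen + " bytes");

        // Decompression must supply the very same dictionary when asked for it.
        Inflater inflater = new Inflater();
        inflater.setInput(compressed, 0, compressedLen);
        byte[] output = new byte[256];
        int n = inflater.inflate(output); // returns 0 and needsDictionary() becomes true
        if (inflater.needsDictionary()) {
            inflater.setDictionary(dictionary);
            n = inflater.inflate(output);
        }
        inflater.end();
        System.out.println(new String(output, 0, n, StandardCharsets.UTF_8));
    }
}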
gzip provides no support for preset dictionaries.

How Can I Use Input .PCAP (Binary) Logs With MapReduce Hadoop

Tcpdump logs are binary files. I want to know which Hadoop FileInputFormat I should use to split the input data into chunks... please help me!
There was a thread on the user list about this:
http://hadoop.markmail.org/search/list:org%2Eapache%2Ehadoop%2Ecore-user+pcap+order:date-forward
Basically, the format is not splittable, as you can't find the start of a record at an arbitrary offset in the file. So you have to do some preprocessing, inserting sync points or something similar. Maybe convert smaller files into SequenceFiles, and then merge the small SequenceFiles?
If you wind up writing something reusable, please consider contributing back to the project.
Write an InputFormat that reads PCAP files, returning something like LongWritable for the key (the nth packet in the file) and PacketWritable as the value (containing the PCAP data). For the InputSplit you can use FileSplit, or MultiFileSplit for better performance, as an individual PCAP file can be read surprisingly quickly.
Unless your blocksize is larger than the size of your pcap files, you will experience lots of network IO...
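A rough sketch of such an InputFormat is below, using BytesWritable in place of a custom PacketWritable. The class names and parsing details are illustrative only; real code would need sturdier error handling (truncated captures, pcap-ng, nanosecond-timestamp variants, and so on):
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class PcapInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // a packet boundary cannot be located at an arbitrary byte offset
    }

    @Override
    public RecordReader<LongWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext ctx) {
        return new PcapRecordReader();
    }

    static class PcapRecordReader extends RecordReader<LongWritable, BytesWritable> {
        private FSDataInputStream in;
        private long fileLength;
        private boolean littleEndian;
        private long packetIndex = -1;
        private final LongWritable key = new LongWritable();
        private final BytesWritable value = new BytesWritable();

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext ctx) throws IOException {
            FileSplit split = (FileSplit) genericSplit;
            FileSystem fs = split.getPath().getFileSystem(ctx.getConfiguration());
            fileLength = fs.getFileStatus(split.getPath()).getLen();
            in = fs.open(split.getPath());
            // 24-byte global header; the 4-byte magic number tells us the byte order.
            int magic = in.readInt();
            littleEndian = (magic == 0xd4c3b2a1); // file was written little-endian
            in.seek(24);
        }

        private int readInt32() throws IOException {
            int v = in.readInt();
            return littleEndian ? Integer.reverseBytes(v) : v;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (in.getPos() + 16 > fileLength) {
                return false; // no room left for another 16-byte per-packet header
            }
            in.seek(in.getPos() + 8);      // skip ts_sec and ts_usec
            int inclLen = readInt32();     // number of packet bytes actually captured
            in.seek(in.getPos() + 4);      // skip orig_len
            byte[] packet = new byte[inclLen];
            in.readFully(packet);
            key.set(++packetIndex);        // key = nth packet in the file
            value.set(packet, 0, inclLen); // value = raw packet bytes
            return true;
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() throws IOException {
            return fileLength == 0 ? 1.0f : (float) in.getPos() / fileLength;
        }

        @Override
        public void close() throws IOException {
            if (in != null) {
                in.close();
            }
        }
    }
}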
We've released a library for PCAP format files recently: https://github.com/RIPE-NCC/hadoop-pcap
