Hadoop SequenceFile vs splittable LZO

We're choosing a file format to store our raw logs; the major requirements are that it be compressed and splittable. Block-compressed SequenceFiles (with whichever codec) and Hadoop-LZO look the most suitable so far.
Which one would be more efficient to process with MapReduce, and easier to deal with overall?

For raw logs, it is recommended to use a container file format like SequenceFile, which supports both compression and splitting. To store the logs in this format, you would choose the timestamp as the key and the logged line as the value. In our team, we use SequenceFiles extensively.
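A minimal sketch of that suggestion, assuming a Hadoop client on the classpath, a reachable fs.defaultFS, and a hypothetical output path /raw-logs/part-0001.seq - it writes a block-compressed SequenceFile with the timestamp as the key and the logged line as the value:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class LogsToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        SnappyCodec codec = new SnappyCodec(); // requires the native Snappy library
        codec.setConf(conf);
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/raw-logs/part-0001.seq")), // hypothetical path
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                // BLOCK compression compresses batches of records together,
                // which keeps the file splittable while giving a decent ratio.
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, codec))) {
            writer.append(new LongWritable(System.currentTimeMillis()),
                          new Text("example logged line"));
        }
    }
}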
For splittable LZO, you need to pre-process the files to generate the index. Without the index, the MapReduce framework will process the entire file as a single split (one mapper) and processing will be inefficient.
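For reference, here is a hedged sketch of generating those indexes programmatically with hadoop-lzo's DistributedLzoIndexer, assuming the hadoop-lzo jar and the native LZO libraries are installed (the same class is normally run on the command line via hadoop jar):

import org.apache.hadoop.util.ToolRunner;
import com.hadoop.compression.lzo.DistributedLzoIndexer; // from the hadoop-lzo project

public class IndexLzoLogs {
    public static void main(String[] args) throws Exception {
        // Launches a small MapReduce job that writes a .lzo.index file next to
        // each .lzo file found under the given (hypothetical) directory.
        int exitCode = ToolRunner.run(new DistributedLzoIndexer(), new String[]{"/raw-logs"});
        System.exit(exitCode);
    }
}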
In "Hadoop The Definitive Guide" book (I suggest you read the section on "Compression"), there is a section recommending the compression format to use. As per the recommendation, following are the choices from most effective to least effective:
Container file formats like SequenceFile, Avro, ORCFiles, Parquet files with a fast compressor like LZO, LZ4 or Snappy
Compression format that supports splitting: bzip2 or splittable LZO
Split the file into chunks and compress each chunk separately using a compression format

Related

Lz4 compression is not splittable

I am using lz4 compression and writing data to a Hive table. This table has 20 files, each 15 GB on HDFS, and every file name ends with .lz4, e.g. part-m-00000.lz4.
When I run select count(1) on this table, it kicks off only 20 mappers, which means lz4 splitting is not taking effect.
It is said that lz4 supports splitting for text files, so I would like to know what I should do, or what additional steps are needed, to enable this.
Assuming you can have some control on how data is being compressed, this codec might be closer to what you need, since it embeds a splittable layer. It's designed for use with Hadoop.
If you can't change the format, and it was compressed as a single stream with no jump table, then I'm afraid there is no good solution. The lz4 CLI will, by default, split data into blocks of 4 MB, but it does not provide any jump table. The jump table is what makes an archive easy to read in random order; without it, it's necessary to stream the data and distribute the blocks in order for later processing.

Hadoop split method

I know, and have read many times, that Hadoop is not aware of what's inside the input file and that the split depends on the InputFormat, but let's be more specific. For example, I read that GZIP is not splittable, so if I have a single gzipped input file of 1 TB, and no node has a disk of that size, what happens? Will the input be split, with Hadoop adding information about the dependencies between one chunk and the others? Another question: if I have a huge .xml file, so basically text, how does the split work, by line or by the configured block size in MB?
BZIP2 is splittable in Hadoop - it provides a very good compression ratio, but its CPU time and performance are not optimal, because compression is very CPU-consuming.
LZO is splittable in Hadoop - leveraging hadoop-lzo you get splittable compressed LZO files. You need external .lzo.index files to be able to process them in parallel. The library provides all the means to generate these indexes in a local or distributed manner.
LZ4 is splittable in Hadoop - leveraging hadoop-4mc you get splittable compressed 4mc files. You don't need any external indexing, and you can generate archives with the provided command-line tool or from Java/C code, inside or outside Hadoop. 4mc makes LZ4 available on Hadoop at any level of the speed/compression-ratio trade-off: from a fast mode reaching 500 MB/s compression speed up to high/ultra modes providing an increased compression ratio, almost comparable with GZIP's.
ZSTD (zstandard) is now splittable as well in hadoop/Spark/Flink by leveraging hadoop-4mc.
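To make these splittability rules concrete, here is a minimal sketch, using only standard Hadoop classes and hypothetical file names, of the check TextInputFormat effectively performs: an input is split only if it is uncompressed or its codec implements SplittableCompressionCodec (as BZip2Codec does and GzipCodec does not):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        for (String name : new String[]{"logs.txt", "logs.txt.gz", "logs.txt.bz2"}) {
            // The codec is resolved from the file extension; null means uncompressed.
            CompressionCodec codec = factory.getCodec(new Path(name));
            boolean splittable = codec == null || codec instanceof SplittableCompressionCodec;
            System.out.println(name + " -> "
                    + (codec == null ? "no codec" : codec.getClass().getSimpleName())
                    + ", splittable: " + splittable);
        }
    }
}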
Please have a look at Elephant Bird to process complex input in your jobs. Anyway, XML is not natively splittable in Elephant Bird or Hadoop, AFAIK.

Pig and Java MR are only allocating one mapper per LZO file

I have a large amount of data in HDFS in LZO format, and I have also indexed the LZO files. When I run a Java MR job, or load and process the LZO files using Pig, I see that only one mapper is being used per LZO file (the job completes without any issues, but slowly). I have the number of mappers configured to 50 in my Hadoop config, but when I process LZO files I see only 10 mappers used (one per LZO file). Is there any other configuration I should turn on?
Software versions:
Hadoop 1.0.4
Pig 0.11
Thanks.
LZO-compressed text files cannot be processed in parallel because they are not splittable (i.e. if you read from an arbitrary point in the file, you can't determine how to decompress the following compressed data), so MapReduce is forced to read such a file serially with a single mapper.
One way to deal with this is to pre-process the LZO text files to create LZO index files, which MapReduce can use to split the text files and process them in parallel.
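Note that the index alone is not enough: the job also has to use an input format that honours it. Below is a hedged sketch, assuming the hadoop-lzo jar is on the job's classpath and the .lzo files already have .lzo.index files, of a driver that uses hadoop-lzo's LzoTextInputFormat so that indexed files are split across several mappers instead of one mapper per file. For Pig, the equivalent is to load the files with an LZO-aware loader (e.g. from the Elephant Bird project) rather than plain PigStorage.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.hadoop.mapreduce.LzoTextInputFormat; // from the hadoop-lzo project

public class IndexedLzoJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "indexed-lzo-job");
        job.setJarByClass(IndexedLzoJob.class);
        // The key change: use LzoTextInputFormat instead of TextInputFormat so
        // the .lzo.index files are honoured when computing input splits.
        job.setInputFormatClass(LzoTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Mapper and reducer omitted; Hadoop's defaults pass records through.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}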
A more effective approach is to convert the LZO text files into a splittable binary format such as Avro, Parquet, or SequenceFile. These formats allow various data compression codecs (note that Snappy is now much more popular than LZO), and can also provide other benefits such as fast serialization/deserialization, column pruning, and bundled metadata.
The book "Hadoop: The Definitive Guide" has a lot of information on this topic.

Storage format in HDFS

How Does HDFS store data?
I want to store huge files in a compressed fashion.
E.g.: I have a 1.5 GB file, with the default replication factor of 3.
It requires (1.5)*3 = 4.5 GB of space.
I believe currently no implicit compression of data takes place.
Is there a technique to compress the file and store it in HDFS to save disk space ?
HDFS stores any file in a number of 'blocks'. The block size is configurable on a per file basis, but has a default value (like 64/128/256 MB)
So given a file of 1.5 GB and a block size of 128 MB, Hadoop would break the file up into ~12 blocks (12 x 128 MB ≈ 1.5 GB). Each block is also replicated a configurable number of times.
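A minimal sketch, assuming fs.defaultFS points at the cluster and a hypothetical file /data/big.log exists, that prints the block layout just described (block size, number of blocks, and where the replicas live):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/big.log")); // hypothetical path
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        System.out.println("block size: " + status.getBlockSize()
                + " bytes, blocks: " + blocks.length);
        for (BlockLocation block : blocks) {
            // Each block is replicated onto several DataNodes (hosts).
            System.out.println("offset " + block.getOffset() + ", length " + block.getLength()
                    + ", hosts: " + String.join(",", block.getHosts()));
        }
    }
}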
If your data compresses well (like text files) then you can compress the files and store the compressed files in HDFS - the same applies as above, so if the 1.5GB file compresses to 500MB, then this would be stored as 4 blocks.
However, one thing to consider when using compression is whether the compression method supports splitting the file - that is, can you randomly seek to a position in the file and recover the compressed stream (GZip, for example, does not support splitting; BZip2 does).
Even if the method doesn't support splitting, hadoop will still store the file in a number of blocks, but you'll lose some benefit of 'data locality' as the blocks will most probably be spread around your cluster.
In your MapReduce code, Hadoop has a number of compression codecs installed by default and will automatically recognize certain file extensions (.gz for GZip files, for example), so you don't have to worry about whether the input/output needs to be compressed or decompressed.
Hope this makes sense
EDIT Some additional info in response to comments:
When writing to HDFS as output from a Map Reduce job, see the API for FileOutputFormat, in particular the following methods:
setCompressOutput(Job, boolean)
setOutputCompressorClass(Job, Class)
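For example, a minimal sketch using those two methods to gzip the final job output (new mapreduce API; mapper, reducer, and input/output paths omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "compressed-output");
        // Turn on output compression and pick the codec for the final output files.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        // ... set mapper, reducer, input and output paths as usual, then submit.
    }
}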
When uploading files to HDFS, yes they should be pre-compressed, and with the associated file extension for that compression type (out of the box, hadoop supports gzip with the .gz extension, so file.txt.gz would denote a gzipped file)
Some time ago I tried to summarize that in a blog post here.
Essentially this is a question of data splittability. A file is divided into blocks, which are the elementary units of replication, and the NameNode is responsible for keeping track of all the blocks belonging to a file. When choosing compression, it is essential that each block can be processed autonomously - not all codecs are splittable. If the format + codec is not splittable, then to decompress the data it needs to be in one place, which has a big impact on parallelism in MapReduce: the job essentially runs in a single slot.
Hope that helps.
Have a look at the Hadoop Summit presentation, especially Slide 6 and Slide 7.
If the DFS block size is 128 MB, a 1.5 GB file occupies ~12 blocks; with a replication factor of 3, that means ~36 block replicas (4.5 GB of raw storage).
Of the common codecs, only bzip2 is natively splittable (LZO can be made splittable with an index). With a non-splittable format, the blocks are still spread across DataNodes, but a single mapper has to read the whole file, so data locality is largely lost.
Have a look at the compression algorithm types, class names, and codecs.
@Chris White's answer provides information on how to enable compression when writing map output.
The answer to this question is to first understand the file formats available in Hadoop today. There is now a choice of formats within HDFS that manage layout and compression for you, as an alternative to explicit encoding and splitting with LZO or BZIP2. Many formats today support block compression, columnar layout, and other features.
A storage format defines how information is stored, and is usually indicated by the file extension. For example, we know images can be stored in several formats - PNG, JPG, GIF, etc. All of these formats can store the same image, but each has specific storage characteristics.
In the Hadoop filesystem you have all the traditional storage formats available (you can store PNG and JPG images on HDFS if you like), but you also have some Hadoop-focused file formats to use for structured and unstructured data.
Why is it important to know these formats?
A huge performance bottleneck for HDFS-enabled applications like MapReduce, Hive, HBase, and Spark is the time it takes to find the relevant data in a particular location and the time it takes to write it back to another location. These issues are accentuated when you manage large datasets. The Hadoop file formats have evolved to ease these issues across a number of use cases.
Choosing an appropriate file format can have some significant benefits:
Optimum read time
Optimum write time
Splitting or partitioning of files (so you don't need to read the whole file, just a part of it)
Schema adaptation (allowing the fields of a dataset to change)
Compression support (without sacrificing the features above)
Some file formats are designed for general use, others are designed for more specific use cases (like powering a database), and some are designed with specific data characteristics in mind. So there really is quite a lot of choice when storing data in Hadoop, and one should know how to store data in HDFS optimally. Currently my go-to storage is the ORC format.
Check whether your big data components (Spark, Hive, HBase, etc.) support these formats and make the decision accordingly. For example, I am currently ingesting data into Hive and converting it into the ORC format, which works for me in terms of compression and performance.
Some common storage formats for Hadoop include:
Plain text storage (e.g. CSV, TSV, and other delimited files)
Data is laid out in lines, with each line being a record. Lines are terminated by a newline character (\n) in the typical UNIX world. Text files are inherently splittable, but if you want to compress them you'll have to use a file-level compression codec that supports splitting, such as BZIP2. This is not efficient and will require a bit of work when performing MapReduce tasks.
Sequence Files
Originally designed for MapReduce, so they are very easy to integrate with Hadoop MapReduce processes. They encode a key and a value for each record, and nothing more, stored in a binary format that is smaller than a text-based format; even so, the key and value are not encoded in any particular way. One benefit of sequence files is that they support block-level compression, so you can compress the contents of the file while also maintaining the ability to split the file into segments for multiple map tasks. They are still not as efficient for analytics, though, as formats like Parquet and ORC.
Avro
The format encodes the schema of its contents directly in the file, which allows you to store complex objects natively. Avro is both a file format and a serialization/deserialization framework. With regular old sequence files you can store complex objects, but you have to manage the process yourself. Avro also supports block-level compression.
Parquet
My favorite, and a hot format these days. It is a columnar storage structure: data is encoded and written to disk column by column, so datasets are partitioned both horizontally and vertically. One huge benefit of column-oriented file formats is that data in the same column tends to be compressed together, which can yield massive storage optimizations (as data in the same column tends to be similar). Try using this if your processing can make optimal use of columnar storage; you can look up the advantages of columnar storage for details.
If you’re chopping and cutting up datasets regularly then these formats can be very beneficial to the speed of your application, but frankly if you have an application that usually needs entire rows of data then the columnar formats may actually be a detriment to performance due to the increased network activity required.
ORC
ORC stands for Optimized Row Columnar, which means it can store data in a more optimized way than the other file formats. ORC can reduce the size of the original data by up to 75% (e.g. a 100 GB file may shrink to 25 GB). As a result, the speed of data processing also increases. ORC shows better performance than the Text, Sequence, and RC file formats.
An ORC file stores row data in groups called stripes, along with a file footer. The ORC format improves performance when Hive is processing the data.
It is similar to Parquet but uses a different encoding technique. That is beyond the scope of this thread, but you can look up the differences.
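For completeness, here is a hedged sketch of writing an ORC file directly from Java, assuming the orc-core and hive-storage-api jars are on the classpath and using a hypothetical output path /warehouse/logs.orc; many setups simply create a Hive table STORED AS ORC and let Hive write the files instead:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcLogWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Two columns: a timestamp and the raw logged line.
        TypeDescription schema = TypeDescription.fromString("struct<ts:bigint,line:string>");
        Writer writer = OrcFile.createWriter(new Path("/warehouse/logs.orc"), // hypothetical path
                OrcFile.writerOptions(conf).setSchema(schema));
        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector ts = (LongColumnVector) batch.cols[0];
        BytesColumnVector line = (BytesColumnVector) batch.cols[1];
        int row = batch.size++;
        ts.vector[row] = System.currentTimeMillis();
        line.setVal(row, "example logged line".getBytes(StandardCharsets.UTF_8));
        writer.addRowBatch(batch); // rows are buffered into stripes internally
        writer.close();            // flushes the last stripe and writes the file footer
    }
}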

Does HDFS encrypt or compress the data while storing?

When I put a file into HDFS, for example
$ ./bin/hadoop dfs -put /source/file input
Is the file compressed while storing?
Is the file encrypted while storing? Is there a config setting that we can specify to change whether it is encrypted or not?
There is no implicit compression in HDFS. In other words, if you want your data to be compressed, you have to write it that way. If you plan on writing map reduce jobs to process the compressed data, you'll want to use a splittable compression format.
Hadoop can process compressed files and here is a nice article on it. Also, the intermediate and the final MR output can be compressed.
There is a JIRA on 'Transparent compression in HDFS', but I don't see much progress on it.
I don't think there is a separate API for encryption, though you could use a compression codec for encryption/decryption as well. Here are more details about encryption and HDFS.
I very recently set compression up on a cluster. The other posts have helpful links, but the actual code you will want to get LZO compression working is here: https://github.com/kevinweil/hadoop-lzo.
You can, out of the box, use GZIP compression, BZIP2 compression, and Unix Compress. Just upload a file in one of those formats. When using the file as an input to a job, you will need to specify that the file is compressed as well as the proper CODEC. Here is an example for LZO compression.
-jobconf mapred.output.compress=true
-jobconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
Why am I going on and on about LZO compression? The Cloudera article referenced by Praveen goes into this. LZO compression is a splittable compression format (unlike GZIP, for example). This means that a single file can be split into chunks to be handed off to mappers. Without a splittable compressed file, a single mapper will receive the entire file. This may cause you to have too few mappers and to move too much data around your network.
BZIP2 is also splittable. It also has a higher compression ratio than LZO. However, it is very slow. LZO has a worse compression ratio than GZIP, but it is optimized to be extremely fast. In fact, it can even increase the performance of your job by minimizing disk I/O.
It takes a bit of work to set up, and is a bit of a pain to use, but it is worth it (transparent encryption would be awesome). Once again, the steps are:
Install LZO and LZOP (command-line utility)
Install hadoop-lzo
Upload a file compressed with LZOP.
Index the file as described by hadoop-lzo wiki (the index allows it to be split).
Run your job (with the proper parameters mapred.output.compress and mapred.output.compression.codec)
