HBase: How does reading of compressed data work?

I understand that if we enable compression for a column family, the values stored in its HFiles are compressed with the algorithm we specify, and that compression happens at the block level.
But how does decompression work when a data block is read back? For example, if we use Snappy, how does the read happen? How does HBase figure out how to read a specific key-value from the compressed data, and how is this done in real time?

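The data is decompressed in memory 'on the fly' and the value is then read in uncompressed form. HBase does not search inside compressed bytes: the HFile's block index tells the reader which block may contain the key, that block alone is read and decompressed, and the key-value is located in the decompressed block. This is fast because spending CPU on decompression is usually cheaper than reading more bytes from disk.
The decompression happens as each block is read from disk.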
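A minimal sketch of the idea, not HBase's actual HFile code: rows are grouped into small blocks, each block is compressed on its own, and an index of first keys selects the single block to decompress. zlib stands in for Snappy (which is not in the Python standard library), and BLOCK_ROWS, write_blocks and get are hypothetical names chosen for the illustration.

```python
import bisect
import zlib

BLOCK_ROWS = 3  # hypothetical tiny block; real HFile blocks default to about 64 KB


def write_blocks(sorted_rows):
    """Compress sorted (key, value) rows in independent blocks and build an index."""
    blocks, first_keys = [], []
    for i in range(0, len(sorted_rows), BLOCK_ROWS):
        chunk = sorted_rows[i:i + BLOCK_ROWS]
        raw = "\n".join(f"{k}\t{v}" for k, v in chunk).encode("utf-8")
        blocks.append(zlib.compress(raw))  # each block is compressed on its own
        first_keys.append(chunk[0][0])     # index entry: first key in the block
    return blocks, first_keys


def get(blocks, first_keys, key):
    """Pick the one block that may hold `key`, decompress only it, then scan it."""
    pos = bisect.bisect_right(first_keys, key) - 1
    if pos < 0:
        return None  # key sorts before the first block
    raw = zlib.decompress(blocks[pos]).decode("utf-8")
    for line in raw.split("\n"):
        k, v = line.split("\t")
        if k == key:
            return v
    return None


rows = sorted((f"row{i:03d}", f"value-{i}") for i in range(10))
blocks, first_keys = write_blocks(rows)
print(get(blocks, first_keys, "row007"))  # only the third block is decompressed
```

The lookup cost is one index search plus one block decompression, regardless of how many blocks the file holds.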

Related

Lz4 compression is not splittable

I am using lz4 compression to write data to a Hive table. The table has 20 files of 15 GB each on HDFS, and every file name ends with .lz4, e.g. part-m-00000.lz4.
When I run select count(1) on this table, it kicks off only 20 mappers, which means the lz4 files are not being split.
It is said that lz4 supports splitting for text files, so what should I do, or what additional steps are needed, to enable this?
Assuming you can have some control on how data is being compressed, this codec might be closer to what you need, since it embeds a splittable layer. It's designed for use with Hadoop.
If you can't change the format, and it was compressed as a single stream with no jump-table, then I'm afraid there is no good solution. lz4 CLI will, by default, split data into blocks of 4 MB, but does not provide any jump table. The jump table is what makes an archive easy to read in random order. Without it, it's necessary to stream the data, and distribute the blocks in order for later processing.
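To make the jump-table idea concrete, here is a small sketch of a container that stores independently compressed blocks behind a table of byte offsets. It is a made-up layout for illustration only, not the LZ4 frame format or the Hadoop codec mentioned above, and zlib stands in for lz4 because lz4 is not in the Python standard library.

```python
import struct
import zlib


def pack(data: bytes, block_size: int) -> bytes:
    """Compress fixed-size blocks independently and prepend a table of offsets."""
    blocks = [zlib.compress(data[i:i + block_size])
              for i in range(0, len(data), block_size)]
    offsets, pos = [], 0
    for block in blocks:
        offsets.append(pos)
        pos += len(block)
    offsets.append(pos)  # end sentinel: block i spans offsets[i]..offsets[i + 1]
    header = struct.pack("<I", len(blocks))
    header += struct.pack(f"<{len(offsets)}Q", *offsets)
    return header + b"".join(blocks)


def read_block(container: bytes, index: int) -> bytes:
    """Decompress block `index` without touching any other block."""
    (count,) = struct.unpack_from("<I", container, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    offsets = struct.unpack_from(f"<{count + 1}Q", container, 4)
    body_start = 4 + 8 * (count + 1)  # 4-byte count, then 8 bytes per offset
    start, end = offsets[index], offsets[index + 1]
    return zlib.decompress(container[body_start + start:body_start + end])


data = bytes(range(256)) * 40           # 10240 bytes of sample input
container = pack(data, block_size=1024)
print(read_block(container, 3) == data[3072:4096])
```

With the table, a worker can be handed "blocks 3 to 5" and start immediately. Without it, the only way to find where block 3 begins is to walk the stream from the start.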

Hadoop Mapreduce with compressed/encrypted files (file of large size)

I have an HDFS cluster that stores large CSV files in compressed/encrypted form, as selected by the end user.
For compression and encryption, I have created a wrapper input stream that feeds data to HDFS in compressed/encrypted form. The compression format is GZ and the encryption format is AES256.
A 4.4 GB CSV file is compressed to 40 MB on HDFS.
Now I have a MapReduce job (Java) that processes multiple compressed files together. The job uses FileInputFormat.
When splits are calculated, the 4.4 GB file (40 MB compressed) is allocated only 1 mapper, with split start 0 and split length 40 MB.
How do I process such a compressed file of larger size? One option I found is to implement a custom RecordReader and use the wrapper input stream to read the uncompressed data and process it.
Since I don't have the actual length of the file, I don't know how much data to read from the input stream.
If I read up to the end of the InputStream, how do I handle the case where 2 mappers are allocated to the same file, as explained below?
If the compressed file size is larger than 64 MB, then 2 mappers will be allocated to the same file.
How should I handle this scenario?
Hadoop Version - 2.7.1
The compression format should be chosen keeping in mind whether the file will be processed by MapReduce, because if the compression format is splittable, MapReduce works normally.
However, if it is not splittable (in your case gzip is not splittable, and MapReduce knows it), then the entire file is processed by one mapper. This serves the purpose but hurts data locality, since a single mapper does all the work and has to fetch the file's other blocks over the network.
From Hadoop: The Definitive Guide:
"For large files, you should not use a compression format that does not support splitting on the whole file, because you lose locality and make MapReduce applications very inefficient".
You can refer to the Compression section of the Hadoop I/O chapter for more information.
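The effect on the number of mappers can be sketched in a few lines. This is a simplified model of split planning, assuming one mapper per split. It ignores details of the real FileInputFormat (such as the slack allowed on the last split), and plan_splits is a hypothetical helper, not a Hadoop API.

```python
MB = 1024 * 1024


def plan_splits(file_size: int, split_size: int, splittable: bool):
    """Return (start, length) pairs, one per mapper, for a single input file."""
    if file_size == 0:
        return []
    if not splittable:
        # A non-splittable codec forces one split covering the whole file,
        # even when the file spans several HDFS blocks.
        return [(0, file_size)]
    return [(start, min(split_size, file_size - start))
            for start in range(0, file_size, split_size)]


# The 40 MB gzip file from the question: one mapper, start 0, length 40 MB.
print(plan_splits(40 * MB, 64 * MB, splittable=False))

# A 200 MB gzip file still gets one mapper, not two or more.
print(len(plan_splits(200 * MB, 64 * MB, splittable=False)))

# The same 200 MB in a splittable format yields four splits.
print(len(plan_splits(200 * MB, 64 * MB, splittable=True)))
```

So the scenario of two mappers sharing one gzip file does not arise: a gzip file larger than the split size still goes to a single mapper.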

Compression in HBase

I am using HBase to store a lot of sensor data.
I have tried storing my sensor data in a txt file. A 20 MB file shrinks to 1 MB on disk when I compress it.
My question is: Does HBase itself do compression automatically when storing the data to disks?
Thanks
You can use LZO, gzip or Snappy for HBase compression. You will need to set up LZO/Snappy yourself if you wish to use them for HBase compression (gzip is included).
Normally LZO is faster than gzip, though gzip's compression ratio is normally better. Snappy is robust and very fast, but its compression ratios are normally worse.
When creating a table you can specify the compression library. If compression is used, HFiles are compressed when written to disk and need to be decompressed when read.
Hope it helps.
You can also alter your table to add compression later. Your data will then actually be compressed at the next compaction (as ali said, because a new HFile is written to disk).
As far as I understand, the compression algorithm is applied at the block level, not to the whole HFile. That means that when reading data, HBase doesn't have to decompress a multi-GB HFile but only a data block of a few KB.

Storage format in HDFS

How does HDFS store data?
I want to store huge files in a compressed fashion.
E.g.: I have a 1.5 GB file, with the default replication factor of 3.
It requires 1.5 * 3 = 4.5 GB of space.
I believe currently no implicit compression of data takes place.
Is there a technique to compress the file and store it in HDFS to save disk space?
HDFS stores any file in a number of 'blocks'. The block size is configurable on a per-file basis, but has a default value (like 64/128/256 MB).
So given a file of 1.5 GB and a block size of 128 MB, Hadoop would break up the file into ~12 blocks (12 x 128 MB ~= 1.5 GB). Each block is also replicated a configurable number of times.
If your data compresses well (like text files), then you can compress the files and store the compressed files in HDFS. The same applies as above, so if the 1.5 GB file compresses to 500 MB, it would be stored as 4 blocks.
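The block and replication arithmetic above can be checked with a short calculation. This sketch assumes binary units (1 GB = 1024 MB) and a 128 MB block size. hdfs_blocks and raw_storage are hypothetical helper names.

```python
MB = 1024 * 1024
GB = 1024 * MB


def hdfs_blocks(file_bytes: int, block_bytes: int = 128 * MB) -> int:
    """Number of HDFS blocks needed for a file (the last block may be partial)."""
    return (file_bytes + block_bytes - 1) // block_bytes  # integer ceiling


def raw_storage(file_bytes: int, replication: int = 3) -> int:
    """Bytes consumed across the cluster once every block is replicated."""
    return file_bytes * replication


original = 3 * GB // 2   # the 1.5 GB file
compressed = 500 * MB    # the same file after compression

print(hdfs_blocks(original), raw_storage(original) / GB)      # 12 blocks, 4.5 GB
print(hdfs_blocks(compressed), raw_storage(compressed) / MB)  # 4 blocks, 1500 MB
```

Compression reduces both the block count and the replicated footprint, since every replica stores the compressed bytes.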
However, one thing to consider when using compression is whether the compression method supports splitting the file, that is, whether you can seek to an arbitrary position in the file and resume decompressing the stream from there (GZip, for example, does not support splitting; BZip2 does).
Even if the method doesn't support splitting, Hadoop will still store the file in a number of blocks, but you'll lose some benefit of 'data locality' as the blocks will most probably be spread around your cluster.
Hadoop has a number of compression codecs installed by default and will automatically recognize certain file extensions in your MapReduce code (.gz for GZip files, for example), so you don't have to worry about whether the input / output needs to be compressed.
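The extension-based recognition can be pictured as a lookup table from file suffix to codec. The sketch below is a hand-written illustration in Python, not Hadoop's own lookup code. The class names and default suffixes are those of the stock codecs in org.apache.hadoop.io.compress as I understand them, and the splittable flag refers to plain compressed files without any extra index.

```python
import os

# suffix -> (codec class, splittable as a plain compressed file?)
CODECS = {
    ".gz": ("org.apache.hadoop.io.compress.GzipCodec", False),
    ".bz2": ("org.apache.hadoop.io.compress.BZip2Codec", True),
    ".deflate": ("org.apache.hadoop.io.compress.DefaultCodec", False),
    ".snappy": ("org.apache.hadoop.io.compress.SnappyCodec", False),
    ".lz4": ("org.apache.hadoop.io.compress.Lz4Codec", False),
}


def codec_for(path: str):
    """Return (codec class, splittable) for a path, or None for uncompressed input."""
    suffix = os.path.splitext(path)[1]
    return CODECS.get(suffix)


for name in ("file.txt.gz", "part-m-00000.lz4", "logs.bz2", "plain.csv"):
    print(name, codec_for(name))
```

A file whose suffix is not in the table is read as uncompressed data, which is why uploaded files need the extension that matches their compression type.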
Hope this makes sense
EDIT Some additional info in response to comments:
When writing to HDFS as output from a Map Reduce job, see the API for FileOutputFormat, in particular the following methods:
setCompressOutput(Job, boolean)
setOutputCompressorClass(Job, Class)
When uploading files to HDFS, yes they should be pre-compressed, and with the associated file extension for that compression type (out of the box, hadoop supports gzip with the .gz extension, so file.txt.gz would denote a gzipped file)
Some time ago I tried to summarize that in a blog post here.
Essentially this is a question of data splittability, as a file is divided into blocks, which are the elementary units of replication. The NameNode is responsible for keeping track of all the blocks belonging to one file. When choosing compression it is essential that each block can be processed autonomously, and not all codecs are splittable. If the format + codec combination is not splittable, the whole file has to be in one place to be decompressed, which has a big impact on parallelism in MapReduce: the job essentially runs in a single slot.
Hope that helps.
Have a look at the presentation @ Hadoop_Summit, especially Slide 6 and Slide 7.
If the DFS block size is 128 MB, then for 4.5 GB of storage (including the replication factor of 3) you need 35.15 (~36) blocks.
Of the codecs Hadoop ships with, only bzip2 is splittable. Files in the other formats are still stored as ordinary HDFS blocks spread across DataNodes, but a single mapper has to read all the blocks of a file.
Have a look at algorithm types and class names and codecs
@Chris White's answer provides information on how to enable compression when writing MapReduce output.
To answer this question, first understand the file formats available in Hadoop today. HDFS applications can now choose among file formats that manage layout and compression themselves, as an alternative to explicitly encoding and splitting files with LZO or BZIP. Many formats today support block compression and columnar compression, along with other features.
A storage format defines how information is stored. It is usually indicated by the file extension. For example, images can be stored in several formats, such as PNG, JPG and GIF. All these formats can store the same image, but each has specific storage characteristics.
In the Hadoop filesystem you have all the traditional storage formats available (you can store PNG and JPG images on HDFS if you like), but you also have some Hadoop-focused file formats for structured and unstructured data.
Why is it important to know these formats
A huge bottleneck for HDFS-enabled applications like MapReduce, Hive, HBase, and Spark is the time it takes to find relevant data in a particular location and the time it takes to write the data back to another location. These issues are accentuated when you manage large datasets. The Hadoop file formats have evolved to ease these issues across a number of use cases.
Choosing an appropriate file format can have some significant benefits:
Optimum read time
Optimum write time
Splitting or partitioning of files (so you don’t need to read the whole file, just a part of it)
Schema adaptation (allowing field changes to a dataset)
Compression support (without sacrificing these features)
Some file formats are designed for general use, others are designed for more specific use cases (like powering a database), and some are designed with specific data characteristics in mind. So there really is quite a lot of choice when storing data in Hadoop, and one should know how to store data optimally in HDFS. Currently my go-to storage format is ORC.
Check whether your Big Data components (Spark, Hive, HBase etc.) support these formats and make the decision accordingly. For example, I am currently ingesting data into Hive and converting it into ORC format, which works for me in terms of compression and performance.
Some common storage formats for Hadoop include:
Plain text storage (eg, CSV, TSV files, Delimited file etc)
Data is laid out in lines, with each line being a record. Lines are terminated by a newline character \n in the typical UNIX world. Text files are inherently splittable, but if you want to compress them you’ll have to use a file-level compression codec that supports splitting, such as BZIP2. This is not efficient and will require a bit of work when performing MapReduce tasks.
Sequence Files
Originally designed for MapReduce, and therefore very easy to integrate with Hadoop MapReduce processes. They encode a key and a value for each record and nothing more. Records are stored in a binary format that is smaller than a text-based format, but even so the format does not impose any encoding on the key and value themselves. One benefit of sequence files is that they support block-level compression, so you can compress the contents of the file while also maintaining the ability to split the file into segments for multiple map tasks. They are still not as efficient as Parquet and ORC, though.
Avro
The format encodes the schema of its contents directly in the file, which allows you to store complex objects natively. It is a file format plus a serialization and deserialization framework. With regular old sequence files you can store complex objects, but you have to manage the process yourself. It also supports block-level compression.
Parquet
My favorite, and a hot format these days. It is a columnar file format: data is organized by column as it is encoded and written to disk, so datasets are partitioned both horizontally and vertically. One huge benefit of column-oriented file formats is that data in the same column tends to be compressed together, which can yield some massive storage optimizations (as data in the same column tends to be similar). Try using this if your processing can make good use of columnar storage. You can refer to advantages of columnar storages.
If you’re chopping and cutting up datasets regularly then these formats can be very beneficial to the speed of your application, but frankly if you have an application that usually needs entire rows of data then the columnar formats may actually be a detriment to performance due to the increased network activity required.
ORC
ORC stands for Optimized Row Columnar, which means it can store data in a more optimized way than the other file formats. ORC can reduce the size of the original data by up to 75% (e.g. a 100 GB file becomes 25 GB). As a result the speed of data processing also increases. ORC shows better performance than the Text, Sequence and RC file formats.
An ORC file contains row data in groups called stripes, along with a file footer. The ORC format improves performance when Hive is processing the data.
It is similar to Parquet but uses a different encoding technique. The differences are out of scope for this thread, but you can look them up on Google.

Does HDFS encrypt or compress the data while storing?

When I put a file into HDFS, for example
$ ./bin/hadoop dfs -put /source/file input
Is the file compressed while storing?
Is the file encrypted while storing? Is there a config setting that we can specify to change whether it is encrypted or not?
There is no implicit compression in HDFS. In other words, if you want your data to be compressed, you have to write it that way. If you plan on writing map reduce jobs to process the compressed data, you'll want to use a splittable compression format.
Hadoop can process compressed files and here is a nice article on it. Also, the intermediate and the final MR output can be compressed.
There is a JIRA on 'Transparent compression in HDFS', but I don't see much progress on it.
I don't think there is a separate API for encryption, though you can also use a compression codec for encryption/decryption. Here are more details about encryption and HDFS.
I very recently set compression up on a cluster. The other posts have helpful links, but the actual code you will want to get LZO compression working is here: https://github.com/kevinweil/hadoop-lzo.
You can, out of the box, use GZIP compression, BZIP2 compression, and Unix Compress. Just upload a file in one of those formats. When using the file as an input to a job, you will need to specify that the file is compressed as well as the proper CODEC. Here is an example for LZO compression.
-jobconf mapred.output.compress=true
-jobconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
Why am I going on and on about LZO compression? The Cloudera article referenced by Praveen goes into this. LZO compression is a splittable compression (unlike GZIP, for example). This means that a single file can be split into chunks to be handed off to mappers. Without a splittable compressed file, a single mapper will receive the entire file. This may cause you to have too few mappers and to move too much data around your network.
BZIP2 is also splittable. It also has higher compression than LZO. However, it is very slow. LZO has a worse compression ratio than GZIP. However it is optimized to be extremely fast. In fact, it can even increase the performance of your job by minimizing disk I/O.
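The ratio-versus-speed trade-off is easy to measure on your own data. The sketch below is only a rough stand-in: LZO is not in the Python standard library, so zlib at level 1 plays the role of a fast codec, zlib at level 9 approximates GZIP, and bz2 is BZIP2. The sample data and the measure helper are made up for the illustration, and the numbers will vary with the data and the machine.

```python
import bz2
import time
import zlib


def measure(name, compress, decompress, data):
    """Compress once, verify the round trip, and report size and elapsed time."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    assert decompress(packed) == data  # sanity check: lossless round trip
    return name, len(packed), elapsed


# Synthetic CSV-like sensor rows; substitute a sample of your real data.
sample = b"".join(b"sensor-%d,%d,OK\n" % (i % 50, i) for i in range(20000))

results = [
    measure("zlib level 1", lambda d: zlib.compress(d, 1), zlib.decompress, sample),
    measure("zlib level 9", lambda d: zlib.compress(d, 9), zlib.decompress, sample),
    measure("bz2 level 9", lambda d: bz2.compress(d, 9), bz2.decompress, sample),
]

print(f"original      {len(sample):8d} bytes")
for name, size, seconds in results:
    print(f"{name:13s} {size:8d} bytes  {seconds * 1000:7.2f} ms")
```

Running this on a representative sample before choosing a codec shows how much extra CPU time a better ratio actually costs for your workload.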
It takes a bit of work to set up, and is a bit of a pain to use, but it is worth it (transparent encryption would be awesome). Once again, the steps are:
Install LZO and LZOP (command-line utility)
Install hadoop-lzo
Upload a file compressed with LZOP.
Index the file as described by hadoop-lzo wiki (the index allows it to be split).
Run your job (with the proper parameters mapred.output.compress and mapred.output.compression.codec).
