When I put a file into HDFS, for example
$ ./bin/hadoop dfs -put /source/file input
Is the file compressed while storing?
Is the file encrypted while storing? Is there a config setting that we can specify to change whether it is encrypted or not?
There is no implicit compression in HDFS. In other words, if you want your data to be compressed, you have to write it that way. If you plan on writing map reduce jobs to process the compressed data, you'll want to use a splittable compression format.
Hadoop can process compressed files and here is a nice article on it. Also, the intermediate and the final MR output can be compressed.
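For the MR side, here is a minimal sketch (new MapReduce API with Hadoop 2.x property names; the job name, output path, and choice of GzipCodec are placeholders) showing how both the intermediate map output and the final job output can be compressed:

// Sketch: compress both the intermediate map output and the final job output.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress the intermediate map output that gets shuffled to the reducers.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                GzipCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-output-example");
        job.setJarByClass(CompressedOutputJob.class);
        // ... set mapper/reducer/input format here ...

        // Compress the final job output.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}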
There is a JIRA on 'Transparent compression in HDFS', but I don't see much progress on it.
I don't think there is a separate API for encryption, though you can use a compression codec for encryption/decryption as well. Here are more details about encryption and HDFS.
I very recently set compression up on a cluster. The other posts have helpful links, but the actual code you will want to get LZO compression working is here: https://github.com/kevinweil/hadoop-lzo.
You can, out of the box, use GZIP compression, BZIP2 compression, and Unix Compress. Just upload a file in one of those formats. When using the file as an input to a job, you will need to specify that the file is compressed as well as the proper CODEC. Here is an example for LZO compression.
-jobconf mapred.output.compress=true
-jobconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
Why am I going on and on about LZO compression? The Cloudera article referenced by Praveen goes into this. LZO compression is a splittable compression format (unlike GZIP, for example). This means that a single file can be split into chunks to be handed off to a mapper. Without a splittable compressed file, a single mapper will receive the entire file. This may cause you to have too few mappers and to move too much data around your network.
BZIP2 is also splittable. It also has higher compression than LZO. However, it is very slow. LZO has a worse compression ratio than GZIP. However, it is optimized to be extremely fast. In fact, it can even increase the performance of your job by minimizing disk I/O.
It takes a bit of work to set up, and is a bit of a pain to use, but it is worth it (transparent compression would be awesome). Once again, the steps are:
Install LZO and LZOP (command-line utility)
Install hadoop-lzo
Upload a file compressed with LZOP.
Index the file as described in the hadoop-lzo wiki (the index allows it to be split).
Run your job (with the proper parameters mapred.output.compress and mapred.output.compression.codec); see the sketch below.
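For step 4, the hadoop-lzo wiki runs the indexer with something like hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer big_file.lzo (there is also a non-distributed LzoIndexer). For step 5, here is a hedged sketch of a job driver that reads indexed LZO input and writes LZOP-compressed output; it assumes hadoop-lzo is on the classpath, and the class names follow the kevinweil/hadoop-lzo project so they may differ between versions:

// Sketch of a driver for indexed-LZO input with LZOP-compressed output.
import com.hadoop.compression.lzo.LzopCodec;
import com.hadoop.mapreduce.LzoTextInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LzoJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "lzo-example");
        job.setJarByClass(LzoJobDriver.class);

        // Use the LZO-aware input format so indexed .lzo files are split.
        job.setInputFormatClass(LzoTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Write LZOP-compressed output (the equivalent of the
        // mapred.output.compress / mapred.output.compression.codec parameters).
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, LzopCodec.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // ... set mapper/reducer classes here ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}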
I have a 10 GB CSV file and I want to process it in Hadoop MapReduce.
I have a 15-node (DataNode) cluster and I want to maximize the throughput.
Which compression format should I use? Or will an uncompressed text file always give me better results than a compressed text file? Please explain the reason.
I used an uncompressed file and it gave me better results than Snappy. Why is that?
The problem with Snappy compression is that it is not splittable, so Hadoop can't divide the input file into chunks and run several mappers over it. So most likely your 10 GB file is processed by a single mapper (check it in the application history UI). Since Hadoop stores big files in separate blocks on different machines, some parts of this file are not even located on the mapper's machine and have to be transferred over the network. That seems to be the main reason why the Snappy-compressed file is processed more slowly than plain text.
To avoid the problem, you can use bzip2 compression or divide the file into chunks manually and compress each part with Snappy.
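If you go the manual route, a rough sketch could look like the following. It assumes Hadoop's built-in SnappyCodec and the native Snappy library are available; the chunk size and the input/output paths (args[0]/args[1]) are placeholders:

// Split a large text file into line-based chunks and Snappy-compress each one,
// so that every chunk can later feed its own mapper.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SplitAndSnappy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SnappyCodec codec = ReflectionUtils.newInstance(SnappyCodec.class, conf);

        long linesPerChunk = 5_000_000L; // tune so chunks land near the HDFS block size
        int chunk = 0;
        long lines = 0;
        BufferedWriter out = null;

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path(args[0]))))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (out == null || lines == linesPerChunk) {
                    if (out != null) out.close();
                    // Start a new compressed part file, e.g. part-0.snappy, part-1.snappy, ...
                    Path part = new Path(args[1], "part-" + chunk++ + codec.getDefaultExtension());
                    CompressionOutputStream cos = codec.createOutputStream(fs.create(part));
                    out = new BufferedWriter(new OutputStreamWriter(cos));
                    lines = 0;
                }
                out.write(line);
                out.newLine();
                lines++;
            }
        } finally {
            if (out != null) out.close();
        }
    }
}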
We're choosing the file format to store our raw logs; the major requirements are that it be compressed and splittable. Block-compressed SequenceFiles (with whichever codec) and Hadoop-LZO look the most suitable so far.
Which one would be more efficient to process with MapReduce, and easier to deal with overall?
For raw logs, it is recommended to use a container file format like SequenceFile, which supports both compression and splitting. For storing the logs in this format, you would choose the timestamp as the key and the logged line as the value. In our team, we use SequenceFiles extensively.
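As a minimal sketch (Hadoop 2.x SequenceFile API; the output path, the choice of DefaultCodec, and the hard-coded log line are placeholders, and in practice you would parse the timestamp out of each line):

// Write log lines into a block-compressed SequenceFile: timestamp key, line value.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class LogToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(args[0])),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK, new DefaultCodec()))) {
            // Placeholder record: real code would parse the timestamp from the log line.
            long timestamp = System.currentTimeMillis();
            writer.append(new LongWritable(timestamp), new Text("GET /index.html 200"));
        }
    }
}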
For splittable LZO, you need to pre-process the files to generate the index. Without the index, the MapReduce framework will process the entire file as a single split (one mapper) and processing will be inefficient.
In "Hadoop The Definitive Guide" book (I suggest you read the section on "Compression"), there is a section recommending the compression format to use. As per the recommendation, following are the choices from most effective to least effective:
Container file formats like SequenceFile, Avro, ORCFiles, Parquet files with a fast compressor like LZO, LZ4 or Snappy
Compression format that supports splitting: bzip2 or splittable LZO
Split the file into chunks and compress each chunk separately using a compression format
I know, and have read many times, that Hadoop is not aware of what's inside the input file and that the split depends on the InputFormat, but let's be more specific... For example, I read that GZIP is not splittable, so if I have a single gzipped input file of 1 TB and none of the nodes has a disk of that size, what happens? Will the input be split, with Hadoop adding information about the dependencies between one chunk and the others? Another question: if I have a huge .xml file, so basically text, how does the split work, by line or by the configured block size in MB?
BZIP2 is splittable in Hadoop: it provides a very good compression ratio, but its CPU time and performance are not optimal, as the compression is very CPU-consuming.
LZO is splittable in Hadoop: leveraging hadoop-lzo you have splittable compressed LZO files. You need external .lzo.index files to be able to process them in parallel. The library provides all the means of generating these indexes in a local or distributed manner.
LZ4 is splittable in Hadoop: leveraging hadoop-4mc you have splittable compressed 4mc files. You don't need any external indexing, and you can generate archives with the provided command-line tool or from Java/C code, inside or outside Hadoop. 4mc makes LZ4 available on Hadoop at any level of speed/compression ratio: from a fast mode reaching 500 MB/s compression speed up to high/ultra modes providing an increased compression ratio, almost comparable to GZIP's.
ZSTD (Zstandard) is now splittable as well in Hadoop/Spark/Flink by leveraging hadoop-4mc.
Please have a look at Hadoop Elephant Bird to process complex input in your jobs. Anyway, XML is not natively splittable in EB or Hadoop, AFAIK.
I would like to use Hadoop Map/Reduce to process delimited Protocol Buffer files that are compressed using something other than LZO, e.g. xz or gzip. Twitter's elephant-bird library seems to mainly support reading protobuf files that are LZO compressed and thus doesn't seem to meet my needs. Is there an existing library or a standard approach to doing this?
(NOTE: As you can see by my choice of compression algorithms, it's not necessary for the solution to make the protobuf files splittable. Your answer doesn't even need to specify a particular compression algorithm, but should allow for at least one of the ones I mentioned.)
You may want to look into the RAgzip patch for Hadoop for processing multiple map tasks for a large gzipped file: RAgzip
I am using HBase to store a lot of sensor data.
I have tried using a txt file to store my sensor data; a 20 MB file reduces to 1 MB on disk when I compress it.
My question is: does HBase itself compress the data automatically when storing it to disk?
Thanks
You can use LZO, GZIP or Snappy for HBase compression. You will need to set up LZO/Snappy yourself if you wish to use them for HBase compression (GZIP is included).
Normally, LZO is faster than GZIP compression, though GZIP's compression ratio is normally better. Snappy is robust, but its compression ratios are normally worse.
When creating a table, you can specify the compression/compression library; HFiles are compressed when written to disk if compression is used (and need to be decompressed when reading).
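As a hedged sketch using the HBase 1.x admin API (the table and family names are placeholders; in HBase 2.x you would use TableDescriptorBuilder/ColumnFamilyDescriptorBuilder instead, and the shell equivalent is roughly create 'sensor_data', {NAME => 'd', COMPRESSION => 'SNAPPY'}):

// Create a table whose column family is Snappy-compressed on disk.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;

public class CreateCompressedTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("sensor_data"));
            HColumnDescriptor family = new HColumnDescriptor("d");
            // HFiles for this family are Snappy-compressed when flushed to disk.
            family.setCompressionType(Compression.Algorithm.SNAPPY);
            table.addFamily(family);
            admin.createTable(table);
        }
    }
}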
hope it helps
You can also alter your table to add compression support later. Then your data will be compressed for real at the next compaction (as ali said, because a new HFile will be written to disk).
As far as I understand, the compression algorithm is applied at the block level, not to the whole HFile. That means that when reading data, HBase won't have to decompress a several-GB HFile but only a few-KB data block.