Two files of same size are having totally different size after compression using 7zip - 7zip

Two csv files of 3 GB each are compressed using 7zip. After compression one file is having size of 242 MB and other is of 141 MB. How this is possible?
Both files contains same data format. However, data can be different.
Please let me know is anything which is the reason of higher compressed file size i.e. 242 MB.

I'm not an expert of compression, but I think that it depends strongly of the content before compression. For example, if there are many repetitions of sequences in one file, it will be easy to compress it, but if there are few, the compression won't be able to do as well.
For example, a 3GB text file containing the same words many times will be significantly smaller after compression, but a ZIP file of 3GB will not become smaller and may event be larger after compressing it.

Content of files is significant in compression. Basicaly the more redundant data is in file the smaller this file can be after compression. Here is really nice explanation how compression works: https://computer.howstuffworks.com/file-compression.htm

Related

Spark not same input/ouput directory size (for same data)

In order to reduce the number of blocks allocated by the NameNode. I'm trying to concatenate some small files to 128MB files. These small files are in gz format and the 128MB files must be in gz format too.
To accomplish this, I'm getting the sum size of all small files and divide this sum size(in MB) by 128 to get the number of files I need.
Then I perform a rdd.repartition(nbFiles).saveAsTextFile(PATH,classOf[GzipCodec])
The problem is that my output directory size is higher thant my input directory size (10% higher). I tested with default and best compression level and I'm always getting an higher output size.
I have no idea why my output directory is getting higher than my input directory but I imagine it's linked to the fact that i'm repartitioning all the files of the input directory.
Can someone help me to understand why i'm getting this result?
Thanks :)
Level of compression will depend on the data distribution. When you rdd.repartition(nbFiles) you randomly shuffle all the data so if there was some structure in the input, which reduced entropy and enabled better compression, it will be lost.
You can try some other approach, like colaesce without shuffle or sorting to see if you can get a better result.

Compressed file VS uncompressed file in mapreduce. which one gives better performence?

I have a 10 GB csv file and i want to process it in Hadoop MapReduce.
I have a 15 nodes(Datanode) cluster and i want to maximize the throughput.
What compression format should i use ? or Text file without compression will always give me better result over the compressed Text file. please explain the reason.
I used uncompressed file and it gave me better results over Snappy . Why is it so?
The problem with Snappy compression is that it is not splittable, so Hadoop can't divide input file into chunks and run several mappers for input. So most likely your 10Gb file is processed by a single mapper (check it in application history UI). Since hadoop stores big files in separate blocks on different machines, some parts of this file are not even located on the mapper machine and have to be transferred over the network. That seems to be the main reason why Snappy compressed file works slower than plain text.
To avoid the problem you can use bzip2 compression or divide the file into chunks manually and compress each part with snappy.

Gzip: merge a set of small files (<64mb) into several larger files (64mb or 128mb)

I have around 14000 small .gz files (~from 90kb to 4mb) which are loaded into HDFS all in the same directory.
So the size of each of them is far away from the standard 64mb or 128mb block size of HDFS, which can lead to serious trouble (the "small files problem", see this blog post by cloudera) when running MR jobs which process these files.
The aforementioned blog post contains a number of solutions to this problem, which mostly involve writing a MapReduce Job or using Hadoop Archives (HAR).
However, I would like to tackle the problem at the source and merge the small files into 64mb or 128mb .gz files which will then be fed directly into HDFS.
What's the simplest way of doing this?
cat small-*.gz > large.gz
should be enough. Assuming you don't need to extract separate files from there, and the data is enough.
If you want separate files, just tar it:
tar cf large.tar small-*.gz
After experimenting a bit further, the following two steps do what I want:
zcat small-*.gz | split -d -l2000000 -a 3 - large_
This works in my case, because there is very little variance in the length of a line.
2000000 lines correspond to almost exactly 300Mb files.
Unfortunately, for some reason, gzip cannot be piped like this, so I have to do another step:
gzip *
This will then also compress the generated large files.
Gzip compresses each of these files by a factor of ~5, leading to 60mb files and thus satisfying my initial constraint of receiving .gz files < 64mb.

Hadoop MR: better to have compressed input files or raw files?

as can be derived from the question, I want to know when it makes sense to have input files in compressed format (like gzip) and when it makes sense to have input files in uncompressed format.
What is the overhead of having compressed files? Is it much slower when reading the file? Are there any benchmarks done on big input files?
Thx!
It mostly makes sense to have input files in compressed format unless you are doing development and you need to frequently read data from HDFS to local file system for working on it.
Compressed format provides significant advantage. The data is already replicated in Hadoop cluster unless you set it other wise. Replicated data is good redundancy but consumes more space. If all your data is replicated with a factor of 3, you are going to consume 3 times the capacity required to store it.
Compression on textual data like log data is very effective as it yield high compression ratio. This is also the kind of data that you usually find more often in Hadoop cluster.
I don't have benchmarks but I have not seen any significant penalty on a decent sized cluster and data that we have.
How ever, for time being choose LZO over gzip.
See: LZO compression and it's significance over gzip
Gzip compresses better than LZO. LZO is faster at compressing and uncompressing. It is possible to split Lzo files, splittable Gzip is not available but I have seen Jira tasks for the same. (Also for bzip2)
Lets put reasons to compress vs reasons not to compress.
For:
a) Data is mostly stored and not frequently processed. It is usual DWH scenario. In this case space saving can be much more significant then processing overhead
b) Compression factor is very high and thereof we save a lot of IO.
c) Decompression is very fast (like Snappy) and thereof we have a some gain with little price
d) Data already arrived compressed
Against:
a) Compressed data is not splittable. Have to be noted that many modern format are built with block level compression to enable splitting and other partial processing of the files.
b) Data is created in the cluster and compression takes significant time. Have to be noted that compression usually much more CPU intensive then decompression.
c) Data has little redundancy and compression gives little gain.
1) Compressing input files
If the input file is compressed, then the bytes read in from HDFS is reduced, which means less time to read data. This time conservation is beneficial to the performance of job execution.
If the input files are compressed, they will be decompressed automatically as they are read by MapReduce, using the filename extension to determine which codec to use. For example, a file ending in .gz can be identified as gzip-compressed file and thus read with GzipCodec.
2) Compressing output files
Often we need to store the output as history files. If the amount of output per day is extensive, and we often need to store history results for future use, then these accumulated results will take extensive amount of HDFS space. However, these history files may not be used very frequently, resulting in a waste of HDFS space. Therefore, it is necessary to compress the output before storing on HDFS.
3) Compressing map output
Even if your MapReduce application reads and writes uncompressed data, it may benefit from compressing the intermediate output of the map phase. Since the map output is written to disk and transferred across the network to the reducer nodes, by using a fast compressor such as LZO or Snappy, you can get performance gains simply because the volume of data to transfer is reduced.
2. Common input format
gzip:
gzip is naturally supported by Hadoop. gzip is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman Coding.
bzip2:
bzip2 is a freely available, patent free (see below), high-quality data compressor. It typically compresses files to within 10% to 15% of the best available techniques (the PPM family of statistical compressors), whilst being around twice as fast at compression and six times faster at decompression.
LZO:
The LZO compression format is composed of many smaller (~256K) blocks of compressed data, allowing jobs to be split along block boundaries. Moreover, it was designed with speed in mind: it decompresses about twice as fast as gzip, meaning it’s fast enough to keep up with hard drive read speeds. It doesn’t compress quite as well as gzip — expect files that are on the order of 50% larger than their gzipped version. But that is still 20-50% of the size of the files without any compression at all, which means that IO-bound jobs complete the map phase about four times faster.
Snappy:
Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more. Snappy is widely used inside Google, in everything from BigTable and MapReduce to our internal RPC systems.
Some tradeoffs:
All compression algorithms exhibit a space/time trade-off: faster compression and decompression speeds usually come at the expense of smaller space savings. The tools listed in above table typically give some control over this trade-off at compression time by offering nine different options: –1 means optimize for speed and -9 means optimize for space.
The different tools have very different compression characteristics. Gzip is a general purpose compressor, and sits in the middle of the space/time trade-off. Bzip2 compresses more effectively than gzip, but is slower. Bzip2’s decompression speed is faster than its compression speed, but it is still slower than the other formats. LZO and Snappy, on the other hand, both optimize for speed and are around an order of magnitude faster than gzip, but compress less effectively. Snappy is also significantly faster than LZO for decompression.
3. Issues about compression and input split
When considering how to compress data that will be processed by MapReduce, it is important to understand whether the compression format supports splitting. Consider an uncompressed file stored in HDFS whose size is 1 GB. With an HDFS block size of 64 MB, the file will be stored as 16 blocks, and a MapReduce job using this file as input will create 16 input splits, each processed independently as input to a separate map task.
Imagine now the file is a gzip-compressed file whose compressed size is 1 GB. As before, HDFS will store the file as 16 blocks. However, creating a split for each block won’t work since it is impossible to start reading at an arbitrary point in the gzip stream and therefore impossible for a map task to read its split independently of the others. The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed blocks. The problem is that the start of each block is not distinguished in any way that would allow a reader positioned at an arbitrary point in the stream to advance to the beginning of the next block, thereby synchronizing itself with the stream. For this reason, gzip does not support splitting.
In this case, MapReduce will do the right thing and not try to split the gzipped file, since it knows that the input is gzip-compressed (by looking at the filename extension) and that gzip does not support splitting. This will work, but at the expense of locality: a single map will process the 16 HDFS blocks, most of which will not be local to the map. Also, with fewer maps, the job is less granular, and so may take longer to run.
If the file in our hypothetical example were an LZO file, we would have the same problem since the underlying compression format does not provide a way for a reader to synchronize itself with the stream. However, it is possible to preprocess LZO files using an indexer tool that comes with the Hadoop LZO libraries. The tool builds an index of split points, effectively making them splittable when the appropriate MapReduce input format is used.
A bzip2 file, on the other hand, does provide a synchronization marker between blocks (a 48-bit approximation of pi), so it does support splitting.
4. IO-bound and CPU bound
Storing compressed data in HDFS allows your hardware allocation to go further since compressed data is often 25% of the size of the original data. Furthermore, since MapReduce jobs are nearly always IO-bound, storing compressed data means there is less overall IO to do, meaning jobs run faster. There are two caveats to this, however: some compression formats cannot be split for parallel processing, and others are slow enough at decompression that jobs become CPU-bound, eliminating your gains on IO.
The gzip compression format illustrates the first caveat. Imagine you have a 1.1 GB gzip file, and your cluster has a 128 MB block size. This file will be split into 9 chunks of size approximately 128 MB. In order to process these in parallel in a MapReduce job, a different mapper will be responsible for each chunk. But this means that the second mapper will start on an arbitrary byte about 128MB into the file. The contextful dictionary that gzip uses to decompress input will be empty at this point, which means the gzip decompressor will not be able to correctly interpret the bytes. The upshot is that large gzip files in Hadoop need to be processed by a single mapper, which defeats the purpose of parallelism.
Bzip2 compression format illustrates the second caveat in which jobs become CPU-bound. Bzip2 files compress well and are even splittable, but the decompression algorithm is slow and cannot keep up with the streaming disk reads that are common in Hadoop jobs. While Bzip2 compression has some upside because it conserves storage space, running jobs now spend their time waiting on the CPU to finish decompressing data, which slows them down and offsets the other gains.
5. Summary
Reasons to compress:
a) Data is mostly stored and not frequently processed. It is usual DWH scenario. In this case space saving can be much more significant then processing overhead
b) Compression factor is very high and thereof we save a lot of IO.
c) Decompression is very fast (like Snappy) and thereof we have a some gain with little price
d) Data already arrived compressed
Reasons not to compress
a) Compressed data is not splittable. Have to be noted that many modern format are built with block level compression to enable splitting and other partial processing of the files. b) Data is created in the cluster and compression takes significant time. Have to be noted that compression usually much more CPU intensive then decompression.
c) Data has little redundancy and compression gives little gain.

Storage format in HDFS

How Does HDFS store data?
I want to store huge files in a compressed fashion.
E.g : I have a 1.5 GB of file, with default replication factor of 3.
It requires (1.5)*3 = 4.5 GB of space.
I believe currently no implicit compression of data takes place.
Is there a technique to compress the file and store it in HDFS to save disk space ?
HDFS stores any file in a number of 'blocks'. The block size is configurable on a per file basis, but has a default value (like 64/128/256 MB)
So given a file of 1.5 GB, and block size of 128 MB, hadoop would break up the file into ~12 blocks (12 x 128 MB ~= 1.5GB). Each block is also replicated a configurable number of times.
If your data compresses well (like text files) then you can compress the files and store the compressed files in HDFS - the same applies as above, so if the 1.5GB file compresses to 500MB, then this would be stored as 4 blocks.
However, one thing to consider when using compression is whether the compression method supports splitting the file - that is can you randomly seek to a position in the file and recover the compressed stream (GZIp for example does not support splitting, BZip2 does).
Even if the method doesn't support splitting, hadoop will still store the file in a number of blocks, but you'll lose some benefit of 'data locality' as the blocks will most probably be spread around your cluster.
In your map reduce code, Hadoop has a number of compression codecs installed by default, and will automatically recognize certain file extensions (.gz for GZip files for example), abstracting you away from worrying about whether the input / output needs to be compressed.
Hope this makes sense
EDIT Some additional info in response to comments:
When writing to HDFS as output from a Map Reduce job, see the API for FileOutputFormat, in particular the following methods:
setCompressOutput(Job, boolean)
setOutputCompressorClass(Job, Class)
When uploading files to HDFS, yes they should be pre-compressed, and with the associated file extension for that compression type (out of the box, hadoop supports gzip with the .gz extension, so file.txt.gz would denote a gzipped file)
Some time ago I tried to summarize that in a blog post here.
Essentially that is a question of data splittability, as a file is devided into blocks which are elementary blocks for replication. Name node is responsible for keeping track of all those blocks belonging to one file. It is essential that block is autonomous when choosing compression - not all codecs are splittable. If the format + codec is not splittable that means that in order to decompress it it needs to be in one place which has big impact on parallelism in mapreduce. Essentially running in single slot.
Hope that helps.
Have a look at presentation # Hadoop_Summit, especially Slide 6 and Slide 7.
If DFS block size is 128 MB, for 4.5 GB storage (including replication factor of 3), you need 35.15 ( ~36 blocks)
Only bzip2 file format is splittable. In other formats, all blocks of entire files are stored in same Datanode
Have a look at algorithm types and class names and codecs
#Chris White answer provides information on how to enable zipping while writing Map output
The answer to this question is to first understand the file format available in Hadoop today. There is now choice available within HDFS that can manage file format and compression techniques. Alternative to explicit encoding and splitting using LZO or BZIP. There is many format that today support block compression and columnar row compression with features.
A storage format is a way you define how information is to be stored. This is sometimes usually indicated by the extension of the file. For example we know images can be several storage formats, PNG, JPG, and GIF etc. All these formats can store the same image, but each has specific storage characteristics.
In Hadoop filesystem you have all of traditional storage formats available to you (like you can store PNG and JPG images on HDFS if you like), but you also have some Hadoop-focused file formats to use for structured and unstructured data.
Why is it important to know these formats
In any performance tradeoffs, a huge bottleneck for HDFS-enabled applications like MapReduce, Hive, HBase, and Spark is the time it takes to find relevant data in a particular location and the time it takes to write the data back to another location. These issues are accentuated when you manage large datasets. The Hadoop file formats have evolved to ease these issues across a number of use cases.
Choosing an appropriate file format can have some significant benefits:
Optimum read time
Optimum write time
Spliting or partitioning of files (so you don’t need to read the whole file, just a part of it)
Schema adaption (allowing a field changes to a dataset) Compression support (without sacrificing these features)
Some file formats are designed for general use, others are designed for more specific use cases (like powering a database), and some are designed with specific data characteristics in mind. So there really is quite a lot of choice when storing data in Hadoop and one should know to optimally store data in HDFS. Currently my go to storage is ORC format.
Check if your Big data components (Spark, Hive, HBase etc) support these format and make the decision accordingly. For example, I am currently injecting data into Hive and converting it into ORC format which works for me in terms of compression and performance.
Some common storage formats for Hadoop include:
Plain text storage (eg, CSV, TSV files, Delimited file etc)
Data is laid out in lines, with each line being a record. Lines are terminated by a newline character \n in the typical UNIX world. Text-files are inherently splittable. but if you want to compress them you’ll have to use a file-level compression codec that support splitting, such as BZIP2. This is not efficient and will require a bit of work when performing MapReduce tasks.
Sequence Files
Originally designed for MapReduce therefore very easy to integrate with Hadoop MapReduce processes. They encode a key and a value for each record and nothing more. Stored in a binary format that is smaller than a text-based format. Even here it doesn't encode the key and value in anyway. One benefit of sequence files is that they support block-level compression, so you can compress the contents of the file while also maintaining the ability to split the file into segments for multiple map tasks. Though still not efficient as per statistics like Parquet and ORC.
Avro
The format encodes the schema of its contents directly in the file which allows you to store complex objects natively. Its file format with additional framework for, serialization and deserialization framework. With regular old sequence files you can store complex objects but you have to manage the process. It also supports block-level compression.
Parquet
My favorite and hot format these days. Its a columnar file storage structure while it encodes and writes to the disk. So datasets are partitioned both horizontally and vertically. One huge benefit of columnar oriented file formats is that data in the same column tends to be compressed together which can yield some massive storage optimizations (as data in the same column tends to be similar). Try using this if your processing can optimally use column storage. You can refer to advantages of columnar storages.
If you’re chopping and cutting up datasets regularly then these formats can be very beneficial to the speed of your application, but frankly if you have an application that usually needs entire rows of data then the columnar formats may actually be a detriment to performance due to the increased network activity required.
ORC
ORC stands for Optimized Row Columnar which means it can store data in an optimized way than the other file formats. ORC reduces the size of the original data up to 75%(eg: 100GB file will become 25GB). As a result the speed of data processing also increases. ORC shows better performance than Text, Sequence and RC file formats.
An ORC file contains rows data in groups called as Stripes along with a file footer. ORC format improves the performance when Hive is processing the data.
It is similar to the Parquet but with different encoding technique. Its not for this thread but you can lookup on Google for differences.

Resources