Gzip compression using the Boost library - C++11

I want to write a program which can compress a directory and all of its files into a .gz file. I have tried using the gzip filter, but I don't know how to add a directory and multiple files. I would also like to decompress the archive again.

gzip by itself only compresses a single stream of data with no assumed structure. To archive directories with gzip, it is most commonly combined with tar, which has built-in support for gzip compression. I'm sure you have seen those sorts of files, which end in .tar.gz. You can probably find a library that processes those files.
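If it helps, below is a minimal C++11 sketch of just the gzip part using boost::iostreams. The file names are placeholders and error handling is omitted; producing the .tar from the directory is assumed to happen beforehand (for example with an external tar invocation or a library such as libarchive), since Boost itself does not provide a tar writer.

    #include <fstream>
    #include <boost/iostreams/copy.hpp>
    #include <boost/iostreams/filter/gzip.hpp>
    #include <boost/iostreams/filtering_streambuf.hpp>

    namespace io = boost::iostreams;

    int main()
    {
        // Compress a single, already-assembled stream (e.g. a .tar produced
        // beforehand) into a .gz file.
        {
            std::ifstream in("archive.tar", std::ios_base::binary);
            std::ofstream out("archive.tar.gz", std::ios_base::binary);

            io::filtering_streambuf<io::input> filtered;
            filtered.push(io::gzip_compressor());
            filtered.push(in);
            io::copy(filtered, out);   // writes the gzip-compressed bytes
        }

        // Decompress it again.
        {
            std::ifstream in("archive.tar.gz", std::ios_base::binary);
            std::ofstream out("archive_restored.tar", std::ios_base::binary);

            io::filtering_streambuf<io::input> filtered;
            filtered.push(io::gzip_decompressor());
            filtered.push(in);
            io::copy(filtered, out);   // writes the original bytes back out
        }
    }

The program needs to be linked against boost_iostreams (and zlib). The same filtering approach works with filtering_ostream if you prefer to write through a stream instead of copying between streambufs.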

Related

Does Mosaic support ingesting compressed data?

We have a scenario of uploading compressed files into a Blob container in Microsoft Azure and then reading them.
Is it possible to do this in Mosaic, and if yes, what is the way to achieve it?
Our files are in .gz format.
Yes, you can upload and read compressed files in Mosaic through the Azure Reader.
Currently, Mosaic supports two compression types - .ZIP & .GZ.
To read compressed files in Mosaic's Azure Reader node, follow these steps -
In the Path field, provide the path of the compressed folder.
Set the Is Compressed toggle to True.
Select the compression type (either .ZIP or .GZ).
In the Compressed Path field, provide the file name without the compression extension.
e.g. if the compressed file is ‘ABC.csv.gz’, then the compressed path would be ‘ABC.csv’.
Similarly, for files compressed in .zip format, the compressed path will be the path of the file within that compressed folder.
e.g. if the compressed folder is ‘ABC.zip’, then the compressed path would be ‘ABC/file.csv’.
Select the format of the file and Validate.

Find the compression codec used for a Hadoop file

Given a compressed file, written on the Hadoop platform, in one of the following formats:
Avro
Parquet
SequenceFile
How can I find which compression codec was used? Assume that one of the following compression codecs is used (and there is no file extension in the file name):
Snappy
Gzip (not supported on Avro)
Deflate (not supported on Parquet)
The Java implementation of Parquet includes the parquet-tools utility, which provides several commands. See its documentation page for building and getting started; more detailed descriptions of the individual commands are printed by parquet-tools itself. The command you are looking for is meta (run as parquet-tools meta <file>). This will show all kinds of metadata, including the compression codecs; example output shows, for instance, SNAPPY compression.
Please note that the compression algorithm does not have to be the same across the whole file. Different column chunks can use different compressions, so there is no single field for the compression codec, but one for each column chunk instead. (A column chunk is the part of a column that belongs to one row group.) In practice, however, you will probably find the same compression codec used for all column chunks.
A similar utility exists for Avro, called avro-tools. I'm not that familiar with it, but it has a getmeta command which should show you the compression codec used.

Does Apache Pig handle bz2 files natively?

I can see that Pig can read .bz2 files natively, but I am not sure whether it runs an explicit job to split the bz2 file into multiple input splits. Can anyone confirm this? If Pig is running a job to create input splits, is there a way to avoid that? I mean, a way to have the MapReduce framework split bz2 files into multiple input splits at the framework level.
Splittable input formats are not implemented in Hadoop (or in Pig, which just runs MapReduce jobs for you) in such a way that a file is split by one job and the splits are then processed by a second job.
The input format defines an isSplitable method (that is how Hadoop spells it) which determines whether, in principle, the file format can be split. In addition to this, most text-based formats will check whether the file is using a known compression codec (for example gzip or bzip2) and whether that codec supports splits (gzip doesn't, in principle, but bz2 does).
If the input format / codec does allow the files to be split, then splits are defined at fixed (and configurable) points in the compressed file (say, every 64 MB). When the map tasks are created to process each split, they ask the input format to create a record reader for the file, passing along the split information that says where the reader should start (the 64 MB block offset). The reader is then told to seek to the split's offset. At this point the underlying codec seeks to that point in the compressed file and scans forward until it finds the next compressed block header (in the case of bz2). Reads then continue as normal on the uncompressed stream returned by the codec, until the split's end point has been passed in the uncompressed stream.
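As a rough illustration of that "scan forward for the next block header" step, the sketch below just searches a byte buffer for the 48-bit bzip2 block magic (0x314159265359). Note this is a simplification: in a real bzip2 stream, block boundaries are bit-aligned rather than byte-aligned, so Hadoop's splittable codec performs this scan at the bit level, and the logic lives inside the codec and record reader rather than in user code.

    #include <algorithm>
    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Find the first bzip2 block magic at or after split_offset.
    // Simplification: real block headers are bit-aligned, not byte-aligned.
    std::size_t find_next_block_header(const std::vector<std::uint8_t>& data,
                                       std::size_t split_offset)
    {
        static const std::array<std::uint8_t, 6> magic =
            {0x31, 0x41, 0x59, 0x26, 0x53, 0x59};

        if (split_offset >= data.size())
            return data.size();

        auto it = std::search(data.begin() + split_offset, data.end(),
                              magic.begin(), magic.end());
        return static_cast<std::size_t>(it - data.begin());   // data.size() if not found
    }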

mapred.min.split.size

I am trying to experiment with this parameter in MapReduce and I have a question.
Does this go by the size in HDFS (whether the file is compressed or not), or by the size after decompression? I guess it is the former, but I just want to confirm.
This parameter will only be used if your input format supports splitting the input files. Common compression codecs (such as gzip) don't support splitting the files, so this will be ignored.
If the input format does support splitting, then this relates to the compressed size.
From Hadoop 0.21, I think, bz2 files are splittable, so you can use bz2.

How to compress or zip a whole folder using GZipStream

Any idea how I can do this? I am able to compress a single file.
You cannot GZip an entire folder directly, since GZip operates on a single stream of data. You will first have to turn the folder into such a stream.
One way to do this would be to create a Tar archive from the directory. This will give you a single stream to work on, and since the Tar format is not compressed, GZip will usually achieve good compression ratios on Tar files.
GZip doesn't support multiple files; they have to be combined in another container first, such as Tar. If you need full Zip support in C#, use this library:
http://www.icsharpcode.net/opensource/sharpziplib/
