mapred.min.split.size - hadoop

I am trying to experiment with this parameter in MapReduce and I have a question.
Does this go by the size in HDFS (whether it is compressed or not)? Or is it after uncompression? I guess it is the former but just want to confirm.

This parameter will only be used if your input format supports splitting the input files. Common compression codecs (such as gzip) don't support splitting the files, so this will be ignored.
If the input format does support splitting, then this relates to the compressed size.
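For example, here is a minimal sketch of requesting a larger minimum split size from a job (assuming the org.apache.hadoop.mapreduce API; mapred.min.split.size is the older name of the mapreduce.input.fileinputformat.split.minsize property):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-size-test");
            // Ask for splits of at least 128 MB of data as stored on HDFS.
            // Only honoured when the input format / codec actually supports splitting.
            FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
            // Equivalent to setting the property directly:
            // conf.setLong("mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024);
        }
    }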

From Hadoop 0.21, I think bz2 files are splittable, so you can use bz2.

Related

Find compression codec used for a Hadoop file

Given a compressed file, written on hadoop platform, in one of the following formats:
Avro
Parquet
SequenceFile
How can I find the compression codec used? Assuming that one of the following compression codecs is used (and there is no file extension in the file name):
Snappy
Gzip (not supported on Avro)
Deflate (not supported on Parquet)
The Java implementation of Parquet includes the parquet-tools utility, providing several commands. See its documentation page for building and getting started. The more detailed descriptions of the individual commands are printed by parquet-tools itself. The command you are looking for is meta. This will show all kinds of metadata, including the compression. You can find an example output here, showing SNAPPY compression.
Please note that the compression algorithm does not have to be the same across the whole file. Different column chunks can use different compressions, therefore there is no single field for the compression codec, but one for each column chunk instead. (A column chunk is the part of a column that belongs to one row group.) In practice, however, you will probably find the same compression codec being used for all column chunks.
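If you would rather check this from code than from the command line, a sketch along these lines prints the codec of every column chunk (assuming the parquet-hadoop Java API; older releases use the parquet.hadoop package instead of org.apache.parquet.hadoop):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;

    public class ParquetCodecs {
        public static void main(String[] args) throws Exception {
            // Read only the footer metadata of the file named on the command line.
            ParquetMetadata footer =
                    ParquetFileReader.readFooter(new Configuration(), new Path(args[0]));
            // Every column chunk of every row group records its own codec.
            for (BlockMetaData block : footer.getBlocks()) {
                for (ColumnChunkMetaData chunk : block.getColumns()) {
                    System.out.println(chunk.getPath() + " -> " + chunk.getCodec());
                }
            }
        }
    }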
A similar utility exists for Avro, called avro-tools. I'm not that familiar with it, but it has a getmeta command which should show you the compression codec used.
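For what it's worth, an Avro container file stores its codec under the avro.codec metadata key, so a small Java sketch like the following should also work (assuming the standard org.apache.avro.file API; a missing or "null" entry means no compression):

    import java.io.File;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class AvroCodec {
        public static void main(String[] args) throws Exception {
            // The container header carries the codec in its "avro.codec" metadata entry.
            DataFileReader<GenericRecord> reader =
                    new DataFileReader<>(new File(args[0]), new GenericDatumReader<>());
            System.out.println(reader.getMetaString("avro.codec"));
            reader.close();
        }
    }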

Gzip compression using boost library

I want to write a program which can compress a directory and all its files into a .gz file. I have tried using the gzip filter, but I don't know how I can add a directory and multiple files. I would also like to uncompress the same.
gzip by itself only compresses a single stream of data with no assumed structure. To archive directories using gzip, it is most commonly combined with tar, which has the ability to compress using gzip built in. I'm sure you have seen those sorts of files, which end in .tar.gz. You can probably find a library that processes those files.

Apache Pig handles bz2 files natively?

I can see that Pig can read .bz2 files natively, but I am not sure whether it runs an explicit job to split bz2 into multiple input splits. Can anyone confirm this? If Pig is running a job to create input splits, is there a way to avoid that? I mean a way to have the MapReduce framework split bz2 files into multiple input splits at the framework level?
Splittable input formats are not implemented in Hadoop (or in Pig, which just runs MR jobs for you) in such a way that a file is split by one job and the splits are then processed by a second job.
The input format defines an isSplitable method which determines whether, in principle, the file format can be split. In addition to this, most text-based formats will check whether the file is using a known compression codec (for example gzip or bzip2) and whether that codec supports splits (gzip doesn't, but bz2 does).
If the input format / codec does allow splitting of the files, then splits are defined at fixed (and configurable) points in the compressed file (say every 64 MB). When the map tasks are created to process each split, they ask the input format to create a record reader for the file, passing in the split information for where the reader should start (the 64 MB offset). The reader is then told to seek to the offset point of the split. At this point the underlying codec will seek to that point in the compressed file and scan forward until it finds the next compressed block header (in the case of bz2). Reads then continue as normal on the uncompressed stream returned from the codec, until the split end point has been passed in the uncompressed stream.
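Roughly, the splittability check looks like the sketch below (modeled on what TextInputFormat does; BZip2Codec implements SplittableCompressionCodec, GzipCodec does not):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class MyTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // Uncompressed files are always splittable; compressed files only if
            // their codec implements SplittableCompressionCodec (e.g. bzip2).
            CompressionCodec codec =
                    new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
            return codec == null || codec instanceof SplittableCompressionCodec;
        }
    }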

Lossless JPEG - can't find any example images, DICOM files

I'm currently working on lossless JPEG files (not JPEG-LS). It's really hard to find any files to test my application on.
In particular, I need files that contain restart interval markers, multiple DC Huffman tables, multiple scans or comment markers.
Do you know where I could find any lossless JPEG files? Do you yourself have any that you could share?
Thanks in advance, Witek.
EDIT: I could also use DICOM files using this compression standard (tag (0002,0010) Transfer Syntax UID = 1.2.840.10008.1.2.4.70).
On the following site you can find a few DICOM lossless JPEG files, in particular with the transfer syntaxes 1.2.840.10008.1.2.4.57 and .70. Consult the Transfer Syntax section for easy identification of which data sets provide the requested transfer syntax.
There are also a number of lossless JPEG images of different flavors on the NEMA DICOM FTP site. For more detailed information on the various data sets, please consult the README file.
Here's a large collection of DICOM sample images; there are some JPEG lossless images among them. Some subfolders have images that are not valid DICOM, but that is usually documented. By the same maintainer there is also this list of links.
Lossless JPEG is most widely used in XA (cathlab) cine images. These are always grayscale, and exist as 8- or 10-bit images.
You could also set up a free PACS like DCM4CHEE or Conquest, send it uncompressed images and have it forward the images JPEG-lossless compressed. The advantage of this is that you can create images of different color spaces, bit depths, planar/by-pixel configurations, etcetera. Color spaces are interesting: people sometimes mistakenly transform the color space as they would for lossy JPEG, which you should not do.
Most likely none of these images require advanced stuff like restart markers. If you want to check if this works, create bitstreams with the IJG implementation and package them in DICOM.
EDIT: be warned that there are buggy images out there. I am using an implementation based on the IJG code.

How to compress or Zip whole folder using GZipStream

Any idea how I can do this? I am able to compress a single file.
You cannot GZip an entire folder directly, since GZip operates on a single stream of data. You will first have to turn the folder into such a stream.
One way to do this would be to create a Tar archive from the directory. This will give you a single stream to work on, and since the Tar format is not compressed, GZip will usually achieve good compression ratios on Tar files.
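This thread is about .NET, but just to illustrate the layering (tar the folder, then gzip the tar stream), here is a rough Java sketch using Apache Commons Compress; the same structure applies whichever tar library you end up using:

    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.nio.file.Files;
    import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
    import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;

    public class TarGzFolder {
        public static void main(String[] args) throws Exception {
            File folder = new File(args[0]);
            // Tar stream layered on top of a gzip stream: the archive provides the
            // multi-file structure, gzip provides the compression.
            try (TarArchiveOutputStream tar = new TarArchiveOutputStream(
                    new GzipCompressorOutputStream(
                            new BufferedOutputStream(new FileOutputStream(args[0] + ".tar.gz"))))) {
                for (File f : folder.listFiles()) { // flat folder; recurse for subdirectories
                    tar.putArchiveEntry(new TarArchiveEntry(f, f.getName()));
                    Files.copy(f.toPath(), tar);    // write the file body into the entry
                    tar.closeArchiveEntry();
                }
            }
        }
    }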
GZip doesn't support multiple files; they have to be combined into another container first, like Tar. If you need full Zip support in C#, use this library:
http://www.icsharpcode.net/opensource/sharpziplib/
