Find compression codec used for a Hadoop file - hadoop

Given a compressed file, written on the Hadoop platform, in one of the following formats:
Avro
Parquet
SequenceFile
How can I find the compression codec used? Assuming that one of the following compression codecs is used (and there is no file extension in the file name):
Snappy
Gzip (not supported on Avro)
Deflate (not supported on Parquet)

The Java implementation of Parquet includes the parquet-tools utility, which provides several commands. See its documentation page for building and getting started. More detailed descriptions of the individual commands are printed by parquet-tools itself. The command you are looking for is meta. This will show all kinds of metadata, including the compression codecs. You can find an example output here, showing SNAPPY compression.
Please note that the compression algorithm does not have to be the same across the whole file. Different column chunks can use different compressions, so there is no single field for the compression codec, but one for each column chunk instead. (A column chunk is the part of a column that belongs to one row group.) In practice, however, you will probably find the same compression codec being used for all column chunks.
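If you prefer to do this programmatically, the footer that parquet-tools meta prints can also be read through the Parquet Java API. A minimal sketch, assuming a recent parquet-mr release (older releases used the parquet.* package prefix instead of org.apache.parquet.*):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class ParquetCodecDump {
    public static void main(String[] args) throws Exception {
        // Read only the footer; the data pages are never touched.
        ParquetMetadata footer =
                ParquetFileReader.readFooter(new Configuration(), new Path(args[0]));
        // One codec per column chunk, i.e. per (row group, column) pair.
        for (BlockMetaData block : footer.getBlocks()) {
            for (ColumnChunkMetaData chunk : block.getColumns()) {
                System.out.println(chunk.getPath() + ": " + chunk.getCodec());
            }
        }
    }
}
```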
A similar utility exists for Avro, called avro-tools. I'm not that familiar with it, but it has a getmeta command which should show you the compression codec used.
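The same information is also exposed through the Avro Java API: the codec name is stored in the container file header under the avro.codec metadata key. A minimal sketch:

```java
import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroCodecDump {
    public static void main(String[] args) throws Exception {
        try (DataFileReader<GenericRecord> reader =
                new DataFileReader<>(new File(args[0]), new GenericDatumReader<>())) {
            // Prints e.g. "snappy" or "deflate"; "null" (or no entry at all in
            // some writers) means the file is uncompressed.
            System.out.println(reader.getMetaString("avro.codec"));
        }
    }
}
```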

Related

Best image format for/in CUDA image processing

I am new to image processing in CUDA.
I am currently learning whatever I can about this.
Can anyone tell me what the appropriate format (image file extension) is for storing and accessing image files so that CUDA processing is most efficient?
And why do all the sample CUDA programs for image processing use the .ppm file format for images?
Can I convert images in other formats to that format?
And how can I access those files (in CUDA code)?
Most image formats are created for efficient exchange of images, i.e. on media (hard disks), over the internet, etc.
For computation, the most useful representation of an image is usually in some raw, uncompressed format.
CUDA doesn't have any intrinsic functions for manipulating an image in one of the interchange formats (e.g. .jpg, .png, .ppm, etc.). You should use some other library to convert an image from one of the interchange formats to a raw uncompressed format, and then you can operate on it directly in host code or in CUDA device code. Since CUDA doesn't recognize any interchange format, there is no one format that is correct or best to use. It will depend on other requirements you may have.
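To sketch that decode-to-raw step (Java is used here purely for illustration; a CUDA host program would typically do the same thing with a C/C++ library such as libjpeg or FreeImage), with input.png as a hypothetical input file:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class DecodeToRaw {
    public static void main(String[] args) throws Exception {
        // Decode an interchange format (.png, .jpg, ...) into an in-memory image.
        BufferedImage img = ImageIO.read(new File("input.png")); // hypothetical file
        int w = img.getWidth(), h = img.getHeight();
        // One packed ARGB int per pixel. A flat array like this is the kind of
        // raw, uncompressed buffer you would copy to the GPU and process there.
        int[] pixels = img.getRGB(0, 0, w, h, null, 0, w);
        System.out.println(w + "x" + h + ": " + pixels.length + " pixels");
    }
}
```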
The sample programs that have used the .ppm format have simply done so for convenience. There are plenty of sample codes out there that use other formats such as .jpg or .bmp to store an image used by a CUDA program.

Does Apache Pig handle bz2 files natively?

I can see that Pig can read .bz2 files natively, but I am not sure whether it runs an explicit job to split the bz2 file into multiple input splits. Can anyone confirm this? If Pig is running a job to create input splits, is there a way to avoid that? I mean, a way to have the MapReduce framework split bz2 files into multiple input splits at the framework level?
Splittable input formats in Hadoop (and in Pig, which just runs MR jobs for you) are not implemented such that one job splits a file and a second job processes the splits.
The input format defines an isSplitable method, which determines whether the file format can in principle be split. In addition to this, most text-based formats check whether the file uses a known compression codec (for example gzip or bzip2) and whether that codec supports splitting (gzip doesn't, but bz2 does).
If the input format / codec does allow splitting, then splits are defined at fixed (and configurable) points in the compressed file (say every 64 MB). When the map tasks are created to process each split, they ask the input format to create a record reader for the file, passing in the split information for where the reader should start (the 64 MB block offset). The reader is then told to seek to the offset point of the split. At this point the underlying codec seeks to that point in the compressed file and scans forward until it finds the next compressed block header (in the case of bz2). Reads then continue as normal on the uncompressed stream returned from the codec, until the split end point has been passed in the uncompressed stream.
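For reference, the splittability check described above looks roughly like this in a custom input format. This is a sketch modeled on what Hadoop's own TextInputFormat does; SplittableCompressionCodec is the marker interface that BZip2Codec implements and GzipCodec does not:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitAwareTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec =
                new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        // No codec matched the file name: plain text, always splittable.
        if (codec == null) {
            return true;
        }
        // Compressed input is splittable only if the codec supports starting
        // mid-stream (bzip2 does, gzip does not).
        return codec instanceof SplittableCompressionCodec;
    }
}
```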

Lossless JPEG - can't find any example images, DICOM files

I'm currently working on lossless JPEG files (not JPEG-LS). It's really hard to find any files to test my application on.
In particular, I need files that contain restart interval markers, multiple DC Huffman tables, multiple scans, or comment markers.
Do you know where I could find any lossless JPEG files? Do you yourself have any that you could share?
Thanks in advance, Witek.
EDIT: I could also use DICOM files using this compression standard (tag (0002,0010) Transfer Syntax UID = 1.2.840.10008.1.2.4.70).
On the following site you can find a few DICOM lossless JPEG files, in particular with the transfer syntaxes 1.2.840.10008.1.2.4.57 and .70. Consult the Transfer Syntax section to easily identify which data sets provide the requested transfer syntax.
There are also a number of lossless JPEG images of different flavors on the NEMA DICOM FTP site. For more detailed information on the various data sets, please consult the README file.
Here's a large collection of DICOM sample images: there are some JPEG lossless images among them. Some subfolders have images that are not valid DICOM, but that is usually documented. By the same maintainer there is also this list of links.
Lossless JPEG is most widely used in XA (cathlab) cine images. These are always grayscale, and exist as 8- or 10-bit images.
You could also set up a free PACS like DCM4CHEE or Conquest, send it uncompressed images, and have it forward the images JPEG-lossless compressed. The advantage of this is that you can create images of different color spaces, bit depths, planar/by-pixel configurations, etcetera. Color spaces are interesting: people sometimes mistakenly transform the color space as they would for lossy JPEG, which you should not do.
Most likely none of these images require advanced stuff like restart markers. If you want to check if this works, create bitstreams with the IJG implementation and package them in DICOM.
EDIT: be warned that there are buggy images out there. I am using an implementation based on the IJG code.

mapred.min.split.size

I am trying to experiment with this parameter in MapReduce and I have a question.
Does this go by the size in HDFS (whether the file is compressed or not)? Or is it the size after decompression? I guess it is the former, but I just want to confirm.
This parameter will only be used if your input format supports splitting the input files. Common compression codecs (such as gzip) don't support splitting the files, so this will be ignored.
If the input format does support splitting, then this relates to the compressed size.
I think bz2 files are splittable from Hadoop 0.21 on, so you can use bz2.
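For reference, a minimal sketch of setting the minimum split size through the newer MapReduce API (mapred.min.split.size is the old property name; later releases call it mapreduce.input.fileinputformat.split.minsize):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MinSplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "min-split-size-demo");
        // Ask for splits of at least 128 MB. For a splittable compressed input
        // (e.g. bz2) this threshold applies to the compressed size in HDFS.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
    }
}
```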

File writer filter creating a bigger AVI file than the original

I am using the SampleGrabber filter to get the frames of an AVI file and alter them before writing them to another (new) AVI file using the File writer filter.
The problem that I am facing is that the new AVI file is bigger than the original file. I removed the SampleGrabber filter, thinking that my code might be causing the problem, but the new file is still bigger than the original. I tested it with GraphEdit.
The filters used were File reader -> AVI Splitter -> AVI Mux -> File writer.
I really want to preserve the file size. Is there any other filter or property that I have to set? At the moment I am only adding the filters in GraphBuilder and rendering the file.
I am using DirectShowLib.Net.
I just did a quick test using
File source (async) -> AVI splitter -> AVI mux -> file writer
in GraphEdit, and the output file always seems to come out the same size as the input for me. The only thing I can think of is that your input file might be compressed. It might be worth inspecting the input file with an app like GSpot to determine that. As I understand it, DirectShow will sometimes insert appropriate filters in order to make a connection, so if you're trying to connect your file source to an AVI splitter it may insert a decompressor if needed. Hope that's of some use
