Does Apache Pig handle bz2 files natively? - hadoop

I can see that Pig can read .bz2 files natively, but I am not sure whether it runs an explicit job to split the bz2 file into multiple input splits. Can anyone confirm this? If Pig is running a job to create input splits, is there a way to avoid that? I mean, a way to have the MapReduce framework split bz2 files into multiple input splits at the framework level?

Splittable input formats in Hadoop (or in Pig, which just runs MR jobs for you) are not implemented such that one job splits a file and a second job then processes the splits.
The input format defines an isSplitable method which determines whether, in principle, the file format can be split. In addition to this, most text-based formats will check whether the file is using a known compression codec (for example gzip or bzip2) and whether that codec supports splits (gzip doesn't, in principle, but bz2 does).
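For reference, a minimal sketch of that check, closely modeled on (but simplified from) Hadoop's TextInputFormat.isSplitable(); the class name here is illustrative only:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplittabilityCheck extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec =
            new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        if (codec == null) {
            return true;  // plain, uncompressed text can always be split
        }
        // bzip2's codec implements SplittableCompressionCodec; gzip's does not
        return codec instanceof SplittableCompressionCodec;
    }
}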
If the input format / codec does allow splitting of the files, then splits are defined at fixed (and configurable) points in the compressed file (say every 64 MB). When the map tasks are created to process each split, they ask the input format to create a record reader for the file, passing in the split information that tells the reader where to start (the 64 MB block offset). The reader is then told to seek to the offset of the split. At this point the underlying codec seeks to that point in the compressed file and scans forward until it finds the next compressed block header (in the case of bz2). Reads then continue as normal on the uncompressed stream returned from the codec, until the split end point has been passed in the uncompressed stream.
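And a rough sketch of that seek-and-scan step, simplified from the way Hadoop's LineRecordReader initializes itself for a split; the class and method names are illustrative only:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.compress.SplitCompressionInputStream;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class Bzip2SplitReaderSketch {
    public static void openSplit(FileSplit split, Configuration conf) throws IOException {
        Path file = split.getPath();
        FSDataInputStream fileIn = file.getFileSystem(conf).open(file);
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        if (codec instanceof SplittableCompressionCodec) {
            Decompressor decompressor = CodecPool.getDecompressor(codec);
            // The codec seeks to the split start and scans forward to the next
            // compressed block header; records are then read from the
            // uncompressed stream until the (adjusted) split end is passed.
            SplitCompressionInputStream in =
                ((SplittableCompressionCodec) codec).createInputStream(
                    fileIn, decompressor,
                    split.getStart(), split.getStart() + split.getLength(),
                    SplittableCompressionCodec.READ_MODE.BYBLOCK);
            long actualStart = in.getAdjustedStart();
            long actualEnd = in.getAdjustedEnd();
            // ... read records from 'in' between actualStart and actualEnd ...
        }
    }
}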

Related

How to split a video up into <2.5GB parts with FFmpeg

I am trying to achieve a way to send large video files through Firefox Send.
Because Firefox Send has a 2.5 GB limit per file that one sends, I need to break up a video file into parts that are each less than 2.5GB.
Is there a relatively simple way to reliably split a video based on data limits using FFmpeg, rather than using duration? (Using duration would be unreliable, because different equal-length portions of a video can have different sizes.)
EDIT 1: I apologize for the lack of clarity; I was planning on using a Bash script using FFmpeg and ffsend. I was wondering if there is any way to do this through video processing rather than zip compression.
The standard utility split is intended for precisely this sort of thing.
# sender does:
split -b 2500m file.mpg file.mpg__split_
# recipient downloads all the pieces and does:
cat file.mpg__split_* > file.mpg
A disadvantage of this procedure is that the individual parts are not usable on their own.
An advantage is that the final output is identical to the original.

Find compression codec used for a Hadoop file

Given a compressed file, written on the Hadoop platform, in one of the following formats:
Avro
Parquet
SequenceFile
How can I find the compression codec used? Assuming that one of the following compression codecs is used (and there is no file extension in the file name):
Snappy
Gzip (not supported on Avro)
Deflate (not supported on Parquet)
The Java implementation of Parquet includes the parquet-tools utility, providing several commands. See its documentation page for building and getting started. More detailed descriptions of the individual commands are printed by parquet-tools itself. The command you are looking for is meta. This will show all kinds of metadata, including the compression. You can find an example output here, showing SNAPPY compression.
Please note that the compression algorithm does not have to be the same across the whole file. Different column chunks can use different compressions, therefore there is no single field for the compression codec, but one for each column chunk instead. (A column chunk is the part of a column that belongs to one row group.) In practice, however, you will probably find the same compression codec being used for all column chunks.
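If you'd rather inspect the metadata programmatically than run parquet-tools, here is a hedged sketch using the parquet-hadoop API (the class name is illustrative); note how each column chunk reports its own codec:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class ParquetCodecInspector {
    public static void main(String[] args) throws Exception {
        Path file = new Path(args[0]);  // path to the Parquet file
        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))) {
            ParquetMetadata footer = reader.getFooter();
            // Each row group has its own column chunks, and each chunk records its codec
            for (BlockMetaData rowGroup : footer.getBlocks()) {
                for (ColumnChunkMetaData chunk : rowGroup.getColumns()) {
                    System.out.println(chunk.getPath() + " -> " + chunk.getCodec());
                }
            }
        }
    }
}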
A similar utility exists for Avro, called avro-tools. I'm not that familiar with it, but it has a getmeta command which should show you the compression codec used.
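Similarly, a hedged sketch of reading the codec straight from an Avro container file's metadata (the avro.codec key) with the Avro Java API, instead of running avro-tools getmeta:

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroCodecInspector {
    public static void main(String[] args) throws Exception {
        File file = new File(args[0]);  // path to the Avro container file
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<>())) {
            // "avro.codec" holds the codec name, e.g. "snappy" or "deflate";
            // it is absent or "null" when the file is uncompressed
            System.out.println("codec: " + reader.getMetaString("avro.codec"));
        }
    }
}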

Parsing split video with ffmpeg

I have a video file split into a few chunks. The splits were made at random file positions, but the chunks are large enough.
I need to parse every part with a different instance of AVFormatContext. The chunks come one after another in the right order. I think there are two options here:
Being able to save and restore AVFormatContext state;
Save video file header (from first chunk) and attach it to every chunk.
I tried both but had no success. The first approach requires going too deep beyond FFmpeg's public API. With the second approach, I am unable to merge the header with a new chunk in a way that FFmpeg can handle.
Can you help me with this?
Thank you.
It totally depends on the container type. With MP4, for example, the header must be completely rewritten and cannot just be copied. With FLV the header can probably just be copied, but the file MUST be split on a frame boundary, not at a random position. TS could handle this, but you would lose a frame at each cut point.
Realistically, the file will need to be reassembled, then split correctly.

mapred.min.split.size

I am trying to experiment with this parameter in MapReduce and I have some questions.
Does this go by the size of the file as stored in HDFS (i.e. the compressed size, if it is compressed), or by the size after decompression? I guess it is the former but just want to confirm.
This parameter will only be used if your input format supports splitting the input files. Common compression codecs (such as gzip) don't support splitting the files, so this will be ignored.
If the input format does support splitting, then this relates to the compressed size.
From Hadoop 0.21 onwards, I think bz2 files are splittable, so you can use bz2.
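For what it's worth, a minimal sketch (assuming the newer mapreduce API; the class name is illustrative) of setting the minimum split size on a job; when the input format supports splitting, this size is measured against the compressed file as stored in HDFS:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MinSplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "min-split-size-demo");
        job.setInputFormatClass(TextInputFormat.class);
        // 128 MB minimum split size, measured against the compressed file size;
        // equivalent to mapred.min.split.size (renamed
        // mapreduce.input.fileinputformat.split.minsize in newer releases)
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.addInputPath(job, new Path(args[0]));  // e.g. a .bz2 input
        // ... set mapper/reducer and submit as usual ...
    }
}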

File writer filter creating a bigger AVI file then original

I am using the SampleGrabber filter to get the frames of an AVI file and alter them before writing them to another (new) AVI file using the File writer filter.
The problem that I am facing is that the new AVI file size is greater than the original file. I removed the SampleGrabber filter, thinking that it might be my code causing the problem, but the new file size is still greater than the original file. I tested it with GraphEdit.
The filters used were File reader->AVI Splitter->AVI Mux->File writer.
I really want to preserve the file size. Is there any other filter or property that I have to set? At the moment I am only adding the filters in GraphBuilder and rendering the file.
I am using DirectShowLib.Net.
I just did a quick test using
File source (async) -> AVI splitter -> AVI mux -> file writer
in GraphEdit, and the output file always seems to come out the same size as the input for me. The only thing I can think of is that your input file might be compressed; it might be worth inspecting the input file with an app like GSpot to determine that. As I understand it, DirectShow will sometimes insert appropriate filters in order to make a connection, so if you're trying to connect your file source to an AVI splitter it may insert a decompressor if needed. Hope that's of some use.
