I have written C code that creates a gzip-compressed file using zlib's deflateInit2, deflate, and deflateEnd. However, the output is different from the file produced by the 'gzip -f -n -9 file-name' command.
Is there any way to make the output exactly the same?
Additional details:
My implementation looks roughly like this:
deflateInit2(&strm, 9, Z_DEFLATED, 31, 9, Z_DEFAULT_STRATEGY);
deflate(&strm, Z_FINISH);
deflateEnd(&strm);
However, the result diverges partway through: the output from my API calls matches the original compressed image up to offset 0x90656 and differs after that.
The original compressed image is 6,739 KB, while the compressed image produced by my API calls is about 6,743 KB.
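For context, here is a simplified, self-contained version of what I am doing (assuming the whole file fits in a single in-memory buffer; error handling trimmed):

#include <string.h>
#include <zlib.h>

/* Simplified sketch: gzip-compress `in` (inlen bytes) into `out`.
   Level 9, windowBits 31 (gzip wrapper), memLevel 9. */
static int my_gzip(const unsigned char *in, size_t inlen,
                   unsigned char *out, size_t outsize, size_t *outlen)
{
    z_stream strm;
    int ret;

    memset(&strm, 0, sizeof(strm));
    ret = deflateInit2(&strm, 9, Z_DEFLATED, 31, 9, Z_DEFAULT_STRATEGY);
    if (ret != Z_OK)
        return ret;

    strm.next_in = (Bytef *)in;
    strm.avail_in = (uInt)inlen;
    strm.next_out = out;
    strm.avail_out = (uInt)outsize;

    ret = deflate(&strm, Z_FINISH);     /* Z_STREAM_END if out was big enough */
    *outlen = strm.total_out;
    deflateEnd(&strm);
    return ret == Z_STREAM_END ? Z_OK : Z_BUF_ERROR;
}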
It is generally said that a compression format like gzip, when used together with a container file format such as Avro or SequenceFile, makes the compressed data splittable.
Does this mean that the blocks of the container format get compressed with the chosen codec (such as gzip), or does something else happen? Can someone please explain this? Thanks!
Well, I think the question requires an update.
Update:
Is there a straightforward way to convert a large file in a non-splittable compression format (like gzip) into a splittable file (using a container file format such as Avro, SequenceFile, or Parquet) so it can be processed by MapReduce?
Note: I am not asking for workarounds such as uncompressing the file and then re-compressing the data with a splittable compression format.
For SequenceFiles, if you specify BLOCK compression, each block is compressed with the specified compression codec. Blocks allow Hadoop to split the data at block boundaries, even though the compression itself isn't splittable, and to skip whole blocks without needing to decompress them.
Most of this is described on the Hadoop wiki: https://wiki.apache.org/hadoop/SequenceFile
Block compressed key/value records - both keys and values are
collected in 'blocks' separately and compressed. The size of the
'block' is configurable.
For Avro this is all very similar as well: https://avro.apache.org/docs/1.7.7/spec.html#Object+Container+Files
Objects are stored in blocks that may be compressed. Synchronization
markers are used between blocks to permit efficient splitting of files
for MapReduce processing.
Thus, each block's binary data can be efficiently extracted or skipped
without deserializing the contents.
The easiest (and usually fastest) way to convert data from one format into another is to let MapReduce do the work for you. In the example of:
GZip Text -> SequenceFile
You would run a map-only job that uses TextInputFormat for input and SequenceFileOutputFormat for output. This way you get a 1-to-1 conversion in the number of files (add a reduce step if this needs changing), and the conversion runs in parallel if there are lots of files to convert.
I am not sure this is exactly what you are talking about... but any file can be split at any point.
Why do I say this? I am assuming you are using something like Linux or similar.
On Linux it is reasonably easy to create a block device that is actually stored as the concatenation of several files.
I mean:
You split a file into as many chunks as you want, each of whatever size you like; the sizes need not be equal, or even, or a multiple of 512 bytes, etc. Any chunk size at all works.
You define a block device that concatenates all the files in the correct order
You define a symbolic link to that device
That way you can have a BIG file (more than 4 GiB, even more than 16 GiB) stored on a FAT32 partition (which has a limit of 4 GiB minus 1 byte per file)... and access it on the fly and transparently, at least for reading.
For read/write access there is a trick (this is the complex part) that works:
Split the file (this time into chunks of N*512 bytes)
Define a parametrized device driver (so it knows how to allocate more chunks by creating more files)
On Linux I have used command-line tools in the past that do all of this work; they let you create a virtual container, resizable on the fly, that uses files of an exact size (including the last one) and exposes it as a regular block device (where you can do a dd if=... of=... to fill it), with a virtual file associated with it.
That way you have:
Several not-so-big files of identical size
They hold the real data of the stream
They are created / deleted as needed (grow / shrink or truncate)
They are exposed as a regular file at some point in the filesystem
Accessing that file behaves as if you were accessing the concatenation
Maybe that gives you an idea for another approach to the problem you are having:
Instead of tweaking the compression system, put in a layer (a little more complex than a simple loop device) that does the split/join on the fly and transparently.
Such tools exist; I do not remember their names, sorry! But I remember the read-only one (the dvd_double_layer.* files are on a FAT32 partition):
# cd /mnt/FAT32
# ls -lh dvd_double_layer.*
total #
-r--r--r-- 1 root root 3.5G 2017-04-20 13:10 dvd_double_layer.000
-r--r--r-- 1 root root 3.5G 2017-04-20 13:11 dvd_double_layer.001
-r--r--r-- 1 root root 0.2G 2017-04-20 13:12 dvd_double_layer.002
# affuse dvd_double_layer.000 /mnt/transparent_concatenated_on_the_fly
# cd /mnt/transparent_concatenated_on_the_fly
# ln -s dvd_double_layer.000.raw dvd_double_layer.iso
# ls -lh dvd_double_layer.*
total #
-r--r--r-- 1 root root 7.2G 2017-04-20 13:13 dvd_double_layer.000.raw
-r--r--r-- 1 root root 7.2G 2017-04-20 13:14 dvd_double_layer.iso
Hope this idea can help you.
I was wondering, is there any way to write an array of bytes as a whole data block (a number of tags) to a .tiff file using libTIFF?
As far as I know, a .tiff file is not streamable because the data block (IFD) can be located anywhere in the file. But, as I understand it, the data inside that block is written in a predefined order. What I am trying to accomplish is to write the whole EXIF properties byte block from a JPEG file into the "Exif IFD" of my TIFF.
So, is there any function, like TIFFSetField(), that populates a whole data block (IFD) at once?
You can write a TIFF data block of your own, and inside that block the data are formatted in a "private" format (as is the case, if memory serves, with RichTIFFIPTC). What you cannot do is send several tags to the TIFF object and expect them to end up in any particular order.
I believe that Photoshop, among others, always writes a fixed length data object as a single tag, and then rewrites its innards at leisure.
Because the EXIF tag collection and the TIFF tag collection overlap, you cannot do this and still have your tags readable by libTIFF, though:
[tag1],[tag2],[tag3] ---> [privateTiffLongObject] --> not re-readable
[tag1],[tag2],[tag3] ---> [Tiff2],[Tiff3],[Tiff1] --> re-readable
That said, what is it that you're trying to accomplish? To simply shuttle tags from a JPEG file to a TIFF file, I'd venture that exiftool should be enough. I have often employed a workflow like the following:
(image) --> exiftool --> XML --> XML parsers -->
--> exiftool --> (new image)
Of course, if you need to do this for a large batch of images, performance may become a problem. That issue can be tackled more easily with RAM disks and SSD devices, though.
"Hacking" the TIFF format might leave you with files that are efficiently written and correctly handled by the software tools you now have, but won't be compatible with some other tool elsewhere -- and this might be discovered after you've done weeks of work.
Apparently not. The documentation states:
The TIFF specification requires that all information except an 8-byte
header can be placed anywhere in a file. In particular, it is
perfectly legitimate for directory information to be written after the
image data itself. Consequently TIFF is inherently not suitable for
passing through a stream-oriented mechanism such as UNIX pipes.
Software that require that data be organized in a file in a particular
order (e.g. directory information before image data) does not
correctly support TIFF. libtiff provides no mechanism for controlling
the placement of data in a file; image data is typically written
before directory information.
I have thousands (or more) of gzipped files in a directory (on a Windows system), and one of my tools consumes those gzipped files. If it encounters a corrupt gzip file, it silently ignores it instead of raising an alarm.
I have been trying to write a Perl program that loops through each file and makes a list of files which are corrupt.
I am using the Compress::Zlib module and have tried reading the first 1 KB of each file, but that did not work, since some of the files are corrupted towards the end (verified during a manual extract, where the alarm is raised only near the end), so reading the first 1 KB doesn't reveal a problem. I am wondering whether a CRC check of these files would be of any help.
Questions:
Will CRC validation work in this case? If yes, how does it work? Is the true CRC part of the gzip header, and are we supposed to compare it with a CRC calculated from the file we have? How do I accomplish this in Perl?
Are there any other simpler ways to do this?
In short, the only way to check a gzip file is to decompress it until you get an error, or get to the end successfully. You do not however need to store the result of the decompression.
The CRC stored at the end of a gzip file is the CRC of the uncompressed data, not the compressed data. To use it for verification, you have to decompress all of the data. This is what gzip -t does, decompressing the data and checking the CRC, but not storing the uncompressed data.
Often a corruption in the compressed data will be detected before getting to the end. But if not, then the CRC, as well as a check against an uncompressed length also stored at the end, will with a probability very close to one detect a corrupted file.
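If it helps to see what that check looks like at the zlib level, here is a rough C sketch (buffer sizes are arbitrary and it checks a single gzip member only); it streams the file through inflate() with the gzip wrapper, throws the output away, and lets zlib verify the stored CRC-32 and length at the end:

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Rough sketch: return 0 if `path` is a valid gzip file, -1 otherwise.
   zlib verifies the stored CRC-32 and uncompressed length when it
   reaches Z_STREAM_END; the decompressed data itself is discarded. */
static int check_gzip(const char *path)
{
    unsigned char in[16384], out[16384];
    z_stream strm;
    FILE *f = fopen(path, "rb");
    int ret = Z_OK;

    if (f == NULL)
        return -1;
    memset(&strm, 0, sizeof(strm));
    if (inflateInit2(&strm, 15 + 16) != Z_OK) {   /* 15+16 = gzip wrapper */
        fclose(f);
        return -1;
    }
    do {
        strm.avail_in = (uInt)fread(in, 1, sizeof(in), f);
        if (strm.avail_in == 0)
            break;                                /* premature end of file */
        strm.next_in = in;
        do {
            strm.next_out = out;                  /* output is thrown away */
            strm.avail_out = sizeof(out);
            ret = inflate(&strm, Z_NO_FLUSH);
        } while (ret == Z_OK && strm.avail_out == 0);
    } while (ret == Z_OK);
    inflateEnd(&strm);
    fclose(f);
    return ret == Z_STREAM_END ? 0 : -1;          /* anything else is corrupt */
}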
The Archive::Zip FAQ gives some very good guidance on this.
It looks like the best option for you is to check the CRC of each member of the archives, and a sample program that does this -- ziptest.pl -- comes with the Archive::Zip module installation.
It should be easy to test whether the file is corrupt by just using the "gunzip -t" command. gunzip is available for Windows as well and should come with the gzip package.
As I understand it, an index file is needed to make the output splittable. If mapred.output.compression.type=SequenceFile.CompressionType.RECORD, do we still need to create an index file?
Short answer:
The RECORD and BLOCK compression.type values apply to sequence files, not to plain text files (which can be independently compressed with lzo or gzip or bz2 ...).
More info:
LZO is a compression codec which gives better compression and decompression speed than gzip, plus the capability to split. LZO allows this because it is composed of many smaller (~256K) blocks of compressed data, allowing jobs to be split along block boundaries, as opposed to gzip, where the whole file is a single compressed stream that must be read from the beginning.
When you specify mapred.output.compression.codec as LzoCodec, Hadoop will generate .lzo_deflate files. These contain the raw compressed data without any header and cannot be decompressed with the lzop -d command. Hadoop can read these files in the map phase, but this makes your life hard.
When you specify LzopCodec as the compression codec, Hadoop will generate .lzo files. These contain the header and can be decompressed using lzop -d.
However, neither .lzo nor .lzo_deflate files are splittable by default. This is where LzoIndexer comes into play. It generates an index file which records where the block boundaries are. With the index, multiple map tasks can process the same file.
See this cloudera blog post and LzoIndexer for more info.
I've been doing some research on compression-based text classification, and I'm trying to figure out a way to store the dictionary built by the encoder (on a training file) so it can later be applied 'statically' to a test file. Is this at all possible using the UNIX gzip utility?
For example, I have been using two 'class' files, sport.txt and atheism.txt, so I want to run compression on both of these files and store the dictionaries they used. Next I want to take a test file (which is unlabelled and could be either atheism or sport) and, using the prebuilt dictionaries on this test.txt, analyse how well it compresses under each dictionary/model.
Thanks
deflate encoders, as in gzip and zlib, do not "build" a dictionary. They simply use the previous 32K bytes as a source of potential matches for the string of bytes starting at the current position. Those last 32K bytes are called the "dictionary", but the name is perhaps misleading.
You can use zlib to experiment with preset dictionaries. See the deflateSetDictionary() and inflateSetDictionary() functions. In that case, zlib compression is primed with a "dictionary" of 32K bytes that effectively precede the first byte being compressed as a source for matches, but the dictionary itself is not compressed. The priming can only improve the compression of the first 32K bytes. After that, the preset dictionary is too far back to provide matches.
gzip provides no support for preset dictionaries.
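A minimal sketch of using deflateSetDictionary(), assuming the dictionary bytes and buffers are supplied by the caller (note this produces a zlib-wrapped stream, not a gzip file):

#include <string.h>
#include <zlib.h>

/* Rough sketch: compress `text` with a preset dictionary. The same
   dictionary bytes must be given to inflateSetDictionary() when
   decompressing (the stream stores only the dictionary's Adler-32). */
static int compress_with_dictionary(const unsigned char *dict, uInt dictlen,
                                    const unsigned char *text, uInt textlen,
                                    unsigned char *out, uInt outsize,
                                    uLong *outlen)
{
    z_stream strm;
    int ret;

    memset(&strm, 0, sizeof(strm));
    if (deflateInit(&strm, Z_BEST_COMPRESSION) != Z_OK)   /* zlib wrapper, not gzip */
        return Z_STREAM_ERROR;
    if (deflateSetDictionary(&strm, dict, dictlen) != Z_OK) {  /* prime the 32K window */
        deflateEnd(&strm);
        return Z_STREAM_ERROR;
    }

    strm.next_in = (Bytef *)text;
    strm.avail_in = textlen;
    strm.next_out = out;
    strm.avail_out = outsize;
    ret = deflate(&strm, Z_FINISH);
    *outlen = strm.total_out;
    deflateEnd(&strm);
    return ret == Z_STREAM_END ? Z_OK : Z_BUF_ERROR;
}

On the decompression side, inflate() returns Z_NEED_DICT, after which you call inflateSetDictionary() with the same bytes. For the classification idea, you could build one dictionary per class and compare how well the test file compresses with each.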