Optimal size for zlib compression? - performance

I have seen threads about the minimum and maximum sizes for zlib compression. I wanted to know what people think is the optimal size for a compressed block of data that will ensure the best speed. Is there an advantage to splitting a file up into, say, multiple blocks?
Thank you.

Splitting data into blocks will only degrade the compression ratio, and is unlikely to improve speed.
The main idea behind "splitting into small blocks" is to improve access: say you want to read a segment of the file at position PX; then you immediately know it is stored in block BY = PX / BlockSize. Therefore, instead of decoding the whole file, you only decode that block.
And that's it. If you're looking for better speed, you'll have to use a different compression algorithm, such as Snappy or LZ4, which are known to have compression and decompression speeds several times faster than zlib.
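To make the block idea concrete, here is a small sketch in C using zlib's one-shot compress2()/uncompress() calls: each fixed-size block is compressed independently, so reading a byte at position PX only requires decoding block PX / BlockSize. The block size, variable names, and sample data are made up for illustration; a real format would also need to persist the compressed block sizes or offsets alongside the data. Compile with -lz.

    #include <stdio.h>
    #include <stdlib.h>
    #include <zlib.h>

    #define BLOCK_SIZE (64 * 1024)  /* 64 KiB of uncompressed data per block */

    int main(void)
    {
        /* Build some sample data: 4 blocks of repetitive text. */
        size_t data_len = 4 * BLOCK_SIZE;
        unsigned char *data = malloc(data_len);
        for (size_t i = 0; i < data_len; i++)
            data[i] = "hello zlib "[i % 11];

        size_t nblocks = (data_len + BLOCK_SIZE - 1) / BLOCK_SIZE;
        unsigned char **comp_blocks = malloc(nblocks * sizeof *comp_blocks);
        uLongf *comp_sizes = malloc(nblocks * sizeof *comp_sizes);

        /* Compress each block independently, so any block can be decoded alone. */
        for (size_t b = 0; b < nblocks; b++) {
            uLong src_len = (b + 1) * BLOCK_SIZE <= data_len
                            ? BLOCK_SIZE : data_len - b * BLOCK_SIZE;
            uLongf bound = compressBound(src_len);
            comp_blocks[b] = malloc(bound);
            comp_sizes[b] = bound;
            compress2(comp_blocks[b], &comp_sizes[b],
                      data + b * BLOCK_SIZE, src_len, Z_DEFAULT_COMPRESSION);
        }

        /* Random access: to read the byte at PX, decode only block PX / BLOCK_SIZE. */
        size_t PX = 3 * BLOCK_SIZE + 123;
        size_t BY = PX / BLOCK_SIZE;
        unsigned char out[BLOCK_SIZE];
        uLongf out_len = BLOCK_SIZE;
        uncompress(out, &out_len, comp_blocks[BY], comp_sizes[BY]);
        printf("byte at %zu = '%c'\n", PX, out[PX % BLOCK_SIZE]);

        for (size_t b = 0; b < nblocks; b++) free(comp_blocks[b]);
        free(comp_blocks); free(comp_sizes); free(data);
        return 0;
    }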

Related

PNG optimization and environment impact

PNG can be optimized in different ways.
Reducing PNG size for websites seems to be a good choice to cut bandwidth usage and data transfer, and thereby the environmental impact. It's recommended by Google and other services.
But does optimizing a PNG have an impact on CPU load when a computer reads it to render it? In other words, is reducing PNG size a good way to limit environmental impact?
I'm not sure that one question follows from the other, but, regarding the first one:
No, the effort spent optimizing the PNG compression, so as to make the image smaller, only affects the compressor side. On the client (decompressor) side, the difference in speed or CPU usage is practically zero.
To be more precise: PNG compression is mainly influenced by two factors: the "pixel filtering" algorithm/strategy (PNG-specific) and the zlib compression level. Optimizing the first one has no effect on decompression (the "unfiltering" logic is one and the same). The second factor also has little or no influence on decompression speed (it might even be slightly beneficial).
As @leonbloy stated in his answer, the zlib compression level has little or no effect on decompression speed. The PNG filter selection does, however, have a noticeable effect: the AVG and PAETH filters both require more memory and CPU time to defilter than the simpler ones.
Libpng allows you to control this via the "png_set_filter()" function. For maximum compression with adaptive filtering, use
png_set_filter(write_ptr, 0, PNG_ALL_FILTERS);
while to avoid the slower AVG and PAETH filtering, use
png_set_filter(write_ptr, 0, PNG_FILTER_NONE|PNG_FILTER_SUB|PNG_FILTER_UP);
or with libpng-1.6.22 and later, you can more conveniently use
png_set_filter(write_ptr, 0, PNG_FAST_FILTERS);
With pngcrush-1.8.1 or later, you can use
pngcrush -speed file.png file_pc.png
to select the NONE, SUB, and UP filters but avoid AVG and PAETH
(note that pngcrush also has a "-fast" option, but that is for a different purpose, namely to select fast compression methods).
When the output is intended to be transmitted over the net to an application, file size is the dominant cost and you'd want maximum compression. But when the file will be accessed from your local disk or from memory, decoding speed dominates and you'd want the "speed" optimization, which trades some increase in file size for faster decoding.
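For context, here is a minimal libpng write sketch showing where png_set_filter() and the zlib compression level plug in. The 64x64 gray test image and the out.png file name are made up; error handling is reduced to the usual setjmp pattern. Compile with -lpng -lz.

    #include <stdio.h>
    #include <png.h>

    int main(void)
    {
        const int w = 64, h = 64;
        FILE *fp = fopen("out.png", "wb");
        if (!fp) return 1;

        png_structp png = png_create_write_struct(PNG_LIBPNG_VER_STRING,
                                                  NULL, NULL, NULL);
        png_infop info = png_create_info_struct(png);
        if (!png || !info || setjmp(png_jmpbuf(png))) {  /* libpng error path */
            png_destroy_write_struct(&png, &info);
            fclose(fp);
            return 1;
        }
        png_init_io(png, fp);
        png_set_IHDR(png, info, w, h, 8, PNG_COLOR_TYPE_GRAY,
                     PNG_INTERLACE_NONE, PNG_COMPRESSION_TYPE_DEFAULT,
                     PNG_FILTER_TYPE_DEFAULT);

        /* The two knobs discussed above: filter selection and zlib level.
         * PNG_FAST_FILTERS (libpng 1.6.22+) is NONE|SUB|UP, cheaper to defilter. */
        png_set_filter(png, 0, PNG_FAST_FILTERS);
        png_set_compression_level(png, 9);   /* zlib effort only costs the writer */

        png_write_info(png, info);
        png_byte row[64];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++)
                row[x] = (png_byte)(x * 4);  /* simple gray ramp */
            png_write_row(png, row);
        }
        png_write_end(png, info);
        png_destroy_write_struct(&png, &info);
        fclose(fp);
        return 0;
    }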

Fast compression of large (100 MB) text file with byte values in it

I have a large text file, on the order of 100 MB, to compress. It has to be fast (12-14 seconds). What algorithms can I consider, and what compression ratio can I expect from them?
I found some compression algorithms like FLZP, SR2, ZPAQ, Fp8, LPAQ8, PAQ9A... Which of these are performant? The time limit is strict for me.
The algorithms you have picked are among the best-compressing in the world. Consequently, they are slow.
There are fast compression algorithms made for your use case. Names such as LZ4 and Snappy come up.
You have not defined what performance criterion you are looking for: more speed or more compression? LZ-based compressors (FLZP, LZO, LZ4, LZHAM, Snappy, ...) are the fastest. The PAQ compressors use context mixing for each bit, so they are slow but offer the best compression ratios. In between you can find things like Brotli and Zstd (both offer a wide range of options to tune speed vs. compression) or the older Bzip/Bzip2. Personally I like BCM for its great speed/compression compromise and its simple code: https://github.com/encode84/bcm.
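As a point of reference, LZ4's one-shot API is about as simple as it gets. The sketch below assumes the whole input fits in memory (for a 100 MB file you would either read it in full or use LZ4's frame/streaming API); the repetitive sample buffer is made up. Compile with -llz4.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <lz4.h>

    int main(void)
    {
        /* Some repetitive sample input standing in for the real file. */
        char src[100000];
        for (size_t i = 0; i < sizeof src; i++)
            src[i] = "abcabcabd"[i % 9];

        /* Compress in one shot. */
        int bound = LZ4_compressBound((int)sizeof src);
        char *comp = malloc(bound);
        int comp_len = LZ4_compress_default(src, comp, (int)sizeof src, bound);
        printf("compressed %zu -> %d bytes\n", sizeof src, comp_len);

        /* Decompress and verify the round trip. */
        char *back = malloc(sizeof src);
        int back_len = LZ4_decompress_safe(comp, back, comp_len, (int)sizeof src);
        printf("round trip ok: %d\n", back_len == (int)sizeof src
                                       && memcmp(src, back, sizeof src) == 0);

        free(comp); free(back);
        return 0;
    }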

Which is faster: in-memory decompression or accessing uncompressed data on HDD?

I have a dataset larger than main memory. After compression, it fits into memory. However, in-memory decompression is fairly compute-intensive.
Compared to accessing uncompressed data on a hard drive, does in-memory decompression have any advantage in terms of time-to-completion, assuming data from the HDD is loaded into memory in its entirety (i.e. no random access to the HDD during processing)? Has anyone done any benchmarks before? Thanks.
First, the data has to be compressible. If it is not, then compressing before writing to the HDD and decompressing after reading back will obviously be slower. Many files on a HDD are not compressible because they are already compressed, e.g. image files, video files, audio files, and losslessly compressed archives like zip or .tar.gz files.
If it is compressible, zlib decompression is likely to be faster than HDD reads, and lz4 decompression is very likely to be faster.
This is the classic sort of question which can only be correctly answered with "it depends" followed by "you need to measure it for your situation".
If you can decompress at least as fast as the HDD reads the data, and you decompress in parallel with the disk read, then reading the compressed data will almost always be faster (the read of the smaller file finishes sooner, and decompression only adds the latency of the last block).
According to this benchmark, a pretty weak CPU can decompress gzip at over 60 MB/s.
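If you want to check this for your own data, a rough sanity check is to time both paths directly. The sketch below reads an uncompressed file with fread and the compressed variant through zlib's gzFile API, which decompresses as it streams; the file names data.raw and data.raw.gz and the 1 MiB chunk size are placeholders, and this is a shape-of-the-measurement sketch rather than a rigorous benchmark (caches and I/O scheduling matter). Compile with -lz.

    #include <stdio.h>
    #include <time.h>
    #include <zlib.h>

    #define CHUNK (1 << 20)

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        static unsigned char buf[CHUNK];
        double t0, t1;
        size_t total = 0;

        /* Path 1: uncompressed file straight from disk. */
        t0 = now_sec();
        FILE *f = fopen("data.raw", "rb");
        if (f) {
            size_t n;
            while ((n = fread(buf, 1, CHUNK, f)) > 0) total += n;
            fclose(f);
        }
        t1 = now_sec();
        printf("plain read : %zu bytes in %.3f s\n", total, t1 - t0);

        /* Path 2: smaller compressed file, decompressed as it streams in. */
        total = 0;
        t0 = now_sec();
        gzFile g = gzopen("data.raw.gz", "rb");
        if (g) {
            int n;
            while ((n = gzread(g, buf, CHUNK)) > 0) total += (size_t)n;
            gzclose(g);
        }
        t1 = now_sec();
        printf("gzip stream: %zu bytes in %.3f s\n", total, t1 - t0);
        return 0;
    }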
This depends on your data, on how you're processing it, and the specs of your machine. A few considerations that make this almost impossible to answer without profiling your exact scenario:
how good is your compression? Different compression algorithms use differing amounts of CPU.
how is the data used? The amount of data that you need to buffer before processing will affect how much you can multi-thread between decompression and processing, which will massively affect your answer.
what's your environment? A 16-core server with 1TB of data to process is very different to a fancy phone with 1GB of data, but it's not clear from your question which you're dealing with (HDD suggests a computer rather than a phone at least, but server vs desktop is still relevant).
how much random access are you doing once the data is loaded? You suggest there'll be no random access to the HDD after loading, but if you're loading the full compressed data and only decompressing a portion at a time, the pattern of access to the data is important - you might have to decompress everything twice (or more!) to process it.
Ultimately this question is hugely subjective and, if you think the performance difference will be important, I'd suggest you create some basic test scenarios and profile heavily.
As a more specific example: if you're doing heavy-duty audio or visual processing, the process is CPU intensive but will typically accept a stream of data. In that scenario, compression would probably slow you down as the bottleneck will be the CPU.
Alternatively, if you're reading a billion lines of text from a file and counting the total number of vowels in each, your disk IO will probably be the bottleneck, and you would benefit from reducing the disk IO and working the CPU harder by decompressing the file.
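The vowel-counting case can be sketched directly: the per-byte work is trivial, so reading the smaller .gz file and decompressing on the fly tends to shift the bottleneck away from the disk. The file name lines.txt.gz is a placeholder. Compile with -lz.

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int main(void)
    {
        unsigned char buf[1 << 16];
        unsigned long long vowels = 0;
        int n;

        gzFile g = gzopen("lines.txt.gz", "rb");  /* decompresses as it reads */
        if (!g) return 1;
        while ((n = gzread(g, buf, sizeof buf)) > 0)
            for (int i = 0; i < n; i++)
                if (buf[i] != '\0' && strchr("aeiouAEIOU", buf[i]))
                    vowels++;
        gzclose(g);

        printf("vowels: %llu\n", vowels);
        return 0;
    }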
In our case, we optimized batch-processing code that would go through structured messages (read: tweets); after switching the representation from JSON to msgpack and mapping the entire files using mmap, we got to a state where it was clearly I/O-bound, with the speed of the magnetic disk being the limiting factor.
We found that the msgpacked messages, containing mostly UTF-8 text, could be compressed with ratios of 3-4 using LZ4; after switching to LZ4 decompression our optimized code was still I/O-bound, but the throughput increased significantly.
In your case, I'd start experimenting with LZ4.

How to estimate CSS text file size reduction when compressing with gzip

I'm writing a Mac app to analyse CSS files and estimate the size reduction when minified. I would also like to estimate the reduction in size obtained by HTTP compression using gzip. How can I do that? Is there any library that can help me?
You probably are better off simply doing the minification and gzipping and just presenting the difference. The time to actually do the work is not significant enough to estimate it by some other means. Even if you are analyzing 1000s of files at once, the user will likely have an expectation that the operation will take a bit of time to complete.
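In other words, the most reliable "estimate" is to run the real compression and measure. A small sketch using zlib's deflate with a gzip wrapper (windowBits 15 + 16, roughly what HTTP servers do at the default level) follows; the inline CSS string stands in for the real file contents. Compile with -lz.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <zlib.h>

    /* Returns the gzip-compressed size of buf, or 0 on error. */
    static size_t gzip_size(unsigned char *buf, size_t len)
    {
        z_stream s;
        memset(&s, 0, sizeof s);
        if (deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                         15 + 16,        /* 15-bit window, +16 = gzip header/trailer */
                         8, Z_DEFAULT_STRATEGY) != Z_OK)
            return 0;

        uLong bound = deflateBound(&s, (uLong)len);
        unsigned char *out = malloc(bound);
        if (!out) { deflateEnd(&s); return 0; }

        s.next_in = buf;   s.avail_in = (uInt)len;
        s.next_out = out;  s.avail_out = (uInt)bound;
        size_t result = 0;
        if (deflate(&s, Z_FINISH) == Z_STREAM_END)
            result = s.total_out;
        deflateEnd(&s);
        free(out);
        return result;
    }

    int main(void)
    {
        char css[] = "body { margin: 0; }  .hero { color: #333; margin: 0; }\n";
        size_t orig = sizeof css - 1;
        size_t gz = gzip_size((unsigned char *)css, orig);
        printf("original %zu bytes, gzipped %zu bytes (%.0f%% of original)\n",
               orig, gz, 100.0 * gz / orig);
        return 0;
    }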

I need to choose a compression algorithm

I need to choose a compression algorithm to compress some data. I don't know the type of data I'll be compressing in advance (think of it as kinda like the WinRAR program).
I've heard of the following algorithms but I don't know which one I should use. Can anyone post a short list of pros and cons? For my application the first priority is decompression speed; the second priority is space saved. Compression (not decompression) speed is irrelevant.
Deflate
Implode
Plain Huffman
bzip2
lzma
I ran a few benchmarks compressing a .tar that contained a mix of high entropy data and text. These are the results:
Name - Compression rate* - Decompression Time
7zip - 87.8% - 0.703s
bzip2 - 80.3% - 1.661s
gzip - 72.9% - 0.347s
lzo - 70.0% - 0.111s
*Higher is better
From this I came to the conclusion that the compression rate of an algorithm depends on its name; the first in alphabetical order will be the one with the best compression rate, and so on.
Therefore I decided to rename lzo to 1lzo. Now I have the best algorithm ever.
EDIT: worth noting that of all of them unfortunately lzo is the only one with a very restrictive license (GPL) :(
If you need high decompression speed then you should be using LZO. Its compression speed and ratio are decent, but it's hard to beat its decompression speed.
The Linux kernel documentation explains the trade-offs well (for the algorithms included in the kernel):
Deflate (gzip) - Fast, worst compression
bzip2 - Slow, middle compression
lzma - Very slow compression, fast decompression (however slower than gzip), best compression
I haven't used the others, so it is hard to say, but the speed of an algorithm may depend largely on the architecture. For example, there are studies showing that compressing data on the HDD speeds up I/O, because the processor is so much faster than the disk that it is worth it. However, it depends largely on where the bottlenecks are.
Similarly, one algorithm may use memory extensively, which may or may not cause problems (is 12 MiB a lot or very little? On embedded systems it is a lot; on a modern x86 it is a tiny fraction of memory).
Take a look at 7zip. It's open source and contains 7 separate compression methods. Some minor testing we've done shows the 7z format gives a much smaller result file than zip and it was also faster for the sample data we used.
Since our standard compression is zip, we didn't look at other compression methods yet.
For a comprehensive benchmark on text data you might want to check out the Large Text Compression Benchmark.
For other types, this might be indicative.
