I am looking for an algorithm that could compress binary data, not to smallest size, but to specified size. For example, if I have uncompressed data that comes in various 1.2, 1.3, 1.4 ... KB sizes, I would specify "compress to 1 KB" and assuming the data could be compressed to .65, .74, .83 KB sizes, the algorithm would stop at 1 KB, and return this standard size, leaving some entropy in the data. I could not pinpoint an algorithm for that. Does one exist?
You can ZIP and pad with zeroes, but in some cases the data is highly random and even very efficient compression algorithms cannot compress it at all, because there is no correlation in the data. Since you get no compression in those cases, guaranteeing compression down to a specific size is not possible.
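The ZIP-and-pad idea is easy to sketch. Below is a minimal example using zlib's compress(); the helper name and interface are mine, not from the question, and it simply fails when the data will not shrink below the requested size:

#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* Illustrative helper: deflate `src` with zlib and pad the result with
 * zeroes up to `target` bytes. Returns 0 on success, -1 if even the
 * compressed form does not fit into `target` bytes. */
static int compress_to_fixed_size(const unsigned char *src, uLong srclen,
                                  unsigned char *dst, uLong target)
{
    uLongf dstlen = compressBound(srclen);
    unsigned char *tmp = malloc(dstlen);
    if (!tmp)
        return -1;

    if (compress(tmp, &dstlen, src, srclen) != Z_OK || dstlen > target) {
        free(tmp);
        return -1;   /* data too random: no guarantee of reaching the target size */
    }

    memcpy(dst, tmp, dstlen);
    memset(dst + dstlen, 0, target - dstlen);   /* leave the rest as zero padding */
    free(tmp);
    return 0;
}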
It is not possible.
Proof: Some combinations of data can't be losslessly compressed, so those cases fail your requirement of meeting a fixed size, assuming that size is smaller than the original data.
If you can accept lossy compression, it is very possible and happens all the time for some video formats (a fixed-size container per unit of time; the compression adjusts to maximize usage of that space).
How much can a file that contains 1000 bits, where 1 appears with a probability of 10% and 0 with a probability of 90%, be compressed with a Huffman code?
Maybe a factor of two.
But only if you do not include the overhead of sending the description of the Huffman code along with the data. For 1000 bits, that overhead will dominate the problem, and determine your maximum compression ratio. I find that for that small of a sample, 125 bytes, general-purpose compressors get it down to only around 100 to 120 bytes, due to the overhead.
A custom Huffman code just for this on bytes from such a stream gives a factor of 2.10, assuming the other side already knows the code. The best you could hope for is the entropy, e.g. with an arithmetic code, which gives 2.13.
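For reference, that 2.13 figure is just the inverse of the binary entropy of a bit that is 1 with probability 0.1. A small self-contained calculation (plain C; nothing here comes from the question itself):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double p = 0.10;                                      /* probability of a 1 bit */
    double h = -p * log2(p) - (1 - p) * log2(1 - p);      /* entropy in bits per bit */

    printf("entropy: %.3f bits per bit\n", h);            /* ~0.469 */
    printf("best compression factor: %.2f\n", 1.0 / h);   /* ~2.13 */
    printf("1000 bits -> about %.0f bits minimum\n", 1000 * h);
    return 0;
}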
I'm currently working on implementing LZW compression and decompression methods from the FFmpeg source code in my project. What I stumbled upon is that the size of the output buffer (where the compressed data will be stored) needs to be bigger than the size of the input buffer that we want to compress. Isn't that contradictory to compression itself?
The following part of the code is located in the ff_lzw_encode() function, which is part of the lzwenc.c source file.
/* Require at least (insize * 3) / 2 bytes of free output space,
   i.e. 1.5x the input size as headroom for data that does not compress. */
if (insize * 3 > (s->bufsize - s->output_bytes) * 2)
{
    printf("Size of output buffer is too small!\n");
    return -1;
}
For my particular example, I'm trying to compress raw video frames before sending them locally. But if I allocate memory for a buffer of size (insize * 3) / 2 (where the compressed data will be stored), wouldn't that take more time to send using the send() function than sending the raw buffer, which is of size insize?
You cannot guarantee that the 'compressed' form is smaller than or even equal in size to the input. Think about the worst case of purely random data, which cannot be compressed in any way and, at best, will be 'compressed' to 100% of its original size; on top of that, some compression metadata or escape sequences need to be added, resulting in e.g. 100% + 5 bytes.
In fact, 'compressing' incompressible data to "only" 100% of its original size usually does not happen automatically. If the algorithm just tries to compress the input normally, the result may even be significantly larger than the input. Smart compression tools detect this situation and fall back to sending that chunk of data uncompressed instead, adding some metadata to at least indicate that the chunk is uncompressed.
The buffer you have allocated must be large enough to contain the worst case number of 'compressed' bytes, hence the need for some 'headroom'.
wouldn't that take more time to send using the send() function than sending the raw buffer
Yes, it would. That's why you don't send the whole (allocated) buffer but only as many bytes from that buffer as the compression function indicates it has used.
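As a hedged sketch of that last point: allocate the worst-case buffer, but pass only the used byte count to send(). zlib's compress() stands in here for the LZW encoder, since the exact ff_lzw_encode() call setup isn't shown in the question:

#include <stdlib.h>
#include <sys/socket.h>
#include <zlib.h>

/* Sketch: compress one frame into a worst-case-sized buffer, then send only
 * the bytes the compressor actually produced, not the whole allocation. */
static int send_compressed_frame(int sock, const unsigned char *frame, uLong insize)
{
    uLongf outsize = compressBound(insize);   /* worst case, slightly larger than insize */
    unsigned char *out = malloc(outsize);
    if (!out)
        return -1;

    if (compress(out, &outsize, frame, insize) != Z_OK) {
        free(out);
        return -1;
    }

    /* outsize now holds the number of bytes actually used */
    ssize_t sent = send(sock, out, outsize, 0);
    free(out);
    return sent < 0 ? -1 : 0;
}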
One byte is used to store each of the three color channels in a pixel. This gives 256 different levels each of red, green and blue. What would be the effect of increasing the number of bytes per channel to 2 bytes?
2^16 = 65536 values per channel.
The raw image size doubles.
Processing the file takes roughly 2 times as long ("roughly", because you have more data, but then again this new data size may be better suited to your CPU and/or memory alignment than the previous 3-byte groups; "3" is an awkward data size for CPUs).
Displaying the image on a typical screen may take more time (where "a typical screen" is 24- or 32-bit and would as yet not have hardware acceleration for this particular job).
Chances are you cannot use the original data format to store the image back into. (Currently, TIFF is the only file format I know that routinely uses 16 bits/channel. There may be more. Can yours?)
The image quality may degrade. (If you add bytes, you cannot automatically set them to a sensible value. If 3 bytes of 0xFF signified 'white' in your original image, what would be the comparable 16-bit value: 0xFFFF or 0xFF00? And why? Whichever you choose, remember that you have to make a similar choice for black; one common convention is shown in the sketch after this answer.)
Common library routines may stop working correctly. Only the very best libraries are data size-ignorant (and they'd still need to be rewritten to make use of this new size.)
If this is a real world scenario -- say, I just finished writing a fully antialiased graphics 2D library, and then my boss offhandedly adds this "requirement" -- it'd have a particular graphic effect on me as well.
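Regarding the 0xFF question above: one common convention (not the only one) is to widen a channel by replicating the byte, i.e. multiplying by 257, so that 0x00 stays black and 0xFF becomes full-scale 0xFFFF. A minimal sketch:

#include <stdint.h>
#include <stdio.h>

/* Scale an 8-bit channel value to 16 bits by replicating the byte
 * (v * 257 == (v << 8) | v), so 0x00 -> 0x0000 and 0xFF -> 0xFFFF. */
static uint16_t scale8to16(uint8_t v)
{
    return (uint16_t)(v * 257);
}

int main(void)
{
    printf("0xFF -> 0x%04X\n", scale8to16(0xFF));   /* 0xFFFF: full white stays full white */
    printf("0x80 -> 0x%04X\n", scale8to16(0x80));   /* 0x8080: mid-gray stays mid-gray */
    return 0;
}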
I have a web host that gives a maximum memory_limit of 80M (i.e. ini_set("memory_limit","80M");).
I'm using a photo upload that uses the function imagecreatefromjpeg();
When I upload large images, it gives the error
"Fatal error: Allowed memory size of 83886080 bytes exhausted"
What maximum size (in bytes) for the image can I restrict the users to?
Or does the memory_limit depend on some other factor?
The memory size of 8388608 is 8 Megabytes, not 80. You may want to check whether you can still increase the value somewhere.
Other than that, the general rule for image manipulation is that it will take at least
image width x image height x 3
bytes of memory to load or create an image. (One byte for red, one byte for green, one byte for blue, possibly one more for alpha transparency)
By that rule, a 640 x 480 pixel image will need at least 640 x 480 x 3 = 921,600 bytes (roughly 0.9 megabytes) of space - not including overhead and the space occupied by the script itself.
It's impossible to determine a limit on the JPG file size in bytes because JPG is a compressed format with variable compression rates. You will need to go by image resolution and set a limit on that.
If you don't have that much memory available, you may want to look into more efficient methods of doing what you want, e.g. tiling (processing one part of an image at a time) or, if your provider allows it, using an external tool like ImageMagick (which consumes memory as well, but outside the PHP script's memory limit).
Probably your script uses more memory than just the image itself. Try debugging your memory consumption.
One quick-and-dirty way is to call memory_get_usage and memory_get_peak_usage at certain points in your code, and especially in a custom error_handler and shutdown_function. This can tell you which exact operations cause the memory exhaustion.
I am trying to calculate an initial buffer size to use when decompressing data of an unknown size. I have a bunch of data points from existing compression streams but don't know the best way to analyze them.
Data points are the compressed size and the ratio to uncompressed size.
For example:
100425 (compressed size) x 1.3413 (compression ratio) = 134,700 (uncompressed size)
The compressed data stream doesn't store the uncompressed size, so the decompressor has to allocate an initial buffer and realloc if it overflows. I'm looking for the "best" initial size to allocate for the buffer given the compressed size. I have over 293,000 data points.
Given that you have a lot of data points on how your compression behaves, I'd recommend analyzing them to get a mean compression ratio and a standard deviation. Then set your initial buffer size to the compressed size multiplied by the ratio that lies 2 standard deviations above the mean; if the ratios are roughly normally distributed, that will make the buffer large enough in about 97% of your cases. If you want the buffer to avoid reallocation in even more cases, increase the number of standard deviations above the mean that you allocate for.
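A sketch of that idea in C (the function shape and the 2-standard-deviation cutoff are illustrative, and the ratios are assumed to be roughly normally distributed):

#include <math.h>
#include <stddef.h>

/* Compute mean and standard deviation of the observed compression ratios,
 * then size the initial buffer as compressed_size * (mean + 2 * stddev). */
static size_t initial_buffer_size(const double *ratios, size_t n, size_t compressed_size)
{
    if (n == 0)
        return compressed_size;   /* no history yet: fall back to the compressed size */

    double sum = 0.0, sumsq = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += ratios[i];
        sumsq += ratios[i] * ratios[i];
    }
    double mean = sum / n;
    double stddev = sqrt(sumsq / n - mean * mean);

    return (size_t)ceil(compressed_size * (mean + 2.0 * stddev));
}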
One simple method is to use a common initial decompression buffer size and double it at each realloc. This geometric-growth strategy is also what many dynamic array implementations use.
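For example, with a hypothetical decompress_chunk() that returns -1 when the output buffer is too small, the doubling strategy looks roughly like this:

#include <stdlib.h>

/* Hypothetical decompressor interface: returns the number of bytes written,
 * or -1 if `outsize` was too small. */
extern long decompress_chunk(const unsigned char *in, size_t insize,
                             unsigned char *out, size_t outsize);

/* Grow the output buffer geometrically until the whole stream fits. */
static unsigned char *decompress_with_doubling(const unsigned char *in, size_t insize,
                                               size_t *outlen)
{
    size_t cap = 64 * 1024;                  /* common starting size; tune to your data */
    unsigned char *out = malloc(cap);

    while (out) {
        long written = decompress_chunk(in, insize, out, cap);
        if (written >= 0) {
            *outlen = (size_t)written;
            return out;
        }
        cap *= 2;                            /* too small: double and retry */
        unsigned char *tmp = realloc(out, cap);
        if (!tmp) {
            free(out);
            return NULL;
        }
        out = tmp;
    }
    return NULL;
}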