Efficient way to encode bit-vectors? - performance

I'm currently using run-length encoding to encode bit-vectors, and the current run time is
2 log(i), where i is the size of the run. Is there another way of doing it that brings it down to log(i)?
Thanks.

The most efficient way of encoding a bit vector is to exploit any specific properties of the bit source. If it is totally random, there is no real gain to be had (in fact, a totally random stream of bits cannot be compressed in any way).
If you can find properties in your bit stream, you could try to define a collection of vectors that form the basis of a vector space. In that case, the result will be very efficient.
We'll need a few more details on your bit stream.
(Edit)
Just a few more details to understand the previous statement:
"a totally random stream of bits cannot be compressed in any way"
It is not possible to compress a totally random vector of bits if by "compress" we mean the size of the "transformed/compressed stream" plus the "vector basis definition" plus the decompression program. But in most cases the decompression program (and often the vector basis too) is embedded in client software, so only the "compressed stream" needs to be stored or transmitted.
A good explanation (and funny story) about that is Patrick Craig's $5000 compression challenge.
More scientific: information theory, especially the section on entropy.
And, finally, the full story.
But whatever the solution is, if you have an unknown number of unknown streams to compress, you won't be able to do anything. You have to find a pattern.

Related

Low Level JPEG Encoding

I am trying to implement a JPEG encoder using the lowest-level operations possible. It is relatively doable up to Huffman encoding, where most tutorials use pointers and binary trees to build the table and encode the image. Can someone more familiar with the JPEG standard point me in the direction of the simplest compression technique that I should try to implement with low-level operations (+, -, *, shifts, loops, if statements)?
I have heard there are standard Huffman tables (I can't really find one); are these typically a good idea to use? After I use a standard Huffman table, how simple is it to encode an 8x8 chunk with it? I stopped here, because I didn't want to go down a rabbit hole.
The standard will answer all of your questions. There are "typical" Huffman codes provided there in appendix K.3, along with the specifications for those tables. You can just hard-code those into your implementation and you will get good performance.
Since the Huffman codes are given, you would not need to implement the Huffman algorithm, which is what needs the pointers and trees you are referring to. (Not terribly complicated, so it is something you should tackle later, once you get the pre-defined codes working.)
You can just use the code table and a set of operators, which also needs to include bitwise or (|) and bitwise and (&) along with the others in your list. The encoding process is simple: keep a bit buffer in an integer, accumulate code bits into it with the shift and or operators, and pull bytes out of the buffer to write to the output whenever it holds eight or more bits.
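Here is a minimal sketch of that bit buffer in Java (the class and method names are made up for illustration, and a real encoder would also need the 0xFF byte-stuffing rule the standard describes):

    public class BitWriter {
        private int buffer = 0;   // accumulated code bits, most recent code in the low bits
        private int count  = 0;   // number of valid bits currently in the buffer

        // Append 'length' bits of 'code' (most significant bit first) to the stream.
        void putBits(int code, int length) {
            buffer = (buffer << length) | (code & ((1 << length) - 1));
            count += length;
            while (count >= 8) {                          // flush whole bytes
                writeByte((buffer >>> (count - 8)) & 0xFF);
                count -= 8;
            }
        }

        void writeByte(int b) {
            // placeholder sink; a real encoder writes to the output stream here
            System.out.printf("%02X ", b);
        }
    }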

Text Compression - What algorithm to use

I need to compress some text data of the form
[70,165,531,0|70,166,562|"hi",167,578|70,171,593|71,179,593|73,188,609|"a",1,3|
The data contains a few thousand characters (10,000 - 50,000 approx).
I read up on the various compression algorithms, but cannot decide which one to use here.
The important thing here is: the compressed string should contain only alphanumeric characters (or a few special characters like +-/&%#$..). I mean, most algorithms produce gibberish ASCII characters as compressed data, right? That must be avoided.
Can someone guide me on how to proceed here?
P.S. The text contains numbers, ' and the | character predominantly. Other characters occur very, very rarely.
Actually, your requirement to limit the output character set to printable characters automatically costs you about 25% of your compression gain, as out of 8 bits per byte you'll end up using roughly 6.
But if that's what you really want, you can always Base64-encode (or, more space-efficiently, Base85-encode) the output to convert the raw byte stream back to printable characters.
Regarding the compression algorithm itself, stick to one of the better-known ones like gzip or bzip2; well-tested open-source code exists for both.
Selecting "the best" algorithm is actually not that easy, here's an excerpt of the list of questions you have to ask yourself:
do i need best speed on the encoding or decoding side (eg bzip is quite asymmetric)
how important is memory efficiency both for the encoder and the decoder? Could be important for embedded applications
is the size of the code important, also for embedded
do I want pre existing well tested code for encoder or decorder or both only in C or also in another language
and so on
The bottom line here is probably: take a representative sample of your data, run some tests with a couple of existing algorithms, and benchmark them on the criteria that are important for your use case.
Just one thought: You can solve your two problems independently. Use whatever algorithm gives you the best compression (just try out a few on your kind of data. bz2, zip, rar -- whatever you like, and check the size), and then to get rid of the "gibberish ascii" (that's actually just bytes there...), you can encode your compressed data with Base64.
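As a minimal sketch of that two-step approach in Java, using the JDK's built-in Deflater and Base64 (any other compressor would slot in the same way; the sample string is just illustrative):

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import java.util.zip.Deflater;

    public class CompressThenEncode {
        static String pack(String text) {
            Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
            deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));   // raw compressed bytes
            }
            deflater.end();
            return Base64.getEncoder().encodeToString(out.toByteArray()); // printable characters only
        }

        public static void main(String[] args) {
            String sample = "[70,165,531,0|70,166,562|\"hi\",167,578|70,171,593|";
            System.out.println(pack(sample));
        }
    }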
If you really put much thought into it, you might find a better algorithm for your specific problem, since you only use a few different chars, but if you stumble upon one, I think it's worth a try.

Reverse "jpeg" compression algorithm?

I have to write a tool that manages very large data sets (well, large for an ordinary workstation). I basically need something that works the opposite way to the JPEG format: the dataset must be intact on disk, where it can be arbitrarily large, but it needs to be lossily compressed when it gets read into memory, and only the sub-part used at any given time needs to be decompressed on the fly. I have started looking at IPP (Intel Integrated Performance Primitives), but it's not really clear for now whether I can use them for what I need to do.
Can anyone point me in the right direction?
Thank you.
Given the nature of your data, it seems you are handling some kind of raw sample.
So the easiest and most generic "lossy" technique will be to drop the lower bits, reducing precision, up to the level you want.
Note that you will need to "drop the lower bits", which is quite different from "round to the next power of 10". Computers work in base 2, and you want all your lower bits to be zero for compression to perform as well as possible. This method assumes that the selected compression algorithm will make use of the predictable zero-bit pattern.
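A minimal sketch of that bit-dropping step in Java (the number of bits to drop, 4 here, is just an illustrative choice):

    public class PrecisionReducer {
        // Zero out the k least significant bits of a sample, sacrificing precision
        // so the compressor behind it sees long runs of predictable zero bits.
        static int dropLowBits(int sample, int k) {
            return (sample >> k) << k;
        }

        public static void main(String[] args) {
            int[] samples = {12345, 12351, 12360, 12412};
            for (int s : samples) {
                System.out.printf("%d -> %d%n", s, dropLowBits(s, 4));
            }
        }
    }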
Another method, more complex and more specific, could be to convert your values into indices into a table. The advantage is that you can "target" precision where you want it. The obvious drawback is that the table will be specific to one distribution pattern.
On top of that, you may also store not the value itself, but the delta of the value with its preceding one if there is any kind of relation between them. This will help compression too.
For the data to compress well, you will need to "group" it into packets of appropriate size, such as 64 KB; on a single field, no compression algorithm will give you suitable results. This, in turn, means that each time you want to access a field, you need to decompress the whole packet, so tune the packet size depending on what you want to do with the data. Sequential access is easier to deal with in such circumstances.
Regarding compression algorithm, since these data are going to be "live", you need something very fast, so that accessing the data has very small latency impact.
There are several open-source alternatives out there for that use. For easier license management, I would recommend a BSD-licensed alternative. Since you use C++, the following ones look suitable:
http://code.google.com/p/snappy/
and
http://code.google.com/p/lz4/

Good compression algorithm for small chunks of data? (around 2k in size)

I have a system where one machine generates small chunks of data in the form of objects containing arrays of integers and longs. These chunks get passed to another server, which in turn distributes them elsewhere.
I want to compress these objects so the memory load on the pass-through server is reduced. I understand that compression algorithms like deflate need to build a dictionary so something like that wouldn't really work on data this small.
Are there any algorithms that could compress data like this efficiently?
If not, another thing I could do is batch these chunks into arrays of objects and compress the array once it gets to be a certain size. But I am reluctant to do this because I would have to change interfaces in an existing system. Compressing them individually would not require any interface changes, the way this is all set up.
Not that I think it matters, but the target system is Java.
Edit: Would Elias gamma coding be the best for this situation?
Thanks
If you think that reducing your data packet to its entropy level is the best that can be done, you can try simple Huffman compression.
For an early look at how well this would compress, you can pass a packet through Huff0 :
http://fastcompression.blogspot.com/p/huff0-range0-entropy-coders.html
It is a simple order-0 Huffman encoder, so the result will be representative.
For more specific ideas on how to efficiently use the characteristics of your data, it would help to describe a bit what data the packets contain and how they are generated (as you have done in the comments, so they are ints (4 bytes?) and longs (8 bytes?)), and then provide one or a few samples.
It sounds like you're currently looking at general-purpose compression algorithms. The most effective way to compress small chunks of data is to build a special-purpose compressor that knows the structure of your data.
The important thing is that you need to match the coding you use with the distribution of values you expect from your data: to get a good result from Elias gamma coding, you need to make sure the values you code are smallish positive integers...
If different integers within the same block are not completely independent (e.g., if your arrays represent a time series), you may be able to use this to improve your compression (e.g., the differences between successive values in a time series tend to be smallish signed integers). However, because each block needs to be independently compressed, you will not be able to take this kind of advantage of differences between successive blocks.
If you're worried that your compressor might turn into an "expander", you can add an initial flag to indicate whether the data is compressed or uncompressed. Then, in the worst case where your data doesn't fit your compression model at all, you can always punt and send the uncompressed version; your worst-case overhead is the size of the flag...
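A minimal sketch of that flag in Java, with the compressed form coming from whatever compressor you end up choosing (the class and method names are made up for illustration):

    public class FlaggedPacket {
        // Keep the compressed form only if it is actually smaller, and spend one
        // prefix byte to say which form follows; that byte is the worst-case overhead.
        static byte[] packWithFlag(byte[] raw, byte[] compressed) {
            boolean useCompressed = compressed.length < raw.length;
            byte[] payload = useCompressed ? compressed : raw;
            byte[] out = new byte[payload.length + 1];
            out[0] = (byte) (useCompressed ? 1 : 0);
            System.arraycopy(payload, 0, out, 1, payload.length);
            return out;
        }
    }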
Elias Gamma Coding might actually increase the size of your data.
You already have upper bounds on your numbers (whatever fits into a 4-byte or 8-byte int/long). Elias gamma encodes the bit length of each number as a unary prefix, followed by the number itself, which is probably not what you want. If you get many small values, it might make things smaller; if you also get big values, it will probably increase the size (the 8-byte unsigned max value would become almost twice as big).
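A minimal sketch of Elias gamma coding, to make that size behaviour concrete (appending bits to a java.util.BitSet is just one convenient way to hold the output):

    import java.util.BitSet;

    public class EliasGamma {
        // Append the Elias gamma code of x (x >= 1) to 'out' starting at bit position 'pos';
        // returns the new position. The code is floor(log2(x)) zero bits followed by the
        // binary representation of x, so an n-bit value costs 2n - 1 bits.
        static int encode(long x, BitSet out, int pos) {
            if (x < 1) throw new IllegalArgumentException("gamma code needs x >= 1");
            int n = 63 - Long.numberOfLeadingZeros(x);   // floor(log2(x))
            pos += n;                                    // unary prefix: n zero bits (BitSet defaults to 0)
            for (int i = n; i >= 0; i--) {
                if (((x >>> i) & 1L) == 1L) out.set(pos);
                pos++;
            }
            return pos;
        }

        public static void main(String[] args) {
            BitSet bits = new BitSet();
            int pos = 0;
            pos = encode(1, bits, pos);              // 1 bit
            pos = encode(9, bits, pos);              // 7 bits: 000 1001
            pos = encode(Long.MAX_VALUE, bits, pos); // 125 bits for a 63-bit value
            System.out.println("total bits used: " + pos);
        }
    }

Small values are cheap, but a full-width long costs almost double its natural size, which is exactly the expansion mentioned above.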
Look at the entropy of your data packets. If it's close to the maximum, compression will be useless. Otherwise, try different general-purpose compressors, though I'm not sure the time spent compressing and decompressing is worth the size reduction.
I would have a close look at the options of your compression library, for instance deflateSetDictionary() and the flag Z_FILTERED in http://www.zlib.net/manual.html. If you can distribute - or hardwire in the source code - an agreed dictionary to both sender and receiver ahead of time, and if that dictionary is representative of real data, you should get decent compression savings. Oops - in Java look at java.util.zip.Deflater.setDictionary() and FILTERED.
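Here is a minimal sketch of that in Java, assuming both ends ship the same DICTIONARY constant (the dictionary and input strings below are made up for illustration):

    import java.nio.charset.StandardCharsets;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    public class PresetDictionaryDemo {
        static final byte[] DICTIONARY =
                "common,field,names,and,values,seen,in,real,packets".getBytes(StandardCharsets.UTF_8);

        public static void main(String[] args) throws Exception {
            byte[] input = "field=values,seen=common,names=real".getBytes(StandardCharsets.UTF_8);

            Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
            deflater.setStrategy(Deflater.FILTERED);   // the FILTERED strategy mentioned above
            deflater.setDictionary(DICTIONARY);        // must be set before compressing any data
            deflater.setInput(input);
            deflater.finish();
            byte[] buf = new byte[256];
            int n = deflater.deflate(buf);
            deflater.end();

            Inflater inflater = new Inflater();
            inflater.setInput(buf, 0, n);
            byte[] out = new byte[256];
            int produced = inflater.inflate(out);
            if (produced == 0 && inflater.needsDictionary()) {
                inflater.setDictionary(DICTIONARY);    // the receiver supplies the same dictionary
                produced = inflater.inflate(out);
            }
            inflater.end();
            System.out.println(new String(out, 0, produced, StandardCharsets.UTF_8));
        }
    }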

Compression algorithms for a sequence of integers

Are there any good compression algorithms for a large sequence of integers (A/D converter data)? There is a similar question,
but the data is different in my case: it can be negative or positive and changes like wave data.
EDIT1:sample data added
Please refer to this file for a data sample
Generally, if you have some knowledge about the signal, use it to predict the next value based on the previous ones, then compress the difference between the predicted and real values.
If the prediction is good, the differences will be small and they will compress well.
Anything more specific is unlikely to be possible without seeing the data and knowing its physical nature.
update:
If the prediction is really good and uses all knowledge about the dependencies, the differences are likely to be independent, and something like arithmetic coding would work well on them.
You want to delta encode and then apply RLE or a Golomb code. A Golomb code can be as good as a Huffman code.
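As a minimal sketch of the delta step in Java (the zigzag mapping is one common way to fold the signed differences into small non-negative values, which is what Golomb/Rice coders expect; the sample array is illustrative):

    public class DeltaEncode {
        // Store the first sample, then each difference from the previous one,
        // zigzag-mapped so small positive and negative deltas both become small values.
        static long[] delta(long[] samples) {
            long[] out = new long[samples.length];
            long prev = 0;
            for (int i = 0; i < samples.length; i++) {
                out[i] = zigzag(samples[i] - prev);
                prev = samples[i];
            }
            return out;
        }

        static long zigzag(long v) {          // 0,-1,1,-2,2,... -> 0,1,2,3,4,...
            return (v << 1) ^ (v >> 63);
        }

        public static void main(String[] args) {
            long[] wave = {0, 12, 25, 30, 28, 20, 5, -10};
            System.out.println(java.util.Arrays.toString(delta(wave)));
            // the small output values are cheap to code with RLE, Golomb or Huffman
        }
    }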
Nearly any standard compression algorithm for byte strings can be applied; after all, any file of data can be interpreted as a sequence of signed integers. Is there something special about your particular integers that you think will make them amenable to some more-specific algorithm? You mention wave data; maybe take a look at FLAC which is designed for audio data; if your data has similar characteristics those techniques may be valuable.
You could diff the data then apply RLE on suitable subregions (i.e. between inflection points).
