Low Level JPEG Encoding - image

I am trying to implement a JPEG encoder using only the lowest-level operations possible. It is relatively doable up to Huffman encoding, where most tutorials use pointers and binary trees to build the table and encode the image. Can someone more familiar with the JPEG standard point me toward the simplest compression technique that I could implement with low-level operations (+, -, *, shifts, loops, if statements)?
I have heard there are standard Huffman tables (I can't really find one). Are these typically a good idea to use? Once I have a standard Huffman table, how simple is it to encode an 8x8 chunk with it? I stopped here because I didn't want to go down a rabbit hole.

The standard will answer all of your questions. "Typical" Huffman tables are provided there in Annex K, section K.3, along with the specifications for those tables. You can just hard-code those into your implementation and you will get good performance.
Since the Huffman codes are given, you would not need to implement the Huffman algorithm, which is what needs the pointers and trees you are referring to. (Not terribly complicated, so it is something you should tackle later, once you get the pre-defined codes working.)
You can just use the code table and a set of operators that, along with the others in your list, includes bitwise or (|) and bitwise and (&). The encoding process is simple: keep a bit buffer in an integer, accumulate code bits into it using the shift and or operators, and pull bytes from the buffer to write to the output whenever it holds eight or more bits.
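A minimal sketch of that bit buffer, written in Python for brevity but using only shifts, masks, and loops so it ports directly to C. The class and method names are made up for illustration. Two details are not in the answer above but come from the JPEG format itself: a 0x00 byte is stuffed after every 0xFF in the entropy-coded data, and the final partial byte is padded with 1 bits.

```python
class BitWriter:
    """Accumulates variable-length codes and emits whole bytes (hypothetical helper)."""

    def __init__(self):
        self.buffer = 0          # pending bits, newest bits in the low end
        self.count = 0           # number of valid bits currently in buffer
        self.out = bytearray()   # finished output bytes

    def put(self, code, length):
        # Append the low `length` bits of `code`, most significant bit first.
        self.buffer = (self.buffer << length) | (code & ((1 << length) - 1))
        self.count += length
        # Pull out whole bytes while eight or more bits are pending.
        while self.count >= 8:
            self.count -= 8
            byte = (self.buffer >> self.count) & 0xFF
            self.out.append(byte)
            if byte == 0xFF:
                self.out.append(0x00)   # JPEG byte stuffing after 0xFF
        # Keep only the bits that have not been written yet.
        self.buffer &= (1 << self.count) - 1

    def flush(self):
        # Pad the last partial byte with 1 bits so it can be written.
        if self.count > 0:
            pad = 8 - self.count
            self.put((1 << pad) - 1, pad)


w = BitWriter()
w.put(0b101, 3)      # a 3-bit code
w.put(0b00110, 5)    # a 5-bit code completes the first byte
w.put(0b01, 2)       # two leftover bits
w.flush()            # padded with six 1 bits
print(w.out.hex())
```

Each Huffman-coded symbol is then just one `put(code, length)` call with the pair looked up in the hard-coded table.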

Related

Efficient way to encode bit-vectors?

I am currently using run-length encoding for bit-vectors, and the current run time is 2log(i), where i is the size of the run. Is there another way of doing it that brings this down to log(i)?
Thanks.
The most efficient way of encoding a bit vector is to isolate any specific properties of the bit source. If it is totally random, there is no real noticeable gain (actually, a totally random stream of bits cannot be compressed in any way).
If you can find properties in your bit stream you could try to define a collection of vectors which will define the base of a Vector Space. In such case, the result will be very efficient.
We'll need a few more details on your bit stream.
(Edit)
Just a few more details to understand the previous statement:
"a totally random stream of bits cannot be compressed in any way"
It is not possible to compress a totally random vector of bits if by "compress" we mean the "transformed/compressed stream" plus the "vector base definition" plus the decompression program. But in most cases the decompression program (and often the vector base too) is embedded in client software. Thus, only the "compressed stream" is needed.
A good explanation (and a funny story) about that is Patrick Craig's $5000 compression challenge.
More scientific: information theory, especially the section on entropy.
And, the final one, the full story.
But whatever the solution is, if you have an unknown number of unknown streams to compress, you won't be able to do anything. You have to find a pattern.

Papers on fast validation of UTF-8

Are there any papers on state-of-the-art UTF-8 validators/decoders? I've seen implementations "in the wild" that use clever loops that process up to 8 bytes per iteration in common cases (e.g. all 7-bit ASCII input).
I don't know about papers. It's probably too specific and narrow a subject for strictly scientific analysis, and is rather an engineering problem. You can start by looking at how this is handled in different libraries. Some solutions will use language-specific tricks while others are very general. For Java, you can start with the code of UTF8ByteBufferReader, a part of Javolution. I have found this to be much faster than the character set converters built into the language. I believe (but I'm not sure) that the latter use a common piece of code for many encodings plus encoding-specific data files. Javolution, in contrast, has code designed specifically for UTF-8.
There are also some techniques used for specific tasks. For example, if you only need to calculate how many bytes a UTF-8 character takes as you parse the text, you can use a table of 256 values indexed by the first byte of the UTF-8 encoded character. Skipping over characters or calculating a string's length in characters this way is much faster than using bit operations and conditionals.
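A small sketch of the 256-entry table idea, in Python for illustration (the function names are made up). It assumes the input is already valid UTF-8, since it only looks at lead bytes; entries of 0 mark bytes that can never start a sequence (continuation bytes, the overlong lead bytes 0xC0 and 0xC1, and 0xF5 and above).

```python
def build_utf8_length_table():
    """Map each possible lead byte to the length of its UTF-8 sequence."""
    table = [0] * 256               # 0 = byte cannot start a sequence
    for b in range(0x00, 0x80):     # ASCII
        table[b] = 1
    for b in range(0xC2, 0xE0):     # two-byte sequences
        table[b] = 2
    for b in range(0xE0, 0xF0):     # three-byte sequences
        table[b] = 3
    for b in range(0xF0, 0xF5):     # four-byte sequences
        table[b] = 4
    return table


UTF8_LEN = build_utf8_length_table()


def count_code_points(data):
    """Count characters by hopping from lead byte to lead byte."""
    i = 0
    n = 0
    while i < len(data):
        step = UTF8_LEN[data[i]]
        if step == 0:
            raise ValueError("byte at offset %d cannot start a sequence" % i)
        i += step
        n += 1
    return n


text = "héllo €😀"
print(count_code_points(text.encode("utf-8")), len(text))
```

Note that this skips continuation bytes without checking them, so it is a length counter, not a validator.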
For some situations, e.g. if you can waste some memory and if you know that most characters you encounter will be from the Basic Multilingual Plane, you could try even more aggressive lookup tables. For example, first calculate the length in bytes by the method described above, and if it's 1 or 2 bytes (maybe 3 makes sense too), look up the decoded char in a table. Remember, however, to benchmark this and any other algorithm you try, as it need not be faster at all (bit operations are quite fast, and with a big lookup table you lose locality of reference, plus the offset calculation isn't completely free either).
Anyway, I suggest you start by looking at the Javolution code or another similar library.

Compression algorithms for a sequence of integers

Are there any good compression algorithms for a large sequence of integers (A/D converter data)? There is a similar question,
but the data is different in my case. It can be negative or positive, and it changes like wave data.
EDIT1:sample data added
Please refer to this file for a data sample
Generally, if you have some knowledge about the signal, use it to predict the next value based on the previous ones. Then compress the difference between the predicted and the real value.
If the prediction is good, the differences will be small and will compress well.
Anything more specific is unlikely to be possible without seeing the data and knowing its physical nature.
update:
If the prediction is really good and uses all knowledge about dependencies, the differences are likely to be independent, and something like arithmetic coding would work for them.
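As an illustration of predict-then-code-the-difference, here is a sketch that uses simple linear extrapolation as the predictor. That choice is only an assumption made for the example; a real predictor should come from what you know about the signal. The function names are made up.

```python
def predict(history):
    """Linear extrapolation from the last two samples (assumed predictor)."""
    if len(history) == 0:
        return 0
    if len(history) == 1:
        return history[0]
    return 2 * history[-1] - history[-2]


def residuals(samples):
    """Replace each sample by its difference from the predicted value."""
    out = []
    for n, x in enumerate(samples):
        out.append(x - predict(samples[:n]))
    return out


def reconstruct(res):
    """Invert residuals() by running the same predictor on decoded samples."""
    samples = []
    for r in res:
        samples.append(r + predict(samples))
    return samples


ramp = [0, 10, 20, 30, 40]
print(residuals(ramp))                 # mostly zeros for a straight ramp
print(reconstruct(residuals(ramp)))    # the original samples again
```

The decoder only needs the residuals, because it can rebuild every prediction from the samples it has already decoded.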
You want a Delta Encode and then you want to apply a RLE or a Golomb Code. The Golomb Code can be as good as a Huffman Code.
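A rough sketch of that pipeline: delta encoding followed by a Rice code (the Golomb code with a power-of-two parameter). Since deltas can be negative, the sketch first folds them to non-negative numbers with a zigzag mapping; that mapping and the parameter `k` are assumptions of this example, not part of the answer. Bits are kept in a plain list for readability.

```python
def delta_encode(samples):
    """Replace each sample by its difference from the previous one."""
    prev = 0
    out = []
    for x in samples:
        out.append(x - prev)
        prev = x
    return out


def zigzag(v):
    """Fold signed to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * v if v >= 0 else -2 * v - 1


def unzigzag(u):
    return u >> 1 if u & 1 == 0 else -((u + 1) >> 1)


def rice_encode(values, k):
    """Quotient in unary (1s ended by a 0), then k remainder bits."""
    bits = []
    for v in values:
        u = zigzag(v)
        bits.extend([1] * (u >> k))
        bits.append(0)
        for i in range(k - 1, -1, -1):
            bits.append((u >> i) & 1)
    return bits


def rice_decode(bits, k, count):
    out = []
    pos = 0
    for _ in range(count):
        q = 0
        while bits[pos] == 1:       # read the unary quotient
            q += 1
            pos += 1
        pos += 1                    # skip the terminating 0
        r = 0
        for _bit in range(k):       # read the k-bit remainder
            r = (r << 1) | bits[pos]
            pos += 1
        out.append(unzigzag((q << k) | r))
    return out


samples = [100, 102, 101, 101, 98]
deltas = delta_encode(samples)
bits = rice_encode(deltas[1:], 2)   # first sample would be stored verbatim
print(deltas, len(bits))
```

Small deltas cost only a few bits each; `k` should be tuned to the typical magnitude of the deltas.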
Nearly any standard compression algorithm for byte strings can be applied; after all, any file of data can be interpreted as a sequence of signed integers. Is there something special about your particular integers that you think will make them amenable to some more-specific algorithm? You mention wave data; maybe take a look at FLAC which is designed for audio data; if your data has similar characteristics those techniques may be valuable.
You could diff the data then apply RLE on suitable subregions (i.e. between inflection points).

Where can I find a lossless compression algorithm, which produces headerless outputs?

Does anyone of you know a lossless compression algorithm, which produces headerless outputs?
For example do not store the huffman tree used to compress it? I do not speak about hard coded huffman trees, but I like to know if there is any algorithm that can compress and decompress input without storing some metadata in its output. Or is this even theoretically impossible?
Of course it is possible. Among others, the LZ family of compressors doesn't need to output anything apart from the compressed data itself, as the dictionary is built on-line as compression (or decompression) progresses. There are a lot of reference implementations of those LZ-type algorithms, for example LZMA, a component of 7zip.
Adaptive Huffman coding does exactly that. More generally, the term adaptive coding is used to describe entropy codes with this property. Some dictionary codes have this property too, e.g. run-length encoding (RLE) and Lempel-Ziv-Welch (LZW).
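To make the "no stored metadata" property concrete, here is a textbook-style LZW sketch (not taken from any particular library). Both sides start from the same 256 single-byte entries and grow the dictionary in lockstep, so the output is nothing but a list of codes. Packing those codes into bits is left out.

```python
def lzw_compress(data):
    """Return a list of integer codes for the given bytes."""
    table = {bytes([i]): i for i in range(256)}
    w = b""
    codes = []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc                      # keep extending the current match
        else:
            codes.append(table[w])      # emit the longest known match
            table[wc] = len(table)      # learn the new string
            w = bytes([byte])
    if w:
        codes.append(table[w])
    return codes


def lzw_decompress(codes):
    """Rebuild the bytes, re-deriving the same dictionary on the fly."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    w = table[codes[0]]
    out = bytearray(w)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            entry = w + w[:1]           # code refers to the entry being built
        else:
            raise ValueError("invalid LZW code")
        out += entry
        table[len(table)] = w + entry[:1]
        w = entry
    return bytes(out)


message = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(message)
print(len(message), len(codes), lzw_decompress(codes) == message)
```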
Run Length Encoding would be one example
LZO springs to mind. It's used in OpenVPN, with great results.
Why are you looking for compression algorithms with headerless compressed output?
Perhaps (a) you have a system like 2-way telephony that needs low-latency streaming compression/decompression.
The adaptive coding category of compression algorithms mentioned by Zach Scrivena and the LZ family of dictionary compression algorithms mentioned by Diego Sevilla and Javier are excellent for this kind of application.
Practical implementations of these algorithms usually do have a byte or two of metadata at the beginning (making them useless for (b) applications), but that has little or no effect on latency.
Perhaps (b) you are mainly interested in cryptography, and you have heard that compress-before-encrypt gives some improved security properties, as long as the compressed text does not have a fixed metadata header "crib".
Modern encryption algorithms aren't (as far as we know) vulnerable to such "cribs", but if you're paranoid you might be interested in "bijective compression" (a, b, c, etc.).
It's not possible to detect errors in transmission (flipped bits, inserted bits, deleted bits, etc.) when a receiver gets such compressed output (making these algorithms not especially useful for (a) applications).
Perhaps (c) you are interested in headerless compression for some other reason. Sounds fascinating -- what is that reason?

What are some alternatives to a bit array?

I have an information retrieval application that creates bit arrays on the order of tens of millions of bits. The number of "set" bits in the array varies widely, from all clear to all set. Currently, I'm using a straightforward bit array (java.util.BitSet), so each of my bit arrays takes several megabytes.
My plan is to look at the cardinality of the first N bits, then make a decision about what data structure to use for the remainder. Clearly some data structures are better for very sparse bit arrays, and others when roughly half the bits are set (when most bits are set, I can use negation to treat it as a sparse set of zeroes).
What structures might be good at each extreme?
Are there any in the middle?
Here are a few constraints or hints:
The bits are set only once, and in index order.
I need 100% accuracy, so something like a Bloom filter isn't good enough.
After the set is built, I need to be able to efficiently iterate over the "set" bits.
The bits are randomly distributed, so run-length–encoding algorithms aren't likely to be much better than a simple list of bit indexes.
I'm trying to optimize memory utilization, but speed still carries some weight.
Something with an open source Java implementation is helpful, but not strictly necessary. I'm more interested in the fundamentals.
Unless the data is truly random and has a symmetric 1/0 distribution, this simply becomes a lossless data compression problem, very analogous to the CCITT Group 3 compression used for black-and-white (i.e., binary) fax images. CCITT Group 3 uses a Huffman coding scheme. Fax uses a fixed set of Huffman codes, but you can generate a specific set of codes for each data set to improve the compression ratio achieved. As long as you only need to access the bits sequentially, as you implied, this will be a pretty efficient approach. Random access would create some additional challenges, but you could probably generate a binary search tree index to various offset points in the array that would allow you to get close to the desired location and then walk in from there.
Note: The Huffman scheme still works well even if the data is random, as long as the 1/0 distribution is not perfectly even. That is, the less even the distribution, the better the compression ratio.
Finally, if the bits are truly random with an even distribution, then, well, according to Mr. Claude Shannon, you are not going to be able to compress it any significant amount using any scheme.
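To put numbers on "the less even the distribution, the better the compression ratio", here is a small sketch of Shannon's binary entropy function, which gives the minimum average number of bits per input bit when the bits are independent and set with probability p. The 10-million-bit array size is just an example figure.

```python
import math


def binary_entropy(p):
    """Minimum average bits needed per input bit when P(bit = 1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


n_bits = 10_000_000   # example array size
for p in (0.5, 0.25, 0.1, 0.01):
    bound_bytes = n_bits * binary_entropy(p) / 8
    print("p=%.2f  H=%.4f  lower bound ~%d bytes" % (p, binary_entropy(p), bound_bytes))
```

At p = 0.5 the bound equals the size of the plain bit array, which is the "no significant compression" case.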
I would strongly consider using range encoding in place of Huffman coding. In general, range encoding can exploit asymmetry more effectively than Huffman coding, but this is especially so when the alphabet size is so small. In fact, when the "native alphabet" is simply 0s and 1s, the only way Huffman can get any compression at all is by combining those symbols -- which is exactly what range encoding will do, more effectively.
Maybe too late for you, but there is a very fast and memory-efficient library for sparse bit arrays (lossless) and other data types based on tries. Look at Judy arrays.
Thanks for the answers. This is what I'm going to try for dynamically choosing the right method:
I'll collect all of the first N hits in a conventional bit array, and choose one of three methods, based on the symmetry of this sample.
If the sample is highly asymmetric, I'll simply store the indexes to the set bits (or maybe the distance to the next bit) in a list.
If the sample is highly symmetric, I'll keep using a conventional bit array.
If the sample is moderately symmetric, I'll use a lossless compression method like Huffman coding suggested by InSciTekJeff.
The boundaries between the asymmetric, moderate, and symmetric regions will depend on the time required by the various algorithms balanced against the space they need, where the relative value of time versus space would be an adjustable parameter. The space needed for Huffman coding is a function of the symmetry, and I'll profile that with testing. Also, I'll test all three methods to determine the time requirements of my implementation.
It's possible (and actually I'm hoping) that the middle compression method will always be better than the list or the bit array or both. Maybe I can encourage this by choosing a set of Huffman codes adapted for higher or lower symmetry. Then I can simplify the system and just use two methods.
One more compression thought:
If the bit array is not crazy long, you could try applying the Burrows-Wheeler transform before a compression stage such as run-length or Huffman coding. A naive implementation would take O(n^2) memory during (de)compression and O(n^2 log n) time to decompress, though there are almost certainly shortcuts to be had. But if there's any sequential structure to your data at all, this should really help the Huffman encoding out.
You could also apply that idea to one block at a time to keep the time/memory usage more practical. Using one block at time could allow you to always keep most of the data structure compressed if you're reading/writing sequentially.
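For reference, a naive Burrows-Wheeler transform and its inverse can be sketched as below. This is the slow textbook version that sorts all rotations, so it only suits small blocks. It works on bytes rather than individual bits, and it assumes the sentinel byte 0x00 never occurs in the input; both are simplifications made for the example.

```python
def bwt(data, end=b"\x00"):
    """Last column of the sorted rotations of data + sentinel."""
    s = data + end                      # sentinel must not occur in data
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return bytes(r[-1] for r in rotations)


def inverse_bwt(last, end=b"\x00"):
    """Rebuild the rotation table column by column, then pick the original row."""
    n = len(last)
    table = [b""] * n
    for _ in range(n):
        # Prepend the last column to every row and sort again.
        table = sorted(bytes([last[i]]) + table[i] for i in range(n))
    for row in table:
        if row.endswith(end):           # the row that ends with the sentinel
            return row[:-1]
    raise ValueError("sentinel not found")


block = b"banana"
transformed = bwt(block)
print(transformed, inverse_bwt(transformed))
```

The transform groups equal symbols together, which is what helps a later run-length or Huffman stage.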
Straightforward lossless compression is the way to go. To make it searchable you will have to compress relatively small blocks and create an index into an array of the blocks. This index can contain the bit offset of the starting bit in each block.
Quick combinatoric proof that you can't really save much space:
Suppose you have an arbitrary subset of n/2 bits set to 1 out of n total bits. You have (n choose n/2) possibilities. Using Stirling's formula, this is roughly 2^n / sqrt(n) * sqrt(2/pi). If every possibility is equally likely, then there's no way to give more likely choices shorter representations. So we need log_2 (n choose n/2) bits, which is about n - (1/2) log_2(n) bits.
That's not a very good savings of memory. For example, if you're working with n=2^20 (1 meg), then you can only save about 10 bits. It's just not worth it.
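The estimate is easy to check numerically. The sketch below computes log_2 (n choose n/2) through the log-gamma function, so the huge binomial coefficient never has to be built; the function name is made up for the example.

```python
import math


def log2_central_binomial(n):
    """log2 of (n choose n/2) for even n, via lgamma to avoid huge integers."""
    return (math.lgamma(n + 1) - 2.0 * math.lgamma(n / 2 + 1)) / math.log(2)


n = 2 ** 20
bits_needed = log2_central_binomial(n)
print("bits needed: %.2f, bits saved: %.2f" % (bits_needed, n - bits_needed))
```

The saving comes out a little above 10 bits, because the constant factor sqrt(2/pi) from Stirling's formula contributes a fraction of a bit on top of (1/2) log_2(n).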
Having said all that, it also seems very unlikely that any really useful data is truly random. In case there's any more structure to your data, there's probably a more optimistic answer.
