Text Compression - What algorithm to use

Text Compression - What algorithm to use - algorithm

I need to compress some text data of the form
[70,165,531,0|70,166,562|"hi",167,578|70,171,593|71,179,593|73,188,609|"a",1,3|
The data contains a few thousand characters(10000 - 50000 approx).
I read upon the various compression algorithms, but cannot decide which one to use here.
The important thing here is : The compressed string should contain only alphanumberic characters(or a few special characters like +-/&%#$..) I mean most algorithms provide gibberish ascii characters as compressed data right? That must be avoided.
Can someone guide me on how to proceed here?
P.S The text contains numbers , ' and the | character predominantly. Other characters occur very very rarely.

Actually your requirement to limit the output character set to printable characters automatically costs you 25% of your compression gain, as out of 8 bits per by you'll end up using roughly 6.
But if that's what you really want, you can always base64 or the more space efficient base85 the output to reconvert the raw bytestream to printable characters.
Regarding the compression algorithm itself, stick to one of the better known ones like gzip or bzip2, for both well tested open source code exists.
Selecting "the best" algorithm is actually not that easy, here's an excerpt of the list of questions you have to ask yourself:
do i need best speed on the encoding or decoding side (eg bzip is quite asymmetric)
how important is memory efficiency both for the encoder and the decoder? Could be important for embedded applications
is the size of the code important, also for embedded
do I want pre existing well tested code for encoder or decorder or both only in C or also in another language
and so on
The bottom line here is probably, take a representative sample of your data and run some tests with a couple of existing algorithms, and benchmark them on the criteria that are important for your use case.

Just one thought: You can solve your two problems independently. Use whatever algorithm gives you the best compression (just try out a few on your kind of data. bz2, zip, rar -- whatever you like, and check the size), and then to get rid of the "gibberish ascii" (that's actually just bytes there...), you can encode your compressed data with Base64.
If you really put much thought into it, you might find a better algorithm for your specific problem, since you only use a few different chars, but if you stumble upon one, I think it's worth a try.

Related

Algorithm to compress a lot of small strings?

I am looking for an algorithm to compress small ASCII strings. They contain lots of letters but they also can contain numbers and rarely special characters. They will be small, about 50-100 bytes average, 250 max.
Examples:
Android show EditText.setError() above the EditText and not below it
ImageView CENTER_CROP dont work
Prevent an app to show on recent application list on android kitkat 4.4.2
Image can't save validable in android
Android 4.4 SMS - Not receiving sentIntents
Imported android-map-extensions version 2.0 now my R.java file is missing
GCM registering but not receiving messages on pre 4.0.4. devices
I want to compress the titles one by one, not many titles together and I don't care much about CPU and memory usage.

You can use Huffman coding with a shared Huffman tree among all texts you want to compress.
While you typically construct a Huffman tree for each string to be compressed separately, this would require a lot of overhead in storage which should be avoided here. That's also the major problem when using a standard compression scheme for your case: most of them have some overhead which kills your compression efficiency for very short strings. Some of them don't have a (big) overhead but those are typically less efficient in general.
When constructing a Huffman tree which is later used for compression and decompression, you typically use the texts which will be compressed to decide which character is encoded with which bits. Since in your case the texts to be compressed seem to be unknown in advance, you need to have some "pseudo" texts to build the tree, maybe from a dictionary of the human language or some experience of previous user data.
Then construct the Huffman tree and store it once in your application; either hardcode it into the binary or provide it in the form of a file. Then you can compress and decompress any texts using this tree. Whenever you decide to change the tree since you gain better experience on which texts are compressed, the compressed string representation also changes. It might be a good idea to introduce versioning and store the tree version together with each string you compress.
Another improvement you might think about is to use multi-character Huffman encoding. Instead of compressing the texts character by character, you could find frequent syllables or words and put them into the tree too; then they require even less bits in the compressed string. This however requires a little bit more complicated compression algorithm, but it might be well worth the effort.
To process a string of bits in the compression and decompression routine in C++(*), I recommend either boost::dynamic_bitset or std::vector<bool>. Both internally pack multiple bits into bytes.
(*)The question once had the c++ tag, so OP obviously wanted to implement it in C++. But as the general problem is not specific to a programming language, the tag was removed. But I still kept the C++-specific part of the answer.

Text Compression Algorithm

I am just wondering if someone could introduce me any algorithm that compresses Unicode text to 10-20 percent of its original size ?
actually I've read Lempel-Ziv compression algorithm which reduces size of text to 60% of original size, but I've heard that there are some algorithms with this performance

If You are considering only text compression than the very first algorithm that uses entropy based encryption called Huffman Encoding
Huffman Coding
Then there is LZW compression which uses a dictionary encoding to use previously used sequence of letter to assign codes to reduce size of file.
LZW compression
I think above two are sufficient for encoding text data efficiently and are easy to implement.
Note: Donot expect good compression on all files, If data is random with no pattern than no compression algorithm can give you any compression at all. Percentage of compression depends on symbols appearing in the file not only on the algorithm used.

LZ-like coders are not any good for text compression.
The best one for direct use with unicode would be lzma though, as it has position alignment options. (http://www.7-zip.org/sdk.html)
But for best compression, I'd suggest to convert unicode texts to a bytewise format,
eg. utf8, and then use an algorithm with known good results on texts, eg.
BWT (http://libbsc.com) or PPMd (http://compression.ru/ds/ppmdj1.rar).
Also some preprocessing can be applied to improve results of text compression
(see http://xwrt.sourceforge.net/)
And there're some compressors with even better ratio than suggested ones
(mostly paq derivatives), but they're also much slower.
Here I tested various representations of russian translation of
Witten's "Modeling for text compression":
7z rar4 paq8px69
modeling_win1251.txt 156091 50227 42906 36254
modeling_utf16.txt 312184 52523 50311 38497
modeling_utf8.txt 238883 53793 44231 37681
modeling_bocu.txt 165313 53073 44624 38768
modeling_scsu.txt 156261 50499 42984 36485
It shows that longer input doesn't necessarily mean better overall compression,
and that SCSU, although useful, isn't really the best representation of unicode text
(win1251 codepage is one, too).

PAQ is the new reigning champion of text compression...There are a few different flavors and information about them can be found here.
There are three flavors that I recommend:
ZPAQ - Future facing container for PAQ algorithims (created to make the future of PAQ easier)
PAQ8PX/PAQ8KX - The most powerful, works with EXE and WAV files as well.
PAQ8PF - Faster (both compression and decompression) and mostly intended for TXT files
You have to build them yourself from source, fortunately someone made a GUI, FrontPAQ, that packages the two best binary into one.
Once you have a functional binary its simple to use, the documentation can be found here.
Note: I am aware this is a very old question, but I wish to include relevant modern data. I came looking for the same question, yet have found a newer more powerful answer.

Efficient way to encode bit-vectors?

Currently using the run length encoding for encoding bit-vectors, and the current run time is
2log(i), where is the size of the run. Is there another way of doing it to bring it down to log(i)?
Thanks.

The most efficient way of encoding a bit vector is to isolate any specific properties of the bit source. If it is totally random, there is no real noticeable gain (actually, a totally random stream of bit cannot be compressed in any way).
If you can find properties in your bit stream you could try to define a collection of vectors which will define the base of a Vector Space. In such case, the result will be very efficient.
We'll need a few more details on your bit stream.
(Edit)
Just a few more details to understand the previous statement:
"a totally random stream of bits cannot be compressed in any way"
It is not possible to compress a totally random vector of bits if by "compress" we mean the "transformed/compressed stream" plus the "vector base definition" plus the decompression program. But in most cases the decompression program (and often the vector base too) is embedded in client software. Thus, only the "compressed stream" is needed.
A good explanation (and funny story) about that is Patrick Craig 5000$ compression challenge
More scientific the theory of information, especially entropy section
And, the final one, the full story.
But whatever the solution is, if you have an unknown number of unknown streams to compress you won't be ale to do anything. You have to find a pattern.

Compression algorithms for a sequence of integers

Are there any good compression algorithms for a large sequence of integers (A/D converter data). There is similar question
But the data is different in my case. It can be negarive or positive and changing like wave data.
EDIT1:sample data added
Please refer to this file for a data sample

Generally if you have some knowledge about the signal, use it to predict next value basing on previous ones. Then - compress difference between predicted and real value.
If prediction is good, differences will be small and their compressing will be good.
Anything more specific is unlikely possible without seeing the data and knowing about its physical nature.
update:
If the prediction is really well and uses all knowledge about dependencies, the differences are likely to be independent and something like arithmetic encoding would work for them.

You want a Delta Encode and then you want to apply a RLE or a Golomb Code. The Golomb Code can be as good as a Huffman Code.

Nearly any standard compression algorithm for byte strings can be applied; after all, any file of data can be interpreted as a sequence of signed integers. Is there something special about your particular integers that you think will make them amenable to some more-specific algorithm? You mention wave data; maybe take a look at FLAC which is designed for audio data; if your data has similar characteristics those techniques may be valuable.

You could diff the data then apply RLE on suitable subregions (i.e. between inflection points).

How does PDF417 barcode decoding recover from damaged labels?

I recently learned about PDF417 barcodes and I was astonished that I can still read the barcode after I ripped it in half and scanned only a fragment of the original label.
How can the barcode decoding be that robust? Which (types of) algorithms are used during encoding and decoding?
EDIT: I understand the general philosophy of introducing redundancy to create robustness, but I'm interested in more details, i.e. how this is done with PDF417.

the pdf417 format allows for varying levels of duplication/redundancy in its content. the level of redundancy used will affect how much of the barcode can be obscured or removed while still leaving the contents readable

PDF417 does not use anything. It's a specification of encoding of data.
I think there is a confusion between the barcode format and the data it conveys.
The various barcode formats (PDF417, Aztec, DataMatrix) specify a way to encode data, be it numerical, alphabetic or binary... the exact content though is left unspecified.
From what I have seen, Reed-Solomon is often the algorithm used for redundancy. The exact level of redundancy is up to you with this algorithm and there are libraries at least in Java and C from what I've been dealing with.
Now, it is up to you to specify what the exact content of your barcode should be, including the algorithm used for redundancy and the parameters used by this algorithm. And of course you'll need to work hand in hand with those who are going to decode it :)
Note: QR seems slightly different, with explicit zones for redundancy data.

I don't know the PDF417. I know that QR codes use Reed Solomon correction. It is an oversampling technique. To get the concept: suppose you have a polynomial in the power of 6. Technically, you need seven points to describe this polynomial uniquely, so you can perfectly transmit the information about the whole polynomial with just seven points. However, if one of these seven is corrupted, you miss the information whole. To work around this issue, you extract a larger number of points out of the polynomial, and write them down. As long as you have at least seven out of the bunch, it will be enough to reconstruct your original information.
In other words, you trade space for robustness, by introducing more and more redundancy. Nothing new here.

I do not think the concept of trade off between space and robustness is any different here as anywhere else. Think RAID, let's say RAID 5 - you can yank a disk out of the array and the data is still available. The price? - an extra disk. Or in terms of the barcode - extra space the label occupies

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio