encoding most efficient way 64 character sequence for lesser writing time to memory - performance

The problem is as follows: Given a 64 charater sequences which is built from the english alphabet having 26 charcaters therefore just case characters, the occurrence distribution is such that any character has an equal chance of occurring at a given time.
Due to the fact that I have some computation which needs to be done with regards to the sequences, which requires writing to a text files, since the amount of sequences goes beyond a given ram. I thought of encoding a sequence such that I would be able to have lesser amount of bytes to write to a text file per given sequence.
With such reasoning I thought of the L-Z which would allow me to go down to 40 bytes. Is there any way by which i can go lower to encode a 64 character sequence?

With a large(-ish) lookup table you could encode each of the possible 26^64 character sequences in 301 (actually 300.8281==log2(26^64)) bits. This is slightly less than the 320 bits your straightforward compression would use. It is also the theoretical minimum given that any of the 26 characters occurs with equal probability.
Since you could derive the lookup table at any time you don't even need to store it. I suppose the bits used to represent the functions to encode a character string into a 301-bit integer and vice-versa ought to be counted into your compression ratio.
This is, of course, a long-winded restatement of #lhf's comment.

Related

How to build unique file names based on its content?

I plan to build a unique file name based on its content. For example, by its SHA256 hash. Files with the same content must have the same name.
The easiest way is to convert hash to a hex string. A file name will be 32 bytes length * 2 = 64 characters. This is pretty long name to operate with. How to make it shorter?
I implemented a sort of "Base32" coding - a vocabulary string that includes digits and 22 letters. I use only five bits of every byte to build file name with 32 characters. Much better.
I am looking for a balance between file name length and low collision probability. If the number of files is expected to be less than 500K, how long should the filename be? 8? 16? 24? 32?
Is there any recommended method to build short unique filenames at all?
If you use an N-bit cryptographic hash on M files, you can estimate the probability of at least one collision to be M2/2N+1
For 500K files, that's about 1/2N-37
Using base32, 16 chars gives probability of collision 1/243 -- a few trillion to 1 odds.
If that won't do, then 24 chars gives 1/283.
If you're willing to check them all and re-generate on collision, then 8 chars is fine.
Number of collisions depend on the content of the files, the hash-algorithm and the length of the hash.
In general: The longer the hash-value is the less likely are collisions (if your content does not especially provoke collisions).
You cannot avoid the possibility of collisions unless you use the content as file-name (or a lossless compression of it).
To shorten the filenames you could allow more different characters for the file-name. (But we aware what characters your OS allows and which you are willing to use).
I would go for a kind of base32 encoding to avoid problems with filesystems that do not distinguish between upper and lower case character.

Are both of these algorithms valid implementations of LZSS?

I am reverse engineering things and I often stumble upon various decompression algorithms. Most of times, it's LZSS just like Wikipedia describes it:
Initialize dictionary of size 2^n
While output is less than known output size:
Read flag
If the flag is set, output literal byte (and append it at the end of dictionary)
If the flag is not set:
Read length and look behind position
Transcribe length bytes from the dictionary at look behind position to the output and at the end of dictionary.
The thing is that the implementations follow two schools of how to encode the flag. The first one treats the input as sequence of bits:
(...)
Read flag as one bit
If it's set, read literal byte as 8 unaligned bits
If it's not set, read length and position as n and m unaligned bits
This involves lots of bit shift operations.
The other one saves a little CPU time by using bitwise operations only for flag storage, whereas literal bytes, length and position are derived from aligned input bytes. To achieve this, it breaks the linearity by fetching a few flags in advance. So the algorithm is modified like this:
(...)
Read 8 flags at once by reading one byte. For each of these 8 flags:
If it's set, read literal as aligned byte
If it's not set, read length and position as aligned bytes (deriving the specific values from the fetched bytes involves some bit operations, but it's nowhere as expensive as the first version.)
My question is: are these both valid LZSS implementations, or did I identify these algorithms wrong? Are there any known names for them?
They are effectively variants on LZSS, since all use one bit to decide on literal vs. match. More generally they are variants on LZ77.
Deflate is also a variant on LZ77, which does not use a whole bit for literal vs. match. Instead deflate has a single code for the combination of literals and lengths, so the code implicitly determines whether the next thing is a literal or a match. A length code is followed by a separate distance code.
lz4 (a specific algorithm, not a family) handles byte alignment in a different way, coding the number of literals, which is necessarily followed by a match. The first byte with the number of literals also has part of the distance. The literals are byte aligned, as is the offset that follows the literals and the rest of the distance.

Compress many numbers into a string

I was wondering if there's a way to compress 20 or so large numbers (~10^8) into a string of a reasonable length. For instance, if the numbers were stored as hex and concatenated, it'd be at least 160 characters long. I wonder if there's a smart way to compress the numbers in and get them back out. I was thinking about having a sequence 0-9 as reference and let one part of the input string be a number <1024. That number is to be converted to binary, which serves as a mask, i.e. indicating which digits exist in the number. It's still not clear where to go on from here.
Are there any better alternatives?
Thanks
If these large numbers are of the same size in bytes, and if you always know the count of those numbers, there is an easy way to do it. You simply Have an array of your bytes, and instead of reading them out as integers, you read them out as characters. Are you trying to obfuscate your values or just pack them to be easily transferred?
When I'm compacting a lot of values into one, reversible String, I usually go with base 64 conversion. This can really cut off quite a lot of the length from a String, but note that it may take up just as much memory in representing it.
Example
This number in decimal:
10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
is the following in Base 64:
Yki8xQRRVqd403ldXJUT8Ungkh/A3Th2TMtNlpwLPYVgct2eE8MAn0bs4o/fv1bmo4oUNQa/9WtZ8gRE7IG+UHX+LniaQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Why you can't do this too an extreme level
Think about it for a second. Let's say you've got a number of length 10. And you want to represent that number with 5 characters, so a 50% rate compression scheme. First, we work out how many possible numbers you can represent with 10 digits.. which is..
2^10 = 1024
Okay, that's fine. How many numbers can we express with 5 digits:
2^5 = 32
So, you can only display 32 different numbers with 5 bits, whereas you can display 1024 numbers with 10 bits. For compression to work, there needs to be some mapping between the compressed value and the extracted value. Let's try and make that mapping happen..
Normal - Compressed
0 0
1 1
2 2
.. ...
31 31
32 ??
33 ??
34 ??
... ...
1023 ??
There is no mapping for most of the numbers that can be represented by the expanded value.
This is known as the Pigeonhole Principle and in this example our value for n is greater than our value for m, hence we need to map values from our compressed values to more than one normal value, which makes things incredibly complex. (thankyou Oli for reminding me).
You need to be much more descriptive about what you mean by "string" and "~10^8". Can your "string" contain any sequence of bytes? Or is it restricted to a subset of possible bytes? If so, how exactly is it restricted? What are the limits on your "large numbers"? What do they represent?
Numbers up to 108 can be represented in 27 bits. 20 of them would be 540 bits, which could be stored in a string of 68 bytes, if any sequence of bytes is permitted. If the contents of a string are limited, it will take more bits. If your range of numbers is larger, it will take more bits.
store all numbers as strings to a marisa trie: https://code.google.com/p/marisa-trie/
Base64 the resulting trie dictionary
It depends of course a lot on your input. But it is a possibility to build a (very) compact representation this way.

Why Huffman Coding is good?

I am not asking how Huffman coding is working, but instead, I want to know why it is good.
I have the following two questions:
Q1
I understand the ultimate purpose of Huffman coding is to give certain char a less bit number, so space is saved. What I don't understand is that why the decision of number of bits for a char can be related to the char's frequency?
Huffman Encoding Trees says
It is sometimes advantageous to use variable-length codes, in which
different symbols may be represented by different numbers of bits. For
example, Morse code does not use the same number of dots and dashes
for each letter of the alphabet. In particular, E, the most frequent
letter, is represented by a single dot.
So in Morse code, E can be represented by a single dot because it is the most frequent letter. But why? Why can it be a dot just because it is most frequent?
Q2
Why the probability / statistics of the chars are so important to Huffman coding?
What happen if the statistics table is wrong?
If you assign less number or bits or shorter code words for most frequently used symbols you will be saving a lot of storage space.
Suppose you want to assign 26 unique codes to English alphabet and want to store an english novel ( only letters ) in term of these code you will require less memory if you assign short length codes to most frequently occurring characters.
You might have observed that postal code and STD codes for important cities are usually shorter ( as they are used very often ). This is very fundamental concept in Information theory.
Huffman encoding gives prefix codes.
Construction of Huffman tree:
A greedy approach to construct Huffman tree for n characters is as follows:
places n characters in n sub-trees.
Starts by combining the two least weight nodes into a tree which is assigned the sum of the two leaf node weights as the weight for its root node.
Do this until you get a single tree.
For example consider below binary tree where E and T have high weights ( as very high occurrence )
It is a prefix tree. To get the Huffman code for any character, start from the node corresponding to the the character and backtrack till you get the root node.
Indeed, an E could be, say, three dashes followed by two dots. When you make your own encoding, you get to decide. If your goal is to encode a certain text so that the result is as short as possible, you should choose short codes for the most frequent characters. The Huffman algorithm ensures that we get the optimal codes for a specific text.
If the frequency table is somehow wrong, the Huffman algorithm will still give you a valid encoding, but the encoded text would be longer than it could have been if you had used a correct frequency table. This is usually not a problem, because we usually create the frequency table based on the actual text that is to be encoded, so the frequency table will be "perfect" for the text that we are going to encode.
well.. you want assign shorter codes to the symbols which appear more frequently... huffman encoding works just by this simple assumption.. :-)
you compute the frequency of all symbols, sort them all, and start assigning bit codes to each one.. the more frequent a symbol is, the shorter the code you'll assign to it.. simple as this.
the big question is: how large the window in which we compute such frequencies should be? should it be as large as the entire file? or should it be smaller? and if the latter apply, how large? Most huffman encoding have some sort of "test-run" in which they estimate the best window size a little bit like TCP/IP do with its windows frame sizes.
Huffman codes provide two benefits:
they are space efficient given some corpus
they are prefix codes
Given some set of documents for instance, encoding those documents as Huffman codes is the most space efficient way of encoding them, thus saving space. This however only applies to that set of documents as the codes you end up are dependent on the probability of the tokens/symbols in the original set of documents. The statistics are important because the symbols with the highest probability (frequency) are given the shortest codes. Thus the symbols most likely to be in your data use the least amount of bits in the encoding, making the coding efficient.
The prefix code part is useful because it means that no code is the prefix of another. In morse code for instance A = dot dash and J = dot dash dash dash, how do you know where to break reading the code. This increases the inefficiency of transmitting data using morse as you need a special symbol (pause) to signify the end of transmission of one code. Compare that to Huffman codes where each code is unique, as soon as you discover the encoding for a symbol in the input, you know that that is the transmitted symbol because it is guaranteed not to be the prefix of some other symbol.
It's the dual effect of having the most frequent characters using the shortest bit sequences that gives you the savings.
For a concrete example, let's say you have a piece of text that consists of 1024 e characters and 1024 of all other characters combined.
With 8 bits for code, that's a full 2048 bytes used in uncompressed form.
Now let's say we represent e as a single 1-bit and every other letter as a 0-bit followed by its original 8 bits (a very primitive form of Huffman).
You can see that half the characters have been expanded from 8 bits to 9, giving 9216 bits, or 1152 bytes. However, the e characters have been reduced from 8 bits to 1, meaning they take up 1024 bits, or 128 bytes.
The total bytes used is therefore 1152 + 128, or 1280 bytes, representing a compression ratio of 62.5%.
You can use a fixed encoding scheme based on the likely frequencies of characters (such as English text), or you can use adaptive Huffman encoding which changes the encoding scheme as characters are processed and frequencies are adjusted. While the former may be okay for input which has high probability of matching frequencies, the latter can adapt to any input.
Statistic table can't be wrong, because in general Huffman algorithm, analyze hole text at the beginning, and builds frequent-statistics of the given text, while Morse has a static symbol -code map.
Huffman algorithm uses the advantage of a given text. As an example, if E is most frequent letter in English in general, that doesn't mean that E is most frequent in a given text for a given author.
Another advantage of Huffman algorithm is that you can use it for any alphabet starting from [0, 1] finished Chinese hieroglyphs, while Morse is defined only for English letters
So in Morse code, "E" can be represented by a single dot, because it is the most frequent letter. But why? Why is it a dot because of its frequency?
"E" can be encoded to any unique code for a specific code dictionary, so it can be "0", we choose it to be short to save memory, so the average bytes used after encode is minimized.
Why is the probability / statistics of the chars so important to Huffman coding? What happens if the statistics table is wrong?
why do we encode? save space right? Space used after encode is freq(wordi)*Length(wordi), it is what we should try to minimize, so we choose to assign words with high prob short code greedly to save space.
If the statistics table is wrong, then the encoding is not the best way to save space.

Encode an array of integers to a short string

Problem:
I want to compress an array of non-negative integers of non-fixed length (but it should be 300 to 400), containing mostly 0's, some 1's, a few 2's. Although unlikely, it is also possible to have bigger numbers.
For example, here is an array of 360 elements:
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,
0,0,4,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,5,2,0,0,0,
0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,1,2,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.
Goal:
The goal is to compress an array like this, into a shortest possible encoding using letters and numbers. Ideally, something like: sd58x7y
What I've tried:
I tried to use "delta encoding", and use zeroes to denote any value higher than 1. For example: {0,0,1,0,0,0,2,0,1} would be denoted as: 2,3,0,1. To decode it, one would read from left to right, and write down "2 zeroes, one, 3 zeroes, one, 0 zeroes, one (this would add to the previous one, and thus have a two), 1 zero, one".
To eliminate the need of delimiters (commas) and thus saves more space, I tried to use only one alphanumerical character to denote delta values of 0 to 35 (using 0 to y), while leaving letter z as "35 PLUS the next character". I think this is called "variable bit" or something like that. For example, if there are 40 zeroes in a row, I'd encode it as "z5".
That's as far as I got... the resultant string is still very long (it would be about 20 characters long in the above example). I would ideally want something like, 8 characters or even shorter. Thanks for your time; any help or inspiration would be greatly appreciated!
Since your example contains long runs of zeroes, your first step (which it appears you have already taken) could be to use run-lenth encoding (RLE) to compress them. The output from this step would be a list of integers, starting with a run-length count of zeroes, then alternating between that and the non-zero values. (a zero-run-length of 0 will indicate successive non-zero values...)
Second, you can encode your integers in a small number of bits, using a class of methods called universal codes. These methods generally compress small integers using a smaller number of bits than larger integers, and also provide the ability to encode integers of any size (which is pretty spiffy...). You can tune the encoding to improve compression based on the exact distribution you expect.
You may also want to look into how JPEG-style encoding works. After DCT and quantization, the JPEG entropy encoding problem seems similar to yours.
Finally, if you want to go for maximum compression, you might want to look up arithmetic encoding, which can compress your data arbitrarily close to the statistical minimum entropy.
The above links explain how to compress to a stream of raw bits. In order to convert them to a string of letters and numbers, you will need to add another encoding step, which converts the raw bits to such a string. As one commenter points out, you may want to look into base64 representation; or (for maximum efficiency with whatever alphabet is available) you could try using arithmetic compression "in reverse".
Additional notes on compression in general: the "shortest possible encoding" depends greatly on the exact properties of your data source. Effectively, any given compression technique describes a statistical model of the kind of data it compresses best.
Also, once you set up an encoding based on the kind of data you expect, if you try to use it on data unlike the kind you expect, the result may be an expansion, rather than a compression. You can limit this expansion by providing an alternative, uncompressed format, to be used in such cases...
In your data you have:
14 1s (3.89% of data)
4 2s (1.11%)
1 3s, 4s and 5s (0.28%)
339 0s (94.17%)
Assuming that your numbers are not independent of each other and you do not have any other information, the total entropy of your data is 0.407 bits per number, that is 146.4212 bits overall (18.3 bytes). So it is impossible to encode in 8 bytes.

Resources