performance of lz77 compression on large random array

I have a large array of raw bytes. The byte array does not follow any pattern and the byte values can be anything (random). The size of the array is 45000x5x400 bytes, and I have 1 second to encode. Will LZ77 be feasible? How should I choose the lookup dictionary? And what would be a good choice for the number of bits to hold the relative index and the match length?

Related

how does rabin-karp choose breakpoint in variable-length chunking?

I understand the rabin-karp algo and its usage in string searching. What I don't quite understand is how it can dynamically slice a file into variable-length chunks.
It's said to calculate the hash of a small window of data bytes (ex: 48 bytes) at every single byte offset, and the chunk boundaries—called breakpoints—are whenever the last N (ex: 13) bits of the hash are zero. This gives you an average block size of 2^N = 2^13 = 8192 = 8 KB.
Questions:
Does the rabin-karp rolling hash start from the first 48 bytes and then roll forward one byte at a time?
If so, is that too much to calculate for a large file, even with a simple hash function?
Given unpredictable data, how is it possible for the last N bits of the hash to be zero within the large chunk size limit?
Yes, the sliding window is fixed-size and moves forward byte by byte.
The hashing is O(n) over the whole file: each step only adds the incoming byte (with a shift/multiply) and subtracts the contribution of the byte leaving the window, which is the core idea of the Rabin rolling hash.
It depends on the hash function, actually; the distribution of the chunk sizes may differ. To reduce chunk-size variability, the Two Thresholds, Two Divisors (TTTD) algorithm was proposed. You can also find more recent advances in academic research papers.
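For reference, here is a minimal sketch of the scheme described above, using a simple polynomial rolling hash. The window size and 13-bit mask are the illustrative values from the question; the base and modulus are assumptions, not taken from any particular implementation.

WINDOW = 48            # bytes hashed at each offset
PRIME = 31             # polynomial base (assumption)
MOD = 1 << 61          # keep the hash bounded
MASK = (1 << 13) - 1   # breakpoint when the last 13 bits are zero -> ~8 KB average chunks

def chunk_boundaries(data: bytes):
    """Yield the end offset of each chunk in data."""
    top = pow(PRIME, WINDOW - 1, MOD)     # weight of the byte leaving the window
    h = 0
    for b in data[:WINDOW]:               # hash of the first window
        h = (h * PRIME + b) % MOD
    for i in range(WINDOW, len(data)):
        if (h & MASK) == 0:               # last 13 bits zero -> breakpoint
            yield i
        # Roll forward one byte: drop data[i - WINDOW], append data[i]. O(1) per step.
        h = ((h - data[i - WINDOW] * top) * PRIME + data[i]) % MOD
    yield len(data)                       # final (possibly short) chunk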

How to compress an array of random positive integers in a certain range?

I want to compress an array of about 10^5 random integers in the range 0 to 2^15. The integers are unsorted and I need to compress them losslessly.
I don't care much about the amount of computation and time needed to run the algorithm; I just want the best compression ratio I can get.
Are there any suggested algorithms for this?
Assuming you don't need to preserve the original order, pass the counts instead of the numbers themselves. If the values are roughly uniformly distributed, you can expect each number to appear about 3 times on average (10^5 / 2^15 ≈ 3). With 3 bits per count we can count up to 7, so you can build an array of 2^15 x 3 bits and store the count of each value in its 3-bit slot. To handle the rare values that occur more than 7 times, also send a list of those values with their exact counts; after reading the 3-bit array, overwrite the entries covered by that list.
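A rough sketch of this counting scheme (order is not preserved; the 3-bit layout and function name are illustrative assumptions):

from collections import Counter

def encode_counts(values):
    """values: iterable of ints in [0, 2**15). Returns (packed 3-bit counts, overflow list)."""
    counts = Counter(values)
    packed = bytearray(2**15 * 3 // 8)        # 3 bits per possible value = 12 KB
    overflow = []                             # (value, exact count) for counts > 7
    for v in range(2**15):
        c = counts.get(v, 0)
        if c > 7:
            overflow.append((v, c))
            c = 7                             # cap at 7; real count sent separately
        bit = v * 3
        for k in range(3):                    # write the 3-bit count, LSB first
            if (c >> k) & 1:
                packed[(bit + k) // 8] |= 1 << ((bit + k) % 8)
    return bytes(packed), overflow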
For your exact example: just encode each number as a 15-bit unsigned int and apply bit packing. This is optimal, since you have stated that each integer is uniformly random in [0, 2^15), and the Shannon entropy of that distribution is 15 bits.
For a more general solution, apply Quantile Compression (https://github.com/mwlon/quantile-compression/). It takes advantage of any smooth-ish data and compresses near-optimally on shuffled data. It works by encoding each integer with a Huffman code for its coarse range in the distribution, then an exact offset within that range.
These approaches are both computationally cheap, but more compute won't get you further in this case.
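For illustration, a minimal sketch of the 15-bit packing mentioned above (function names are made up; every value is assumed to lie in [0, 2**15)):

def pack_15bit(values):
    buf, acc, nbits = bytearray(), 0, 0
    for v in values:
        acc |= v << nbits            # append 15 bits to the accumulator
        nbits += 15
        while nbits >= 8:            # flush whole bytes
            buf.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        buf.append(acc & 0xFF)       # flush the final partial byte
    return bytes(buf)

def unpack_15bit(buf, count):
    acc, nbits, out, it = 0, 0, [], iter(buf)
    for _ in range(count):
        while nbits < 15:            # refill until 15 bits are available
            acc |= next(it) << nbits
            nbits += 8
        out.append(acc & 0x7FFF)
        acc >>= 15
        nbits -= 15
    return out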

Data structure for iterating over subsets of a collection

This is a general question about algorithms/data structures.
There is no specific programming language.
I'm working with arrays of boolean values.
The size of the arrays is always 50.
I want to have a collection of these arrays.
I will need to iterate over my collection multiple times.
In order to increase performance, I would like to limit each iteration to a subset of the collection, rather than the whole collection.
For instance: to iterate only over the arrays that have FALSE in the 4th and 13th positions.
I will NOT need to search for TRUE values. Only for FALSE values in certain positions of the array.
Note that the possible subsets can share elements without one being included in the other.
Is there any kind of data structure that could help me?
There is a short piece on this in Knuth Volume III "Sorting and Searching" section 6.5. There are a few simple schemes laid out there with somewhat limited effectiveness, which suggests to me that there are no magic answers to this hard problem of searching multi-dimensional spaces.
One approach described there (with the parameters adjusted for your problem) is to divide your 50 bits into five 10-bit chunks and create five hash tables, each mapping a value for one of those chunks to a list of all the records that have that value in that chunk. Given known values for particular bit positions, choose whichever 10-bit chunk contains the most known bits and scan only the lists consistent with them: 512, 256, or 128 of the 1024 lists for that chunk, depending on whether you know 1, 2, or 3 of its bit positions.
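A rough sketch of this indexing scheme, assuming each 50-bit record is stored as an integer where bit i set means position i is TRUE (the names and query API are illustrative):

from collections import defaultdict

CHUNKS, CHUNK_BITS = 5, 10

def build_index(records):
    """records: list of 50-bit ints. One dict per 10-bit chunk: chunk value -> record ids."""
    index = [defaultdict(list) for _ in range(CHUNKS)]
    for rid, bits in enumerate(records):
        for c in range(CHUNKS):
            value = (bits >> (c * CHUNK_BITS)) & ((1 << CHUNK_BITS) - 1)
            index[c][value].append(rid)
    return index

def query_false(index, records, positions):
    """Return ids of records that are FALSE (0) at every position in positions."""
    # Pick the chunk that covers the most constrained positions.
    per_chunk = defaultdict(list)
    for p in positions:
        per_chunk[p // CHUNK_BITS].append(p % CHUNK_BITS)
    best = max(per_chunk, key=lambda c: len(per_chunk[c]))
    forbidden = sum(1 << off for off in per_chunk[best])
    # Only lists whose chunk value has 0 at those offsets can contain matches
    # (2^(10 - k) of the 1024 lists, where k = number of known bits in that chunk).
    candidates = [rid for value, rids in index[best].items()
                  if value & forbidden == 0 for rid in rids]
    # Final check against all the required positions.
    mask = sum(1 << p for p in positions)
    return [rid for rid in candidates if records[rid] & mask == 0]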

How to generate 256-bit random number within a range on embedded system

I need to generate a cryptographically secure random number that is 256 bits long and lies in a specific range. I use a microcontroller equipped with a random number generator (the manufacturer claims it is a true random number generator based on thermal noise).
The upper limit of the number to be generated is given as a byte array. My question is: would it be secure to build the random number byte by byte, performing:
n[i] = rand[i] mod limit[i]
where n[i] is the i'th byte of my number, and so on.
The standard method, using all the bits from the RNG, is:
number <- random()
while (number outside range)
    number <- random()
endwhile
return number
There are some tweaks possible if the required range is less than half the size of the RNG output, but I assume that is not the case here, since that would reduce the output size by one or more bits. Given that, the while loop will normally be entered only once or twice, if at all.
Comparing byte arrays is reasonably simple, and usually speedy providing you compare the most significant bytes first. If the most significant bytes differ, then there is no need to compare less significant bytes at all. We can tell that 7,###,###,### is larger than 5,###,###,### without knowing what digits the # stand for.
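A minimal sketch of that loop, with os.urandom standing in for the hardware TRNG (on the microcontroller you would read the RNG peripheral instead):

import os

def random_below(limit: bytes) -> bytes:
    """Uniform random byte string of the same length as limit, numerically below it (big-endian)."""
    while True:
        candidate = os.urandom(len(limit))    # stand-in for the hardware TRNG
        # For equal-length big-endian byte strings, lexicographic comparison
        # checks the most significant bytes first, exactly as described above.
        if candidate < limit:
            return candidate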

Best way to represent numbers of unbounded length?

What's the most space-efficient way to represent integers of unbounded length?
(The numbers range from zero to positive infinity.)
Some sample number inputs can be found here (each number is shown on its own line).
Is there a compression algorithm that is specialized in compressing numbers?
You've basically got two alternatives for variable-length integers:
Use 1 bit of every k as an end terminator. That's the way Google protobuf does it, for example (in their case, one bit from every byte, so there are 7 useful bits in every byte).
Output the bit length first, and then the bits. That's how ASN.1 works, except for OIDs, which are represented with option 1.
If the numbers can be really big, option 2 is better, although it's more complicated and you have to apply it recursively, since you may have to output the length of the length, and then the length, and then the number. A common technique is to use option 1 (bit markers) for the length field.
For smallish numbers, option 1 is better. Consider the case where most numbers would fit in 64 bits. The overhead of storing them at 7 bits per byte is 1/7; with eight bytes, you'd represent 56 bits. Using even the 7/8 representation for the length would also represent 56 bits in eight bytes: one length byte and seven data bytes. Any number shorter than 48 bits would do at least as well, and usually better, with the self-terminating code.
"Truly random numbers" of unbounded length are, on average, infinitely long, so that's probably not what you've got. More likely, you have some idea of the probability distribution of number sizes, and could choose between the above options.
Note that none of these options "compress" anything (except relative to the bloated ASCII-decimal format). The asymptote of (log n)/n is 0, so as the numbers get bigger, the size of the size of the number tends to occupy no (relative) space. But the number still has to be represented somehow, so the total representation will always be a bit bigger than log2 of the number.
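For illustration, here is a minimal sketch of option 1 in the protobuf style (base-128 varints: 7 data bits per byte, with the high bit set on every byte except the last):

def encode_varint(n: int) -> bytes:
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # continuation bit: more bytes follow
        else:
            out.append(byte)          # high bit clear terminates the number
            return bytes(out)

def decode_varint(buf: bytes) -> int:
    n, shift = 0, 0
    for byte in buf:
        n |= (byte & 0x7F) << shift   # 7 payload bits per byte, least significant group first
        shift += 7
        if not byte & 0x80:
            break
    return n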
You cannot compress per se, but you can encode, which may be what you're looking for. You have files with sequences of ASCII decimal digits separated by line feeds. You should simply Huffman encode the characters. You won't do much better than about 3.5 bits per character.
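As an illustration of the ~3.5 bits per character figure, a small sketch that builds a Huffman code over the characters of such a file and reports the average code length (the function name is made up):

import heapq
from collections import Counter

def average_code_length(text: str) -> float:
    """Average Huffman code length, in bits per character, for the given text."""
    freq = Counter(text)
    # Heap entries: (total weight, tiebreaker, {symbol: code length so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol beneath them.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    _, _, lengths = heap[0]
    total = sum(freq.values())
    return sum(freq[s] * lengths[s] for s in freq) / total

# With digits 0-9 plus newlines and roughly uniform digit frequencies, this comes
# out near the ~3.5 bits/character figure above (the entropy bound is log2(11) ≈ 3.46).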
