I have a set of 1,000,000 unique numbers. The numbers are in the interval between 0 and 50,000,000. Assume the numbers are random. I need a data structure that holds them all, requires as little memory as possible, and makes it possible to quickly determine whether a number is in the set, with no errors.
I found a solution with a Bloom filter. Yes, a Bloom filter has a probability of false positives, but since there are "just" 50,000,000 possible numbers, I can find all the false positives and keep them in a std::set. With this method I'm able to store all the numbers in 2.3 MB of memory.
Can you find a better method?
Rather than one range of 0 to 50,000,000, how about 1,024 separate ranges of 65,536? That'd cover 1,024 * 65,536 = 67,108,864 values. I suppose you can make it 763 ranges rather than 1,024, which will give you 50,003,968.
Something like ushort[763][];
Now you're storing 1,000,000 16-bit values rather than 32-bit values.
The values in the rows are sorted. So to determine if a number is in the set, you divide the number by 65,536 to figure out which row to look in, and then do a binary search for number % 65,536 within that row.
Storage for the numbers themselves is 2,000,000 bytes. Plus a small amount of overhead for the arrays.
This will be faster in execution, smaller than your Bloom filter approach, no false positives, and a whole lot easier to implement.
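For concreteness, here is a minimal C++ sketch of that layout (class and function names are illustrative; the 763-row, 65,536-wide split is taken from the description above):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// 763 rows, each covering a block of 65,536 consecutive values.
// Row i holds (v % 65536) for every stored value v in [i*65536, (i+1)*65536).
class BlockedSet {
    std::vector<std::vector<uint16_t>> rows;
public:
    explicit BlockedSet(const std::vector<uint32_t>& values) : rows(763) {
        for (uint32_t v : values)
            rows[v / 65536].push_back(static_cast<uint16_t>(v % 65536));
        for (auto& row : rows)
            std::sort(row.begin(), row.end());      // enables per-row binary search
    }

    bool contains(uint32_t v) const {
        const auto& row = rows[v / 65536];
        return std::binary_search(row.begin(), row.end(),
                                  static_cast<uint16_t>(v % 65536));
    }
};
```

With 1,000,000 stored values this is the 2,000,000 bytes of payload mentioned above, plus a small amount of per-row vector overhead.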
The minimum space to store such a vector in general is 884,002 bytes. That stores an integer index (a very large integer) into the list of all possible choices of 1,000,000 out of 50,000,000.
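That 884,002-byte figure is just the number of bits needed to index one of the possible 1,000,000-element subsets of a 50,000,000-element universe; roughly (my arithmetic, for reference):

```latex
\left\lceil \log_2 \binom{50{,}000{,}000}{1{,}000{,}000} \right\rceil
  \approx 7{,}072{,}000\ \text{bits} \approx 884{,}002\ \text{bytes}
```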
You can get close to that with a simple, fast byte coding. Given the sorted list of numbers, replace each number with the difference from the previous number (assume that -1 precedes the first number). The differences are all one or more, so subtract one. If the result is 254 or less, code it as one byte. Otherwise, write a 255 byte and follow it with two bytes holding the larger difference minus 255. If it doesn't fit in that, write three 255 bytes and follow them with three bytes holding the difference. This will almost always code the vector in less than 1,012,000 bytes.
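A rough C++ sketch of that gap coding, assuming the input is already sorted and distinct; where the description leaves the exact byte layout open, the choices noted in the comments are mine:

```cpp
#include <cstdint>
#include <vector>

// Byte-codes the gaps of a sorted list of distinct values, following the scheme
// described above. Layout choices that the description leaves open are mine:
// big-endian payloads, and a 0xFFFE cap on the two-byte payload so that the
// sequence 255,255,255 unambiguously starts the three-byte form.
std::vector<uint8_t> encodeGaps(const std::vector<uint32_t>& sorted) {
    std::vector<uint8_t> out;
    int64_t prev = -1;                                     // pretend -1 precedes the first number
    for (uint32_t x : sorted) {
        uint32_t v = static_cast<uint32_t>(x - prev - 1);  // gap minus one, >= 0
        prev = x;
        if (v <= 254) {
            out.push_back(static_cast<uint8_t>(v));        // common case: one byte
        } else if (v - 255 <= 0xFFFE) {
            uint32_t w = v - 255;
            out.push_back(255);                            // escape marker
            out.push_back(static_cast<uint8_t>(w >> 8));
            out.push_back(static_cast<uint8_t>(w & 0xFF)); // two-byte payload
        } else {
            // Three 255 markers, then the value in three bytes. For 1,000,000
            // random values in 0..50,000,000 the gaps average about 50, so this
            // branch is essentially never taken; no further levels are handled.
            out.insert(out.end(), {255, 255, 255});
            out.push_back(static_cast<uint8_t>((v >> 16) & 0xFF));
            out.push_back(static_cast<uint8_t>((v >> 8) & 0xFF));
            out.push_back(static_cast<uint8_t>(v & 0xFF));
        }
    }
    return out;
}
```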
Related
A lookup table has a total of 4G entries; each entry is an arbitrary 32-bit number, and no value ever repeats.
Is there an algorithm that can use each entry's index together with its 32-bit value so that a bit at a fixed position of the stored value is always zero (so I can use that bit as a flag to log something), while still being able to recover the original 32-bit number by a reverse calculation?
Or, stepping back: could I at least make a fixed-position bit of every two consecutive entries always zero?
In short: is there a universal code that can save one bit from each arbitrary 32-bit value, so I can use that bit as a lock flag? Alternatively, is there a calculation combining a table entry's index and its value that saves one bit of storage per value?
It is not at all clear what you are asking. However I can perhaps find one thing in there that can be addressed, if I am reading it correctly, which is that you have a permutation of all of the integers in 0..2^32-1. Such a permutation can be represented in fewer bits than the direct representation, which takes 32*2^32 bits. With a perfect representation of the permutations, each would be ceiling(log2((2^32)!)) bits, since there are (2^32)! possible permutations. That length turns out to be about 95.5% of the bits in the direct representation. So each permutation could be represented in about 30.6*2^32 bits, effectively taking off more than one bit per word.
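The arithmetic behind that 95.5% figure, using Stirling's approximation (my own expansion, for reference):

```latex
\log_2\!\bigl((2^{32})!\bigr) \approx 2^{32}\bigl(\log_2 2^{32} - \log_2 e\bigr)
  = 2^{32}\,(32 - 1.4427) \approx 30.56 \cdot 2^{32}\ \text{bits},
\qquad \frac{30.56}{32} \approx 95.5\%.
```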
I'm writing a compression algorithm (mostly for fun) in C, and I need to be able to store a list of numbers in binary. Each element of this list will be in the form of two digits, both under 10 (like (5,5), (3,6), (9,2)). I'll potentially be storing thousands of these pairs (one pair is made for each character in a string in my compression algorithm).
Obviously the simplest way to do this would be to concatenate each pair (-> 55, 36, 92) to make a 2-digit number (since they're just one digit each), then store each pair as a 7-bit number (since 99 is the highest). Unfortunately, this isn't so space-efficient (7 bits per pair).
Then I thought perhaps if I concatenate each pair, then concatenate that (553692), I'd be able to then store that as a plain number in binary form (10000111001011011100, which for three pairs is already smaller than storing each number separately), and keep a quantifier for the number of bits used for the binary number. The only problem is, this approach requires a bigint library and could be potentially slow because of that. As the number gets bigger and bigger (+2 digits per character in the string) the memory usage and slowdown would get bigger and bigger as well.
So here's my question: is there a more storage-efficient way to store a list of numbers like this, or should I just go with the bignum or 7-bit approach?
The information-theoretic minimum for storing 100 different values is log2(100), which is about 6.644 bits. In other words, the possible compression from 7 bits is a hair more than 5%. (log2(100) / 7 is about 94.91%.)
If these pairs are simply for temporary storage during the algorithm, then it's almost certainly not worth going to a lot of effort to save 5% of storage, even if you managed to do that.
If the pairs form part of your compressed output, then your compression cannot be great (a character is only eight bits, and presumably the pairs are additional to any compressed character data). Nonetheless, the easy compression technique is to store up to 6 pairs in 40 bits (5 bytes), which can be done without a bigint package assuming a 64-bit machine. (Alternatively, store up to 3 pairs in 20 bits and then pack two 20-bit sequences into five bytes.) That gives you 99.66% of the maximum compression for the values.
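For concreteness, a small C++ sketch of the 6-pairs-in-40-bits idea on a 64-bit machine (function names and the base-100 packing order are illustrative):

```cpp
#include <cstdint>

// Pack six values in 0..99 (each pair already combined as d1*10 + d2)
// into the low 40 bits of a 64-bit word, treating them as a base-100 number.
// 100^6 = 10^12 < 2^40, so six pairs always fit.
uint64_t pack6(const uint8_t pairs[6]) {
    uint64_t packed = 0;
    for (int i = 0; i < 6; ++i)
        packed = packed * 100 + pairs[i];
    return packed;                       // only the low 40 bits are ever used
}

void unpack6(uint64_t packed, uint8_t pairs[6]) {
    for (int i = 5; i >= 0; --i) {
        pairs[i] = static_cast<uint8_t>(packed % 100);
        packed /= 100;
    }
}
```

Writing only the low 5 bytes (40 bits) of each packed word to the output stream is what yields the 99.66% figure quoted above.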
All of the above assumes that the 100 possible values are equally likely. If the distribution is not even and it is possible to predict the frequencies, then you can use Huffman coding to improve compression. Even so, I wouldn't recommend it for temporary storage.
I faced this problem in a recent interview:
You have a stream of incoming numbers in the range 0 to 60,000, and you have a function that takes a number from that range and returns the count of occurrences of that number up to that moment. Give a suitable data structure/algorithm to implement this system.
My solution is:
Make an array of size 60,001 whose entries point to bit vectors. Each bit vector holds the count for the corresponding number, and the incoming number itself is used as the index into the array. A bit vector grows dynamically as its count gets too big to hold.
So, if the numbers arrive at a rate of 100 numbers/sec, then in 1 million years the total count will be (100 * 3600 * 24) * 365 * 1,000,000 ≈ 3.2 * 10^15. In the worst case, where every number in the stream is the same, a counter needs ceil(log(3.2 * 10^15) / log 2) = 52 bits; if the numbers are uniformly distributed, each number occurs about (3.2 * 10^15) / 60,001 ≈ 5.33 * 10^10 times, which needs 36 bits per counter.
So, assuming 4-byte pointers, the array needs (60,001 * 4) / 1024 ≈ 234 KB, and in the all-same-number case a single counter needs 52 / 8 = 6.5 bytes, which still leaves the total at around 234 KB. In the uniform case the counters need (60,001 * 36 / 8) / 1024 ≈ 263.7 KB, for a total of about 500 KB. So it is very feasible to do this on an ordinary PC.
But the interviewer said that, since the stream is infinite, the counters will eventually overflow, and gave hints such as: how would you do this if there were many PCs and you could pass messages between them, or if you could use the file system, etc. But I kept thinking that if this solution didn't work, the others wouldn't either. Needless to say, I did not get the job.
How can this problem be solved with less memory? Can you think of an alternative approach (using a network of PCs, maybe)?
A formal model for the problem could be the following.
We want to know whether there exists a constant-space-bounded Turing machine such that, at any given time, it recognizes the language L of all pairs (number, number of occurrences so far). This means that all correct pairs are accepted and all incorrect pairs are rejected.
As a corollary of Theorem 3.13 in Hopcroft-Ullman, we know that every language recognized by a constant-space-bounded machine is regular.
It can be proven by using the pumping lemma for regular languages that the language described above is not a regular language. So you can't recognize it with a constant space bounded machine.
You can easily use index-based lookup with an array such as int arr[60001] (one slot for each value in the range 0 to 60,000). Whenever you get a number, say 5000, access arr[5000] directly and increment it; whenever you want to know how many times a particular number has occurred, just read arr[num] to get the count. It's the simplest possible implementation, with constant time per operation.
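A minimal sketch of that counter array in C++ (names are illustrative; a 64-bit counter is used, which sidesteps overflow for any realistic stream, though of course not for a truly infinite one):

```cpp
#include <cstdint>
#include <vector>

class StreamCounter {
    std::vector<uint64_t> counts;                        // one counter per value in 0..60000
public:
    StreamCounter() : counts(60001, 0) {}

    void onNumber(uint32_t n) { ++counts[n]; }           // O(1) per arriving number

    uint64_t occurrencesSoFar(uint32_t n) const {        // O(1) per query
        return counts[n];
    }
};
```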
Isn't this external sorting? Store the infinite stream in a file. Do a seek() (RandomAccessFile.seek() in Java) in the file to get to the appropriate timestamp; this is similar to binary search, since the data is sorted by timestamp. Once you reach the appropriate timestamp, the problem turns into counting a particular number in the portion of the stream seen so far. Here, instead of doing a quicksort in memory, a counting sort can be used, since the range of numbers is limited.
Sort at most 10 million 7-digit numbers. Constraints: 1 MB of RAM, high speed; a few seconds is good.
[Edit: from comment by questioner: the input values are distinct]
Using a bitmap data structure is a good solution for this problem.
Does that mean I need a bit string whose length is at most 10 million?
And is 1 MB of RAM enough for that?
I'm confused here.
Thank you
So, there are ~8,000,000 bits in 1 MB, but if you have arbitrary 7-digit numbers (up to 9,999,999), a single bit vector covering the whole range won't fit, so it can't do the sort in one pass. Similarly, it won't work if some numbers can be repeated, because a bit vector can only store {0, 1} for each value.
But assuming (which is what I think your problem is asking) that you have a sequence of integers between 0 and 8,000,000 with no duplicates, you can simply allocate a zeroed array of 8,000,000 bits and then, for each number, set the corresponding bit in the array. Outputting the sorted list is then simply reading through that array in sequence and outputting the index of each 1 bit.
If you are asking the more complex version of the question (0 - 10 million, repeats allowed), then you will need to sort chunks that fit in RAM, store them on disk, and then merge those chunks in linear time, streaming the output (so you never have to store the whole thing in memory). Here is an implementation of a very similar thing in Python: http://neopythonic.blogspot.com/2008/10/sorting-million-32-bit-integers-in-2mb.html
Start with a bit array representing the lowest 8 million possible values. Read through the input and set a bit for every value within that range, then output all the numbers for the turned-on bits in sequence. Next, clear the first 2 million bits of the array so that it can represent the highest 2 million possible values (8,000,000 through 9,999,999). Read through the input again, set a bit for every value in the new range, and output all the values in this range. Done.
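A sketch of that two-pass approach in C++, assuming the input holds distinct values below 10,000,000 and can be read twice (the 8M/2M split follows the description above; the I/O details are illustrative):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Two-pass bitmap sort for distinct values in [0, 10,000,000).
// Pass 1 covers [0, 8,000,000); pass 2 reuses 2,000,000 bits of the
// same array for [8,000,000, 10,000,000).
void bitmapSort(FILE* in, FILE* out) {
    std::vector<bool> bits(8'000'000);                 // packed: ~1,000,000 bytes

    const uint32_t ranges[][2] = {{0, 8'000'000}, {8'000'000, 10'000'000}};
    for (const auto& r : ranges) {
        std::fill(bits.begin(), bits.end(), false);
        std::rewind(in);                               // re-read the input each pass
        uint32_t v;
        while (std::fscanf(in, "%u", &v) == 1)
            if (v >= r[0] && v < r[1])
                bits[v - r[0]] = true;                 // mark values in this range
        for (uint32_t i = r[0]; i < r[1]; ++i)
            if (bits[i - r[0]])
                std::fprintf(out, "%u\n", i);          // emit in sorted order
    }
}
```

std::vector<bool> is used because it packs the flags into roughly 1,000,000 bytes, which is what makes the 1 MB budget workable.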
Specifically, what programs are out there, and which one has the highest compression ratio? I tried Googling it, but it seems experience would trump search results, so I'm asking here.
If file sizes could be specified accurate to the bit, for any file size N, there would be precisely 2^(N+1)-1 possible files of N bits or smaller. In order for a file of size X to be mapped to some smaller size Y, some file of size Y or smaller must be mapped to a file of size X or larger. The only way lossless compression can work is if some possible files can be identified as being more probable than others; in that scenario, the likely files will be shrunk and the unlikely ones will grow.
As a simple example, suppose that one wishes to store losslessly a file in which the bits are random and independent, but instead of 50% of the bits being set, only 33% are. One could compress such a file by taking each pair of bits and writing "0" if both bits were clear, "10" if the first bit was set and the second one not, "110" if the second was set and the first not, or "111" if both bits were set. The effect would be that each pair of bits would become one bit 44% of the time, two bits 22% of the time, and three bits 33% of the time. While some strings of data would grow, others would shrink; the pairs that shrink would--if the probability distribution was as expected--outnumber those that grow (4/9 of the pairs would shrink by a bit, 2/9 would stay the same, and 3/9 would grow, so pairs would on average shrink by 1/9 bit, and files would on average shrink by a factor of 1/18 [since the 1/9 figure was per two-bit pair]).
Note that if the bits actually had a 50% distribution, then 25% of pairs would become one bit, 25% would stay two bits, and 50% would become three bits. Consequently, 25% of pairs would shrink by a bit and 50% would grow by a bit, so a pair would on average grow by a quarter of a bit, and files would grow by 12.5%. The break-even point would be about 38.2% of bits being set (two minus the golden mean), which would yield 38.2% of bit pairs shrinking and the same percentage growing.
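As a quick check of those percentages (my own, not part of the original answer), here is the expected output length per 2-bit input pair as a function of the probability p that a bit is set:

```cpp
#include <cmath>
#include <cstdio>

// Expected output bits per 2-bit input pair for the code
//   00 -> "0", 10 -> "10", 01 -> "110", 11 -> "111",
// when each input bit is set independently with probability p.
double expectedBitsPerPair(double p) {
    double q = 1.0 - p;
    return 1 * q * q      // both clear
         + 2 * p * q      // first set, second clear
         + 3 * q * p      // second set, first clear
         + 3 * p * p;     // both set
}

int main() {
    std::printf("p=1/3:   %.4f bits/pair\n", expectedBitsPerPair(1.0 / 3.0));  // ~1.889: shrinks by 1/9 bit
    std::printf("p=0.5:   %.4f bits/pair\n", expectedBitsPerPair(0.5));        // 2.25: grows by 12.5%
    std::printf("p=0.382: %.4f bits/pair\n",
                expectedBitsPerPair((3.0 - std::sqrt(5.0)) / 2.0));            // ~2.0: break-even
}
```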
There is no one universally best compression algorithm. Different algorithms have been invented to handle different data.
For example, JPEG compression allows you to compress images quite a lot because it doesn't matter too much if the red in your image is 0xFF or 0xFE (usually). However, if you tried to compress a text document, changes like this would be disastrous.
Also, even between two compression algorithms designed to work with the same kind of data, your results will vary depending on your data.
Example: Sometimes using a gzip tarball is smaller, and sometimes using a bzip tarball is smaller.
Lastly, for truly random data of sufficient length, the compressed output will likely be almost the same size as (or even larger than) the original data.