I have a large sequence of random integers sorted from lowest to highest. The numbers range from 1 bit up to nearly 45 bits in length. At the beginning of the list the numbers are very close to each other: 4, 20, 23, 40, 66. But as the numbers get higher the gaps between them grow as well (in fact, the gap sizes are random). There are no duplicated numbers.
I'm using bit packing to save some space. Nonetheless, this file can get really big.
I would like to know what kind of compression algorithm can be used in this situation, or any other technique to save as much space as possible.
Thank you.
You can compress optimally if you know the true distribution of the data. If you can provide a probability distribution for each integer you can use arithmetic coding or other entropy coding techniques to compress to theoretical minimal size.
The trick is in predicting accurately.
First, you should probably compress the distances between the numbers because that allows you to make statistical statements. If you were to compress the numbers directly you'd have a hard time modelling them because they occur only once.
Next, you could try to build a very simple model to predict the next distance. Keep a histogram of all previously seen distances and calculate the probabilities from the frequencies.
You probably need to account for missing values (you clearly can't assign them 0 probability because that is not expressible) but you can use heuristics for that, like encoding the next distance bit-by-bit and predicting each bit individually. You will pay almost nothing for the high-order bits because they are almost always 0 and entropy encoding optimizes them away.
All of this is much simpler if you know the distribution. Example: if you are compressing a list of all prime numbers, you know the theoretical distribution of distances because there are formulae for that, so you already have a perfect model.
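To make the histogram idea concrete, here is a minimal sketch (my own illustration, not part of the answer) that measures the empirical entropy of the gap distribution; this is the per-gap size an ideal entropy coder could approach, ignoring the cost of transmitting or adapting the model. The sample data is hypothetical.

import collections, math, random

values = sorted(random.sample(range(1 << 45), 100000))   # hypothetical input
deltas = [b - a for a, b in zip(values, values[1:])]

hist = collections.Counter(deltas)
total = len(deltas)
# Shannon cost: sum over all gaps of -log2(p(gap)); an arithmetic coder
# driven by this histogram would get close to this figure.
bits = sum(-count * math.log2(count / total) for count in hist.values())
print("%.2f bits per gap on average" % (bits / total))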
There's a very simple and fairly effective compression technique which can be used for sorted integers in a known range. Like most compression schemes, it is optimized for serial access, although you can build an index to speed up random access if needed.
It's a type of delta encoding (i.e. each number is represented by the distance from the previous one), consisting of a vector of codes which are either
a single 1-bit, representing a delta of 2^k which is added to the delta in the following code, or
a 0-bit followed by a k-bit delta, indicating that the next number is the specified delta from the previous one.
For example, if k is 4, the sequence:
00011 1 1 00000 1 00001
codes three numbers. The first four-bit encoding (3) is the first delta, taken from an initial value of 0, so the first number is 3. The next two solitary 1's accumulate to a delta of 2·2^4, or 32, which is added to the following delta of 0000, for a total of 32. So the second number is 3+32=35. Finally, the last delta is a single 2^4 plus 1, total 17, and the third number is 35+17=52.
The 1-bit indicates that the next delta should be incremented by 2^k (or, more generally, each delta is incremented by 2^k times the number of immediately preceding 1-bits.)
Another, possibly better, way of thinking of this is that each delta is coded as a variable length bit sequence: 1^i 0 (1|0)^k, representing a delta of i·2^k + [the k-bit suffix]. But the first presentation aligns better with the optimality proof.
Since each "1" code represents an increment of 2^k, there cannot be more than m/2^k of them, where m is the largest number in the set to be compressed. The remaining codes all correspond to numbers, and have a total length of n·(k + 1) where n is the size of the set. The optimal value of k is roughly log2(m/n), which in your case would be 7 or 8.
I did a quick proof of concept of the algorithm, without worrying about optimizations. It's still plenty fast; sorting the random sample takes a lot longer than compressing/decompressing it. I tried it with a few different seeds and vector sizes from 16,400,000 to 31,000,000 with a value range of [0, 4,000,000,000). The bits used per data value ranged from 8.59 (n=31000000) to 9.45 (n=16400000). All of the tests were done with 7-bit suffixes; log2(m/n) varies from 7.01 (n=31000000) to 7.93 (n=16400000). I tried with 6-bit and 8-bit suffixes; except in the case of n=31000000 where the 6-bit suffixes were slightly smaller, the 7-bit suffix was always the best. So I guess that the optimal k is not exactly floor(log2(m/n)) but it's not far off.
Compression code:
#include <ostream>
#include <vector>

// BitOut is the bit-stream helper used in this answer (definition not shown);
// from its use here, put(value, nbits) writes nbits bits, put(bit) writes one
// bit, and flush(1) pads out the final byte.
void Compress(std::ostream& os,
              const std::vector<unsigned long>& v,
              unsigned long k = 0) {
  BitOut out(os);
  out.put(v.size(), 64);
  if (v.size()) {
    unsigned long twok;
    if (k == 0) {
      // Choose k from the average gap, roughly log2(max / n).
      unsigned long ratio = v.back() / v.size();
      for (twok = 1; twok <= ratio / 2; ++k, twok *= 2) { }
    } else {
      twok = 1UL << k;   // 1UL so the shift is done at full width
    }
    out.put(k, 32);
    unsigned long prev = 0;
    for (unsigned long val : v) {
      while (val - prev >= twok) { out.put(1); prev += twok; }
      out.put(0);
      out.put(val - prev, k);
      prev = val;
    }
  }
  out.flush(1);
}
Decompression:
std::vector<unsigned long> Decompress(std::istream& is) {
  BitIn in(is);
  unsigned long size = in.get(64);
  std::vector<unsigned long> v;   // declared here so it is in scope for the return
  if (size) {
    unsigned long k = in.get(32);
    unsigned long twok = 1UL << k;
    v.reserve(size);
    unsigned long prev = 0;
    for (; size; --size) {
      while (in.get()) prev += twok;
      prev += in.get(k);
      v.push_back(prev);
    }
  }
  return v;
}
It can be a bit awkward to use variable-length encodings; an alternative is to store the first bit of each code (1 or 0) in a bit vector, and the k-bit suffixes in a separate vector. This would be particularly convenient if k is 8.
A variant, which results in slightly longer files but is a bit easier to build indexes for, is to only use the 1-bits as deltas. Then the deltas are always a·2^k for some a, possibly 0, where a is the number of consecutive 1 bits preceding the suffix code. The index then consists of the locations of every Nth 1-bit in the bit vector, and the corresponding index into the suffix vector (i.e. the index of the suffix corresponding with the next 0 in the bit vector).
One option that worked well for me in the past was to store a list of 64-bit integers as 8 different lists of 8-bit values. You store the high 8 bits of the numbers, then the next 8 bits, etc. For example, say you have the following 32-bit numbers:
0x12345678
0x12349785
0x13111111
0x13444444
The data stored would be (in hex):
12,12,13,13
34,34,11,44
56,97,11,44
78,85,11,44
I then ran that through the deflate compressor.
I don't recall what compression ratios I was able to achieve with this, but it was significantly better than compressing the numbers themselves.
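A minimal sketch of that byte-plane transposition using zlib's deflate (my own illustration, not the original code); the four numbers are the ones from the example above, and with so few values the printed sizes only demonstrate the mechanics, not the achievable ratio.

import struct, zlib

nums = [0x12345678, 0x12349785, 0x13111111, 0x13444444]
packed = b"".join(struct.pack(">I", n) for n in nums)   # 4 bytes per number

# Transpose: all the high bytes first, then the next byte of every number, etc.
planes = bytes(packed[i] for b in range(4) for i in range(b, len(packed), 4))

print(len(zlib.compress(packed)), len(zlib.compress(planes)))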
I want to add another answer with the simplest possible solution:
Convert the numbers to deltas as discussed previously
Run it through the 7-zip LZMA2 algorithm. It is even multi-core ready
I think this will give almost perfect results in your case because the distances have a simple distribution. 7-zip will be able to pick it up.
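A small sketch of this pipeline, assuming Python is acceptable: the lzma module uses the same LZMA2 algorithm as 7-zip's .xz format. The sample values and the fixed 8-byte packing are illustrative only.

import lzma, struct

values = sorted({4, 20, 23, 40, 66, 2**44 + 17, 2**44 + 903})
deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
raw = b"".join(struct.pack("<Q", d) for d in deltas)   # 8 bytes per delta

compressed = lzma.compress(raw, preset=9)
print(len(raw), "->", len(compressed), "bytes")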
You can simply use delta encoding together with Protocol Buffers.
Like your example: 4, 20, 23, 40, 66.
Delta Encoding compressed: 4, 16, 3, 17, 26.
Then you store all the deltas as varints, the way Protocol Buffers does: a number between 0 and 127 needs only 1 byte, a number between 128 and 16383 needs 2 bytes, and so on. This is enough for most scenarios.
Furthermore, you can use entropy coding (Huffman) to achieve a better compression rate than varint, even less than 8 bits per number.
Divide each number into two parts. For example 17 = 10001 (binary) = (5)0001: the first part (5) is the significant bit count, and the suffix part (0001) is the number without its leading 1.
For the example: 4, 16, 3, 17, 26 = (3)00 (5)0000 (2)1 (5)0001 (5)1010
The first parts will all be between 0 and 45 even if there are a lot of numbers, so they can be compressed effectively by entropy coding such as Huffman.
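Here is a minimal varint encoder for the deltas (my own sketch of the wire format protobuf uses, not a call into a Protocol Buffers API):

def encode_varint(n):
    out = bytearray()
    while True:
        byte = n & 0x7F               # low 7 bits
        n >>= 7
        if n:
            out.append(byte | 0x80)   # set the continuation bit
        else:
            out.append(byte)
            return bytes(out)

deltas = [4, 16, 3, 17, 26]
encoded = b"".join(encode_varint(d) for d in deltas)
print(encoded.hex(), "-", len(encoded), "bytes for", len(deltas), "numbers")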
If your sequence is made up of pseudo-random numbers, such as might be generated by a typical digital computer, then I don't think that any compression scheme will beat, for brevity of representation, simply storing the code for the generator and whatever parameters you need to define its initial state.
If your sequence is made up of truly random numbers generated in some non-deterministic way then the other answers already posted offer a variety of good advice.
I have been given this interview question:
Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB memory. Follow up with what you would do if you have only 10 MB of memory.
My analysis:
The size of the file is 4×10^9 × 4 bytes = 16 GB.
We can do external sorting, thus letting us know the range of the integers.
My question is what is the best way to detect the missing integer in the sorted big integer sets?
My understanding (after reading all the answers):
Assuming we are talking about 32-bit integers, there are 2^32 ≈ 4×10^9 distinct integers.
Case 1: we have 1 GB = 10^9 bytes = 8×10^9 bits of memory.
Solution:
If we use one bit to represent each distinct integer, it is enough; we don't need to sort.
Implementation:
int radix = 8;
byte[] bitfield = new byte[(int) ((1L << 32) / radix)]; // 2^32 bits at 8 per byte = 512 MB

void F() throws FileNotFoundException {
    Scanner in = new Scanner(new FileReader("a.txt"));
    while (in.hasNextInt()) {
        int n = in.nextInt();
        bitfield[n/radix] |= (1 << (n%radix));
    }
    for (int i = 0; i < bitfield.length; i++) {
        for (int j = 0; j < radix; j++) {
            if ((bitfield[i] & (1<<j)) == 0) System.out.print(i*radix + j);
        }
    }
}
Case 2: 10 MB of memory = 10^7 bytes = 8×10^7 bits
Solution:
There are 2^16 = 65536 possible 16-bit prefixes, so we build 65536 buckets. Each bucket needs a 4-byte counter, because in the worst case all 4 billion integers fall into the same bucket; that is 2^16 × 4 × 8 ≈ 2 million bits in total.
Build the counter for each bucket during a first pass through the file.
Scan the buckets and find the first one that has fewer than 65536 hits.
Build new buckets for the low 16 bits of the numbers whose high 16-bit prefix is the one found in step 2, during a second pass of the file.
Scan the buckets built in step 3 and find the first bucket which doesn't have a hit.
The code is very similar to above one.
Conclusion:
We trade additional passes over the file for a decrease in memory.
A clarification for those arriving late: The question, as asked, does not say that there is exactly one integer that is not contained in the file—at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately the comment that introduced it to the comment thread was later deleted by its author, so now it looks like the orphaned replies to it just misunderstood everything. It's very confusing, sorry.
Assuming that "integer" means 32 bits: 10 MB of space is more than enough for you to count how many numbers there are in the input file with any given 16-bit prefix, for all possible 16-bit prefixes, in one pass through the input file. At least one of the buckets will have been hit fewer than 2^16 times. Do a second pass to find which of the possible numbers in that bucket are used already.
If it means more than 32 bits, but still of bounded size: Do as above, ignoring all input numbers that happen to fall outside the (signed or unsigned; your choice) 32-bit range.
If "integer" means mathematical integer: Read through the input once and keep track of the length of the longest number you've ever seen. When you're done, output a random number that has one more digit. (One of the numbers in the file may be a bignum that takes more than 10 MB to represent exactly, but if the input is a file, then you can at least represent the length of anything that fits in it.)
Statistically informed algorithms solve this problem using fewer passes than deterministic approaches.
If very large integers are allowed then one can generate a number that is likely to be unique in O(1) time. A pseudo-random 128-bit integer like a GUID will only collide with one of the existing four billion integers in the set in less than one out of every 64 billion billion billion cases.
If integers are limited to 32 bits then one can generate a number that is likely to be unique in a single pass using much less than 10 MB. The odds that a pseudo-random 32-bit integer will collide with one of the 4 billion existing integers is about 93% (4e9 / 2^32). The odds that 1000 pseudo-random integers will all collide is less than one in 12,000 billion billion billion (odds-of-one-collision ^ 1000). So if a program maintains a data structure containing 1000 pseudo-random candidates and iterates through the known integers, eliminating matches from the candidates, it is all but certain to find at least one integer that is not in the file.
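A small sketch of that candidate-elimination idea (my own illustration; the toy stream below stands in for the 4 billion integers):

import random

def find_absent(stream, num_candidates=1000):
    candidates = {random.getrandbits(32) for _ in range(num_candidates)}
    for n in stream:                  # single pass over the file/stream
        candidates.discard(n)
        if not candidates:            # astronomically unlikely with 1000 guesses
            return None
    return next(iter(candidates))     # any survivor is absent from the input

print(find_absent(range(3_000_000)))  # toy stand-in for the real input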
A detailed discussion of this problem can be found in Jon Bentley, "Column 1: Cracking the Oyster", Programming Pearls, Addison-Wesley, pp. 3-10.
Bentley discusses several approaches, including external sort, merge sort using several external files, etc., but the best method Bentley suggests is a single pass algorithm using bit fields, which he humorously calls "Wonder Sort" :)
Coming to the problem, 4 billion numbers can be represented in :
4 billion bits = (4000000000 / 8) bytes = about 0.466 GB
The code to implement the bitset is simple: (taken from solutions page )
#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];
void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i) { a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int test(int i){ return a[i>>SHIFT] & (1<<(i & MASK)); }
Bentley's algorithm makes a single pass over the file, setting the appropriate bit in the array and then examines this array using test macro above to find the missing number.
If the available memory is less than 0.466 GB, Bentley suggests a k-pass algorithm, which divides the input into ranges depending on available memory. To take a very simple example, if only 1 byte (i.e. memory to handle 8 numbers) was available and the range was from 0 to 31, we divide this into the ranges 0-7, 8-15, 16-23, 24-31 and handle one range in each of 32/8 = 4 passes.
HTH.
Since the problem does not specify that we have to find the smallest possible number that is not in the file we could just generate a number that is longer than the input file itself. :)
For the 1 GB RAM variant you can use a bit vector. You need to allocate 4 billion bits == a 500 MB byte array. For each number you read from the input, set the corresponding bit to '1'. Once you are done, iterate over the bits and find the first one that is still '0'. Its index is the answer.
If they are 32-bit integers (likely from the choice of ~4 billion numbers close to 2^32), your list of 4 billion numbers will take up at most 93% of the possible integers (4×10^9 / 2^32). So if you create a bit-array of 2^32 bits with each bit initialized to zero (which will take up 2^29 bytes ~ 500 MB of RAM; remember a byte = 2^3 bits = 8 bits), read through your integer list and for each int set the corresponding bit-array element from 0 to 1; and then read through your bit-array and return the first bit that's still 0.
In the case where you have less RAM (~10 MB), this solution needs to be slightly modified. 10 MB ~ 83886080 bits is still enough to do a bit-array for all numbers between 0 and 83886079. So you could read through your list of ints and only record numbers that are between 0 and 83886079 in your bit array. If the numbers are randomly distributed, then with overwhelming probability (it differs from 100% by about 10^-2592069) you will find a missing int. In fact, if you only choose numbers 1 to 2048 (with only 256 bytes of RAM) you'd still find a missing number an overwhelming percentage (99.99999999999999999999999999999999999999999999999999999999999995%) of the time.
But let's say instead of having about 4 billion numbers, you had something like 2^32 - 1 numbers and less than 10 MB of RAM; so any small range of ints only has a small possibility of not containing the number.
If you were guaranteed that each int in the list was unique, you could sum the numbers and subtract that sum (with one number missing) from the full sum (½)(2^32)(2^32 - 1) = 9223372034707292160 to find the missing int. However, if an int occurred twice this method will fail.
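For instance, scaled down so it actually runs (a demo of the sum trick, assuming every value in the range appears exactly once except the missing one):

full_range = 16                                     # stand-in for 2**32
full_sum = full_range * (full_range - 1) // 2       # sum of 0..full_range-1
nums = [n for n in range(full_range) if n != 11]    # 11 is "missing"
print(full_sum - sum(nums))                         # -> 11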
However, you can always divide and conquer. A naive method would be to read through the array and count the number of numbers that are in the first half (0 to 2^31 - 1) and second half (2^31 to 2^32 - 1). Then pick the range with fewer numbers and repeat dividing that range in half. (Say there were fewer numbers in (2^31, 2^32); then your next search would count the numbers in the ranges (2^31, 3·2^30 - 1) and (3·2^30, 2^32).) Keep repeating until you find a range with zero numbers and you have your answer. Should take O(lg N) ~ 32 reads through the array.
That method was inefficient. We are only using two integers in each step (or about 8 bytes of RAM with a 4 byte (32-bit) integer). A better method would be to divide into sqrt(2^32) = 2^16 = 65536 bins, each with 65536 numbers in a bin. Each bin requires 4 bytes to store its count, so you need 2^18 bytes = 256 kB. So bin 0 is (0 to 65535 = 2^16 - 1), bin 1 is (2^16 = 65536 to 2·2^16 - 1 = 131071), bin 2 is (2·2^16 = 131072 to 3·2^16 - 1 = 196607). In python you'd have something like:
import numpy as np
nums_in_bin = np.zeros(65536, dtype=np.uint32)
for N in four_billion_int_array:
    nums_in_bin[N // 65536] += 1
for bin_num, bin_count in enumerate(nums_in_bin):
    if bin_count < 65536:
        break # we have found an incomplete bin with missing ints (bin_num)
Read through the ~4 billion integer list; and count how many ints fall in each of the 216 bins and find an incomplete_bin that doesn't have all 65536 numbers. Then you read through the 4 billion integer list again; but this time only notice when integers are in that range; flipping a bit when you find them.
del nums_in_bin # allow gc to free old 256kB array
from bitarray import bitarray
my_bit_array = bitarray(65536) # 32 kB
my_bit_array.setall(0)
for N in four_billion_int_array:
    if N // 65536 == bin_num:
        my_bit_array[N % 65536] = 1
for i, bit in enumerate(my_bit_array):
    if not bit:
        print bin_num*65536 + i
        break
Why make it so complicated? You ask for an integer not present in the file?
According to the rules specified, the only thing you need to store is the largest integer that you encountered so far in the file. Once the entire file has been read, return a number 1 greater than that.
There is no risk of hitting maxint or anything, because according to the rules, there is no restriction to the size of the integer or the number returned by the algorithm.
This can be solved in very little space using a variant of binary search.
Start off with the allowed range of numbers, 0 to 4294967295.
Calculate the midpoint.
Loop through the file, counting how many numbers were equal, less than or higher than the midpoint value.
If no numbers were equal, you're done. The midpoint number is the answer.
Otherwise, choose the range that had the fewest numbers and repeat from step 2 with this new range.
This will require up to 32 linear scans through the file, but it will only use a few bytes of memory for storing the range and the counts.
This is essentially the same as Henning's solution, except it uses two bins instead of 16k.
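A compact sketch of this counting bisection (my own variant: it descends into whichever half contains fewer values than it has room for, which also copes with duplicates; nums stands in for the file and is re-scanned on every round, mirroring the linear passes described above):

def missing_by_bisection(nums, lo=0, hi=2**32 - 1):
    while lo < hi:
        mid = (lo + hi) // 2
        below = sum(1 for n in nums if lo <= n <= mid)
        if below < (mid - lo + 1):   # the lower half is not full
            hi = mid
        else:                        # then the upper half cannot be full
            lo = mid + 1
    return lo

nums = [n for n in range(100) if n != 42]
print(missing_by_bisection(nums, 0, 99))   # -> 42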
EDIT Ok, this wasn't quite thought through as it assumes the integers in the file follow some static distribution. Apparently they don't need to, but even then one should try this:
There are ≈4.3 billion 32-bit integers. We don't know how they are distributed in the file, but the worst case is the one with the highest Shannon entropy: an equal distribution. In this case, the probability for any one integer to not occur in the file is
( (2³²-1)/2³² )⁴ ⁰⁰⁰ ⁰⁰⁰ ⁰⁰⁰ ≈ .4
The lower the Shannon entropy, the higher this probability gets on the average, but even for this worst case we have a chance of 90% to find a nonoccurring number after 5 guesses with random integers. Just create such numbers with a pseudorandom generator, store them in a list. Then read int after int and compare it to all of your guesses. When there's a match, remove this list entry. After having been through all of the file, chances are you will have more than one guess left. Use any of them. In the rare (10% even at worst case) event of no guess remaining, get a new set of random integers, perhaps more this time (10->99%).
Memory consumption: a few dozen bytes, complexity: O(n), overhead: negligible as most of the time will be spent in the unavoidable hard disk accesses rather than comparing ints anyway.
The actual worst case, when we do not assume a static distribution, is that every integer occurs max. once, because then only
1 - 4000000000/2³² ≈ 6%
of all integers don't occur in the file. So you'll need some more guesses, but that still won't cost hurtful amounts of memory.
If you have one integer missing from the range [0, 2^x - 1] then just xor them all together. For example:
>>> 0 ^ 1 ^ 3
2
>>> 0 ^ 1 ^ 2 ^ 3 ^ 4 ^ 6 ^ 7
5
(I know this doesn't answer the question exactly, but it's a good answer to a very similar question.)
They may be looking to see if you have heard of a probabilistic Bloom Filter which can very efficiently determine absolutely if a value is not part of a large set, (but can only determine with high probability it is a member of the set.)
Based on the current wording in the original question, the simplest solution is:
Find the maximum value in the file, then add 1 to it.
Use a BitSet. 4 billion integers (assuming up to 2^32 integers) packed into a BitSet at 8 per byte is 2^32 / 2^3 = 2^29 bytes = approx 0.5 GB.
To add a bit more detail - every time you read a number, set the corresponding bit in the BitSet. Then, do a pass over the BitSet to find the first number that's not present. In fact, you could do this just as effectively by repeatedly picking a random number and testing if it's present.
Actually BitSet.nextClearBit(0) will tell you the first non-set bit.
Looking at the BitSet API, it appears to only support 0..MAX_INT, so you may need 2 BitSets - one for +'ve numbers and one for -'ve numbers - but the memory requirements don't change.
If there is no size limit, the quickest way is to take the length of the file and output a number with one more digit than the file could hold (random digits, or just "11111..."). Advantage: you don't even need to read the file, and you can minimize memory use nearly to zero. Disadvantage: You will print billions of digits.
However, if the only factor was minimizing memory usage, and nothing else is important, this would be the optimal solution. It might even get you a "worst abuse of the rules" award.
If we assume that the range of numbers will always be 2^n (an even power of 2), then exclusive-or will work (as shown by another poster). As far as why, let's prove it:
The Theory
Given any 0 based range of integers that has 2^n elements with one element missing, you can find that missing element by simply xor-ing the known values together to yield the missing number.
The Proof
Let's look at n = 2. For n=2, we can represent 4 unique integers: 0, 1, 2, 3. They have a bit pattern of:
0 - 00
1 - 01
2 - 10
3 - 11
Now, if we look, each and every bit is set exactly twice. Therefore, since every bit is set an even number of times, the exclusive-or of the numbers will yield 0. If a single number is missing, the exclusive-or will yield a number that, when exclusive-ored with the missing number, will result in 0. Therefore, the missing number and the resulting exclusive-ored number are exactly the same. If we remove 2, the resulting xor will be 10 (or 2).
Now, let's look at n+1. Let's call the number of times each bit is set in n, x and the number of times each bit is set in n+1 y. The value of y will be equal to y = x * 2 because there are x elements with the n+1 bit set to 0, and x elements with the n+1 bit set to 1. And since 2x will always be even, n+1 will always have each bit set an even number of times.
Therefore, since n=2 works, and n+1 works, the xor method will work for all values of n>=2.
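A quick brute-force check of that claim for small n (a sanity test I added, not part of the proof):

from functools import reduce
from operator import xor

for n in range(2, 11):
    assert reduce(xor, range(2**n)) == 0          # full range xors to 0
    missing = 2**n // 3                           # arbitrary victim
    rest = reduce(xor, (v for v in range(2**n) if v != missing))
    assert rest == missing                        # the survivors xor to it
print("verified for n = 2..10")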
The Algorithm For 0 Based Ranges
This is quite simple. It uses 2*n bits of memory, so for any range of up to 32 bits, two 32-bit integers will work (ignoring any memory consumed by the file descriptor). And it makes a single pass of the file.
long supplied = 0;
long result = 0;
while (supplied = read_int_from_file()) {
    result = result ^ supplied;
}
return result;
The Algorithm For Arbitrary Based Ranges
This algorithm will work for ranges of any starting number to any ending number, as long as the total range is equal to 2^n... This basically re-bases the range to have the minimum at 0. But it does require 2 passes through the file (the first to grab the minimum, the second to compute the missing int).
long supplied = 0;
long result = 0;
long offset = INT_MAX;
while (supplied = read_int_from_file()) {
    if (supplied < offset) {
        offset = supplied;
    }
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
    result = result ^ (supplied - offset);
}
return result + offset;
Arbitrary Ranges
We can apply this modified method to a set of arbitrary ranges, since all ranges will cross a power of 2^n at least once. This works only if there is a single missing bit. It takes 2 passes of an unsorted file, but it will find the single missing number every time:
long supplied = 0;
long result = 0;
long offset = INT_MAX;
long n = 0;
double temp;
while (supplied = read_int_from_file()) {
    if (supplied < offset) {
        offset = supplied;
    }
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
    n++;
    result = result ^ (supplied - offset);
}
// We need to increment n one value so that we take care of the missing
// int value
n++;
while (n == 1 || 0 != (n & (n - 1))) {
    result = result ^ (n++);
}
return result + offset;
Basically, re-bases the range around 0. Then, it counts the number of unsorted values to append as it computes the exclusive-or. Then, it adds 1 to the count of unsorted values to take care of the missing value (count the missing one). Then, keep xoring the n value, incremented by 1 each time until n is a power of 2. The result is then re-based back to the original base. Done.
Here's the algorithm I tested in PHP (using an array instead of a file, but same concept):
function find($array) {
    $offset = min($array);
    $n = 0;
    $result = 0;
    foreach ($array as $value) {
        $result = $result ^ ($value - $offset);
        $n++;
    }
    $n++; // This takes care of the missing value
    while ($n == 1 || 0 != ($n & ($n - 1))) {
        $result = $result ^ ($n++);
    }
    return $result + $offset;
}
Fed in an array with any range of values (I tested including negatives) with one inside that range which is missing, it found the correct value each time.
Another Approach
Since we can use external sorting, why not just check for a gap? If we assume the file is sorted prior to the running of this algorithm:
long supplied = 0;
long last = read_int_from_file();
while (supplied = read_int_from_file()) {
    if (supplied != last + 1) {
        return last + 1;
    }
    last = supplied;
}
// The range is contiguous, so what do we do here? Let's return last + 1:
return last + 1;
Trick question, unless it's been quoted improperly. Just read through the file once to get the maximum integer n, and return n+1.
Of course you'd need a backup plan in case n+1 causes an integer overflow.
Check the size of the input file, then output any number which is too large to be represented by a file that size. This may seem like a cheap trick, but it's a creative solution to an interview problem, it neatly sidesteps the memory issue, and it's technically O(n).
void maxNum(ulong filesize)
{
    ulong bitcount = filesize * 8; //number of bits in file
    for (ulong i = 0; i < bitcount; i++)
    {
        Console.Write(9);
    }
}
Should print 10^bitcount - 1, which will always be greater than 2^bitcount. Technically, the number you have to beat is 2^bitcount - (4×10^9 - 1), since you know there are (4 billion - 1) other integers in the file, and even with perfect compression they'll take up at least one bit each.
The simplest approach is to find the minimum number in the file, and return 1 less than that. This uses O(1) storage, and O(n) time for a file of n numbers. However, it will fail if number range is limited, which could make min-1 not-a-number.
The simple and straightforward method of using a bitmap has already been mentioned. That method uses O(n) time and storage.
A 2-pass method with 2^16 counting-buckets has also been mentioned. It reads 2*n integers, so uses O(n) time and O(1) storage, but it cannot handle value ranges larger than 2^16 × 2^16 = 2^32. However, it's easily extended to (eg) 2^60 64-bit integers by running 4 passes instead of 2, and easily adapted to using tiny memory by using only as many bins as fit in memory and increasing the number of passes correspondingly, in which case run time is no longer O(n) but instead is O(n*log n).
The method of XOR'ing all the numbers together, mentioned so far by rfrankel and at length by ircmaxell answers the question asked in stackoverflow#35185, as ltn100 pointed out. It uses O(1) storage and O(n) run time. If for the moment we assume 32-bit integers, XOR has a 7% probability of producing a distinct number. Rationale: given ~ 4G distinct numbers XOR'd together, and ca. 300M not in file, the number of set bits in each bit position has equal chance of being odd or even. Thus, 2^32 numbers have equal likelihood of arising as the XOR result, of which 93% are already in file. Note that if the numbers in file aren't all distinct, the XOR method's probability of success rises.
Strip the white space and non numeric characters from the file and append 1. Your file now contains a single number not listed in the original file.
From Reddit by Carbonetc.
For some reason, as soon as I read this problem I thought of diagonalization. I'm assuming arbitrarily large integers.
Read the first number. Left-pad it with zero bits until you have 4 billion bits. If the first (high-order) bit is 0, output 1; else output 0. (You don't really have to left-pad: you just output a 1 if there are not enough bits in the number.) Do the same with the second number, except use its second bit. Continue through the file in this way. You will output a 4-billion bit number one bit at a time, and that number will not be the same as any in the file. Proof: if it were the same as the nth number, then they would agree on the nth bit, but by construction they don't.
You can use bit flags to mark whether an integer is present or not.
After traversing the entire file, scan each bit to determine if the number exists or not.
Assuming each integer is 32 bit, they will conveniently fit in 1 GB of RAM if bit flagging is done.
Just for the sake of completeness, here is another very simple solution, which will most likely take a very long time to run, but uses very little memory.
Let all possible integers be the range from int_min to int_max, and
bool isNotInFile(integer), a function which returns true if the file does not contain a certain integer and false otherwise (by comparing that certain integer with each integer in the file)
for (integer i = int_min; i <= int_max; ++i)
{
    if (isNotInFile(i)) {
        return i;
    }
}
For the 10 MB memory constraint:
Convert the number to its binary representation.
Create a binary tree where left = 0 and right = 1.
Insert each number in the tree using its binary representation.
If a number has already been inserted, the leaves will already have been created.
When finished, just take a path that has not been created before to create the requested number.
4 billion numbers ≈ 2^32, meaning 10 MB might not be sufficient.
EDIT
An optimization is possible: if two sibling leaves have been created and have a common parent, then they can be removed and the parent flagged as not a solution. This cuts branches and reduces the need for memory.
EDIT II
There is also no need to build the tree completely. You only need to build deep branches if numbers are similar. If we cut branches too, then this solution might work in fact.
I will answer the 1 GB version:
There is not enough information in the question, so I will state some assumptions first:
The integer is 32 bits with range -2,147,483,648 to 2,147,483,647.
Pseudo-code:
var bitArray = new bit[4294967296]; // 0.5 GB, initialized to all 0s.
foreach (var number in file) {
    bitArray[number + 2147483648] = 1; // Shift all numbers so they start at 0.
}
for (var i = 0; i < 4294967296; i++) {
    if (bitArray[i] == 0) {
        return i - 2147483648;
    }
}
As long as we're doing creative answers, here is another one.
Use the external sort program to sort the input file numerically. This will work for any amount of memory you may have (it will use file storage if needed).
Read through the sorted file and output the first number that is missing.
Bit Elimination
One way is to eliminate bits; however, this might not actually yield a result (chances are it won't). Pseudocode:
long val = 0xFFFFFFFFFFFFFFFF; // (all bits set)
foreach long fileVal in file
{
    val = val & ~fileVal;
    if (val == 0) error;
}
Bit Counts
Keep track of the bit counts, and use the bits with the lowest counts to generate a value. Again this has no guarantee of generating a correct value.
Range Logic
Keep track of a list of ordered ranges (ordered by start). A range is defined by the structure:
struct Range
{
    long Start, End; // Inclusive.
}
Range startRange = new Range { Start = 0x0, End = 0xFFFFFFFFFFFFFFFF };
Go through each value in the file and try and remove it from the current range. This method has no memory guarantees, but it should do pretty well.
2^(128×10^18) + 1 (which is (2^8)^(16×10^18) + 1): can't that be a universal answer for today? It represents a number that cannot be held in a 16 EB file, which is the maximum file size in any current file system.
I think this is a solved problem (see above), but there's an interesting side case to keep in mind because it might get asked:
If there are exactly 4,294,967,295 (2^32 - 1) 32-bit integers with no repeats, and therefore only one is missing, there is a simple solution.
Start a running total at zero, and for each integer in the file, add that integer with 32-bit overflow (effectively, runningTotal = (runningTotal + nextInteger) % 4294967296). Once complete, add 4294967296/2 to the running total, again with 32-bit overflow. Subtract this from 4294967296, and the result is the missing integer.
The "only one missing integer" problem is solvable with only one run, and only 64 bits of RAM dedicated to the data (32 for the running total, 32 to read in the next integer).
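A demo of that running-total trick, scaled down to 8-bit arithmetic so it runs instantly (replace 8 with 32, and the toy range with the file, for the real case; the final modulo mirrors the 32-bit wraparound in the description above):

BITS = 8
MOD = 1 << BITS
missing = 77
nums = (n for n in range(MOD) if n != missing)   # every value except one

total = 0
for n in nums:
    total = (total + n) % MOD                    # add with overflow
total = (total + MOD // 2) % MOD
print((MOD - total) % MOD)                       # -> 77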
Corollary: The more general specification is extremely simple to match if we aren't concerned with how many bits the integer result must have. We just generate a big enough integer that it cannot be contained in the file we're given. Again, this takes up absolutely minimal RAM. See the pseudocode.
# Grab the file size
fseek(fp, 0L, SEEK_END);
sz = ftell(fp);

# Print a '2' for every bit of the file.
for (c=0; c<sz; c++) {
    for (b=0; b<4; b++) {
        print "2";
    }
}
As Ryan basically said: sort the file, then go over the integers, and when a value is skipped there you have it :)
EDIT at downvoters: the OP mentioned that the file could be sorted so this is a valid method.
If you don't assume the 32-bit constraint, just return a randomly generated 64-bit number (or 128-bit if you're a pessimist). The chance of collision is 1 in 2^64/(4×10^9) ≈ 4611686018 (roughly 1 in 4.6 billion). You'd be right most of the time!
(Joking... kind of.)
I want to multiply two numbers, and detect if there was an overflow. What is the simplest way to do that?
Multiplying two 32-bit numbers results in a 64-bit answer, two 8-bit numbers give a 16-bit result, etc. Binary multiplication is simply shifting and adding. So if you had, say, two 32-bit operands with bit 17 set in operand A and any of the bits above 15 or 16 set in operand B, you will overflow a 32-bit result: bit 17 shifted left by 16 is bit 33, which does not fit in 32.
So the question again is: what are the sizes of your inputs and the size of your result? If the result is the same size as the inputs, then you have to find the most significant 1 of both operands and add those bit locations; if that result is bigger than your result space, you will overflow.
EDIT
Yes, multiplying two 3-bit numbers will result in either a 5-bit number or a 6-bit number if there is a carry in the add. Likewise a 2-bit and a 5-bit can result in 6 or 7 bits, etc. If the reason for the poster's question is to see if you have space in your result variable for an answer, then this solution will work and is relatively fast for most languages on most processors. It can be significantly faster on some and significantly slower on others. It is generically fast (depending on how it is implemented of course) to just look at the number of bits in the operands. Doubling the size of the largest operand is a safe bet if you can do it within your language or processor. Divides are downright expensive (slow) and most processors don't have one, much less at an arbitrary doubling of operand sizes. The fastest of course is to drop to assembler, do the multiply and look at the overflow bit (or compare one of the result registers with zero). If your processor can't do the multiply in hardware then it is going to be slow no matter what you do. I am guessing that asm is not the right answer to this post despite being by far the fastest and having the most accurate overflow status.
binary makes multiplication trivial compared to decimal, for example take the binary numbers
0b100 *
0b100
Just like decimal math in school, you (can) start with the least significant bit of the lower operand and multiply it against all the locations in the upper operand, except that with binary there are only two choices: you multiply by zero, meaning you don't have to add to the result, or you multiply by one, which means you just shift and add; no actual multiplication is necessary like you would have in decimal.
000 : 0 * 100
000 : 0 * 100
100 : 1 * 100
Add up the columns and the answer is 0b10000
Same as decimal math: a 1 in the hundreds column means copy the top number and add two zeros; it works the same in any other base as well. So 0b100 times 0b110 is 0b1000 (a one in the second column over, so copy and add a zero) plus 0b10000 (a one in the third column over, so copy and add two zeros) = 0b11000.
This leads to looking at the most significant bits in both numbers. 0b1xx * 0b1xx guarantees a 1xxxx is added to the answer, and that is the largest bit location in the add; no other single input to the final add has that column or a more significant column populated. From there you need only one more bit in case the other bits being added up cause a carry.
Which happens with the worst case all ones times all ones, 0b111 * 0b111
0b00111 +
0b01110 +
0b11100
This causes a carry bit in the addition, resulting in 0b110001: 6 bits. A 3-bit operand times a 3-bit operand, 3+3 = 6, so 6 bits worst case.
So size of the operands using the most significant bit (not the size of the registers holding the values) determines the worst case storage requirement.
Well, that is true assuming positive operands. If you consider some of these numbers to be negative it changes things but not by much.
Minus 4 times 5, 0b1111...111100 * 0b0000....000101 = -20 or 0b1111..11101100
It takes 4 bits to represent a minus 4 and 4 bits to represent a positive 5 (don't forget your sign bit). Our result required 6 bits if you stripped off all the sign bits.
Let's look at the 4-bit corner cases:
-8 * 7 = -56
0b1000 * 0b0111 = 0b1001000
-1 * 7 = -7 = 0b1001
-8 * -8 = 64 = 0b01000000
-1 * -1 = 2 = 0b010
-1 * -8 = 8 = 0b01000
7 * 7 = 49 = 0b0110001
Let's say we count positive numbers as the most significant 1 plus one, and negative numbers as the most significant 0 plus one.
-8 * 7 is 4+4=8 bits actual 7
-1 * 7 is 1+4=5 bits, actual 4 bits
-8 * -8 is 4+4=8 bits, actual 8 bits
-1 * -1 is 1+1=2 bits, actual 3 bits
-1 * -8 is 1+4=5 bits, actual 5 bits
7 * 7 is 4+4=8 bits, actual 7 bits.
So this rule works, with the exception of -1 * -1, you can see that I called a minus one one bit, for the plus one thing find the zero plus one. Anyway, I argue that if this were a 4 bit * 4 bit machine as defined, you would have 4 bits of result at least and I interpret the question as how may more than 4 bits do I need to safely store the answer. So this rule serves to answer that question for 2s complement math.
If your question was to accurately determine overflow and speed is secondary, then, well, it is going to be really, really slow for some systems, for every multiply you do. If this is the question you are asking, to get some of the speed back you need to tune it a little better for the language and/or processor. Double up the biggest operand, if you can, and check for non-zero bits above the result size, or use a divide and compare. If you can't double the operand sizes, divide and compare. Check for zero before the divide.
Actually, your question doesn't specify what size of overflow you are talking about either. The good old 8086's 16 bit times 16 bit gives a 32-bit result (in hardware); it can never overflow. What about some of the ARMs that have a multiply: 32 bit times 32 bit, 32 bit result, easy to overflow. What is the size of your operands for this question, are they the same size or are they double the input size? Are you willing to perform multiplies that the hardware cannot do (without overflowing)? Are you writing a compiler library and trying to determine if you can feed the operands to the hardware for speed or if you have to perform the math without a hardware multiply? That is the kind of thing you get if you cast up the operands: the compiler library will try to cast the operands back down before doing the multiply, depending on the compiler and its library of course. And it will use the count-the-bits trick to determine whether to use the hardware multiply or a software one.
My goal here was to show how binary multiply works in a digestible form so you can see how much maximum storage you need by finding the location of a single bit in each operand. Now how fast you can find that bit in each operand is the trick. If you were looking for minimum storage requirements rather than maximum, that is a different story, because it involves every single one of the significant bits in both operands, not just one bit per operand; you have to do the multiply to determine minimum storage. If you don't care about maximum or minimum storage you have to just do the multiply and look for non-zeros above your defined overflow limit, or use a divide if you have the time or hardware.
Your tags imply you are not interested in floating point, floating point is a completely different beast, you cannot apply any of these fixed point rules to floating point, they DO NOT work.
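A sketch of that most-significant-bit test, assuming Python and unsigned operands with a fixed result width (bit_length() finds the position of the top set bit); the borderline case is the "one more bit" caveat from the discussion above and needs one exact check (or a divide-based one in a language without big integers):

def mul_may_overflow(a, b, result_bits=32):
    if a == 0 or b == 0:
        return False
    worst = a.bit_length() + b.bit_length()   # product needs worst or worst-1 bits
    if worst <= result_bits:
        return False                          # definitely fits
    if worst > result_bits + 1:
        return True                           # definitely overflows
    return a * b >= (1 << result_bits)        # borderline: one exact check

print(mul_may_overflow(0xFFFF, 0xFFFF),       # False: 0xFFFE0001 fits in 32 bits
      mul_may_overflow(0xFFFF, 0x10000),      # False: 0xFFFF0000 still fits
      mul_may_overflow(0x10000, 0x10000))     # True:  2**32 does not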
Check if one is less than a maximum value divided by the other. (All values are taken as absolute).
2's complementness hardly has anything to do with it, since the multiplication overflows if x·(2^n - x) > 2^M, which is equal to (x·2^n - x²) > 2^M, or x² < (x·2^n - 2^M), so you'll have to compare overflowing numbers anyway (x² may overflow, while the result may not).
If your number are not from the largest integral data type, then you might just cast them up, multiply and compare with the maximum of the number's original type. E.g. in Java, when multiplying two int, you can cast them to long and compare the result to Integer.MAX_VALUE or Integer.MIN_VALUE (depending on sign combination), before casting the result down to int.
If the type already is the largest, then check if one is less than the maximum value divided by the other. But do not take the absolute value! Instead you need separate comparison logic for each of the sign combinations neg*neg, pos*pos and pos*neg (neg*pos can obviously be reduced to pos*neg, and pos*pos might be reduced to neg*neg). First test for 0 arguments to allow safe divisions.
For actual code, see the Java source of MathUtils class of the commons-math 2, or ArithmeticUtils of commons-math 3. Look for public static long mulAndCheck(long a, long b). The case for positive a and b is
// check for positive overflow with positive a, positive b
if (a <= Long.MAX_VALUE / b) {
    ret = a * b;
} else {
    throw new ArithmeticException(msg);
}
I want to multiply two (2's complement) numbers, and detect if there was an overflow. What is the simplest way to do that?
Various languages do not specify valid checking for overflow after it occurs and so prior tests are required.
With some types, a wider integer type may not exist, so a general solution should limit itself to a single type.
The below (Ref) only requires compares and known limits to the integer range. It returns 1 if a product overflow will occur, else 0.
int is_undefined_mult1(int a, int b) {
  if (a > 0) {
    if (b > 0) {
      return a > INT_MAX / b;        // a positive, b positive
    }
    return b < INT_MIN / a;          // a positive, b not positive
  }
  if (b > 0) {
    return a < INT_MIN / b;          // a not positive, b positive
  }
  return a != 0 && b < INT_MAX / a;  // a not positive, b not positive
}
Is this the simplest way?
Perhaps, yet it is complete and handles all cases known to me - including rare non-2's complement.
Alternatives to Pavel Shved's solution ...
If your language of choice is assembler, then you should be able to check the overflow flag. If not, you could write a custom assembler routine that sets a variable if the overflow flag was set.
If this is not acceptable, you can find the most significant set bit of both values (absolutes). If the sum exceeds the number of bits in the integer (or unsigned) then you will have an overflow if they are multiplied together.
Hope this helps.
In C, here's some maturely optimized code that handles the full range of corner cases:
int
would_mul_exceed_int(int a, int b) {
  int product_bits;
  if (a == 0 || b == 0 || a == 1 || b == 1) return (0); /* always okay */
  if (a == INT_MIN || b == INT_MIN) return (1); /* always underflow */
  a = ABS(a);
  b = ABS(b);
  product_bits  = significant_bits_uint((unsigned)a);
  product_bits += significant_bits_uint((unsigned)b);
  if (product_bits == BITS(int)) { /* cases where the more expensive test is required */
    return (a > INT_MAX / b); /* remember that IDIV and similar are very slow (dozens - hundreds of cycles) compared to bit shifts, adds */
  }
  return (product_bits > BITS(int));
}
Full example with test cases here
The benefit of the above approach is it doesn't require casting up to a larger type, so the approach could work on larger integer types.
Scroll down to see latest edit, I left all this text here just so that I don't invalidate the replies this question has received so far!
I have the following brain teaser I'd like to get a solution for. I have tried to solve this, but since I'm not mathematically that much above average (that is, I think I'm very close to average), I can't seem to wrap my head around this.
The problem: A given number x should be split into a series of multipliers, where each multiplier <= y, y being a constant like 10 or 16 or whatever. In the series (technically an array of integers) the last number should be added instead of multiplied, to be able to convert the multipliers back to the original number.
As an example, let's assume x=29 and y=10. In this case the expected array would be {10,2,9}, meaning 10*2+9. However if y=5, it'd be {5,5,4}, meaning 5*5+4, or if y=3, it'd be {3,3,3,2}, which would then be 3*3*3+2.
I tried to solve this by doing something like this:
while x >= y, store y to multipliers, then x = x - y
when x < y, store x to multipliers
Obviously this didn't work, I also tried to store the "leftover" part separately and add that after everything else but that didn't work either. I believe my main problem is that I try to think this in a way too complex manner while the solution is blatantly obvious and simple.
To reiterate, these are the limits this algorithm should have:
has to work with 64bit longs
has to return an array of 32bit integers (...well, shorts are OK too)
while support for signed numbers (both + and -) would be nice, if it helps the task, only unsigned numbers are a must
And while I'm doing this using Java, I'd rather take any possible code examples as pseudocode, I specifically do NOT want readily made answers, I just need a nudge (well, more of a strong kick) so that I can solve this at least partly myself. Thanks in advance.
Edit: Further clarification
To avoid some confusion, I think I should reword this a bit:
Every integer in the result array should be less or equal to y, including the last number.
Yes, the last number is just a magic number.
No, this isn't modulus, since then the second number would be larger than y in most cases.
Yes, there are multiple answers for most of the numbers; however, I'm looking for the one with the least amount of math ops. As far as my logic goes, that means finding the maximum amount of as-big-as-possible multipliers, for example x=1 000 000, y=100 is 100*100*100 even though 10*10*10*10*10*10 is an equally correct answer math-wise.
I need to go through the given answers so far with some thought but if you have anything to add, please do! I do appreciate the interest you've already shown on this, thank you all for that.
Edit 2: More explanations + bounty
Okay, seems like what I was aiming for in here just can't be done the way I thought it could be. I was too ambiguous with my goal and after giving it a bit of a thought I decided to just tell you in its entirety what I'd want to do and see what you can come up with.
My goal originally was to come up with a specific method to pack 1..n large integers (aka longs) together so that their String representation is notably shorter than writing the actual number. Think powers of ten: 10^6 and 1 000 000 are the same, however the representation's length in characters isn't.
For this I wanted to somehow combine the numbers, since it is expected that the numbers are somewhat close to each other. I first thought that representing 100, 121, 282 as 100+21+161 could be the way to go, but the saving in string length is negligible at best and really doesn't work that well if the numbers aren't very close to each other. Basically I wanted more than ~10%.
So I came up with the idea that what if I'd group the numbers by common property such as a multiplier and divide the rest of the number to individual components which I can then represent as a string. This is where this problem steps in, I thought that for example 1 000 000 and 100 000 can be expressed as 10^(5|6) but due to the context of my aimed usage this was a bit too flaky:
The context is the Web, RESTful URLs to be specific. That's why I mentioned thinking of using 64 characters (web-safe alphanumeric non-reserved characters and then some), since then I could create seemingly random URLs which could be unpacked to a list of integers expressing a set of id numbers. At this point I thought of creating a base-64-like number system for expressing base 10/2 numbers, but since I'm not a math genius I have no idea beyond this point how to do it.
The bounty
Now that I have written the whole story (sorry that it's a long one), I'm opening a bounty to this question. Everything regarding requirements for the preferred algorithm specified earlier is still valid. I also want to say that I'm already grateful for all the answers I've received so far, I enjoy being proven wrong if it's done in such a manner as you people have done.
The conclusion
Well, bounty is now given. I spread a few comments to responses mostly for future reference and myself, you can also check out my SO Uservoice suggestion about spreading bounty which is related to this question if you think we should be able to spread it among multiple answers.
Thank you all for taking time and answering!
Update
I couldn't resist trying to come up with my own solution for the first question even though it doesn't do compression. Here is a Python solution using a third party factorization algorithm called pyecm.
This solution is probably several orders of magnitude more efficient than Yevgeny's. Computations take seconds instead of hours, or maybe even weeks/years, for reasonable values of y. For x = 2^32-1 and y = 256, it took 1.68 seconds on my Core Duo 1.2 GHz.
>>> import time
>>> def test():
...     before = time.time()
...     print factor(2**32-1, 256)
...     print time.time()-before
...
>>> test()
[254, 232, 215, 113, 3, 15]
1.68499994278
>>> 254*232*215*113*3+15
4294967295L
And here is the code:
def factor(x, y):
    # y should be smaller than x. If x=y then {y, 1, 0} is the best solution
    assert(x > y)
    best_output = []
    # try all possible remainders from 0 to y
    for remainder in xrange(y+1):
        output = []
        composite = x - remainder
        factors = getFactors(composite)
        # check if any factor is larger than y
        bad_remainder = False
        for n in factors.iterkeys():
            if n > y:
                bad_remainder = True
                break
        if bad_remainder: continue
        # make the best factors
        while True:
            results = largestFactors(factors, y)
            if results == None: break
            output += [results[0]]
            factors = results[1]
        # store the best output
        output = output + [remainder]
        if len(best_output) == 0 or len(output) < len(best_output):
            best_output = output
    return best_output
# Heuristic
# The bigger the number the better. 8 is more compact than 2,2,2 etc...
# Find the most factors you can have below or equal to y
# output the number and unused factors that can be reinserted in this function
def largestFactors(factors, y):
    assert(y > 1)
    # iterate from y to 2 and see if the factors are present.
    for i in xrange(y, 1, -1):
        try_another_number = False
        factors_below_y = getFactors(i)
        for number, copies in factors_below_y.iteritems():
            if number in factors:
                if factors[number] < copies:
                    try_another_number = True
                    continue # not enough factors
            else:
                try_another_number = True
                continue # a factor is not present
        # Do we want to try another number, or was a solution found?
        if try_another_number == True:
            continue
        else:
            output = 1
            for number, copies in factors_below_y.items():
                remaining = factors[number] - copies
                if remaining > 0:
                    factors[number] = remaining
                else:
                    del factors[number]
                output *= number ** copies
            return (output, factors)
    return None # failed
# Find prime factors. You can use any formula you want for this.
# I am using elliptic curve factorization from http://sourceforge.net/projects/pyecm
import pyecm, collections, copy
getFactors_cache = {}
def getFactors(n):
    assert(n != 0)
    # attempt to retrieve from cache. Returns a copy
    try:
        return copy.copy(getFactors_cache[n])
    except KeyError:
        pass
    output = collections.defaultdict(int)
    for factor in pyecm.factors(n, False, True, 10, 1):
        output[factor] += 1
    # cache result
    getFactors_cache[n] = output
    return copy.copy(output)
Answer to first question
You say you want compression of numbers, but from your examples, those sequences are longer than the undecomposed numbers. It is not possible to compress these numbers without more details to the system you left out (probability of sequences/is there a programmable client?). Could you elaborate more?
Here is a mathematical explanation as to why current answers to the first part of your problem will never solve your second problem. It has nothing to do with the knapsack problem.
This is Shannon's entropy formula, H = -Σ p(Xi)·log2 p(Xi). It tells you the theoretical minimum number of bits you need to represent a sequence {X0, X1, X2, ..., Xn-1, Xn}, where p(Xi) is the probability of seeing token Xi.
Let's say that X0 to Xn is the span of 0 to 4294967295 (the range of an integer). From what you have described, each number is as likely as another to appear. Therefore the probability of each element is 1/4294967296.
When we plug it into Shannon's algorithm, it will tell us what the minimum number of bits are required to represent the stream.
import math
def entropy():
    num = 2**32
    probability = 1./num
    return -(num) * probability * math.log(probability, 2)
    # the (num) * probability cancels out
The entropy unsurprisingly is 32. We require 32 bits to represent an integer where each number is equally likely. The only way to reduce this number, is to increase the probability of some numbers, and decrease the probability of others. You should explain the stream in more detail.
Answer to second question
The right way to do this is to use base64, when communicating with HTTP. Apparently Java does not have this in the standard library, but I found a link to a free implementation:
http://iharder.sourceforge.net/current/java/base64/
Here is the "pseudo-code" which works perfectly in Python and should not be difficult to convert to Java (my Java is rusty):
def longTo64(num):
    mapping = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
    output = ""
    # special case for 0
    if num == 0:
        return mapping[0]
    while num != 0:
        output = mapping[num % 64] + output
        num /= 64
    return output
If you have control over your web server and web client, and can parse the entire HTTP requests without problem, you can upgrade to base85. According to wikipedia, url encoding allows for up to 85 characters. Otherwise, you may need to remove a few characters from the mapping.
Here is another code example in Python
def longTo85(num):
mapping = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~!*'();:#&=+$,/?%#[]"
output = ""
base = len(mapping)
# special case for 0
if num == 0:
return mapping[0]
while num != 0:
output = mapping[num % base] + output
num /= base
return output
And here is the inverse operation:
def stringToLong(string):
mapping = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~!*'();:#&=+$,/?%#[]"
output = 0
base = len(mapping)
place = 0
# check each digit from the lowest place
for digit in reversed(string):
# find the number the mapping of symbol to number, then multiply by base^place
output += mapping.find(digit) * (base ** place)
place += 1
return output
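As a quick sanity check of the two functions above (assuming Python 2, which matches the integer division used in longTo85), the pair should round-trip:
for n in (0, 1, 63, 64, 84, 85, 123456789012345678):
    assert stringToLong(longTo85(n)) == n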
Here is a graph of Shannon's algorithm in different bases.
As you can see, the higher the radix, the fewer symbols are needed to represent a number. At base64, ~11 symbols are required to represent a long. At base85, it becomes ~10 symbols.
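Those figures follow from ceil(64 / log2(base)); here is a quick sketch of my own to check a few bases:
import math

# Symbols needed to represent a 64-bit number in a given base.
for base in (10, 16, 64, 85):
    symbols = int(math.ceil(64 / math.log(base, 2)))
    print("base %d: %d symbols" % (base, symbols))  # base 64: 11, base 85: 10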
Edit after final explanation:
I would think base64 is the best solution, since there are standard functions that deal with it, and variants of this idea don't give much improvement. This was answered with much more detail by others here.
Regarding the original question, although the code works, it is not guaranteed to run in any reasonable time, as LFSR Consulting pointed out in an answer and in comments on this question.
Original Answer:
You mean something like this?
Edit - corrected after a comment.
shortest_output = {}
for (int R = 0; R <= X; R++) {
// iteration over possible remainders
// check if the rest of X can be decomposed into multipliers
newX = X - R;
output = {};
while (newX > Y) {
int i;
for (i = Y; i > 1; i--) {
if ( newX % i == 0) { // found a divider
output.append(i);
newX = newX /i;
break;
}
}
if (i == 1) { // no dividers <= Y
break;
}
}
if (newX != 1) {
// couldn't find dividers with no remainder
output.clear();
}
else {
output.append(R);
if (shortest_output.isEmpty() || output.length() < shortest_output.length()) {
shortest_output = output;
}
}
}
It sounds as though you want to compress random data -- this is impossible for information theoretic reasons. (See http://www.faqs.org/faqs/compression-faq/part1/preamble.html question 9.) Use Base64 on the concatenated binary representations of your numbers and be done with it.
The problem you're attempting to solve (you're dealing with a subset of it, given your restriction on y) is called Integer Factorization, and it cannot be done efficiently with any known algorithm:
In number theory, integer factorization is the breaking down of a composite number into smaller non-trivial divisors, which when multiplied together equal the original integer.
This problem is what makes a number of cryptographic functions possible (namely RSA, whose security rests on the difficulty of factoring integers far larger than a 64-bit long). The wiki page contains some good resources that should move you in the right direction with your problem.
So, your brain teaser is indeed a brain teaser... and if you solve it efficiently we can elevate your math skills to above average!
Updated after the full story
Base64 is most likely your best option. If you want a custom solution you can try implementing a base-65+ system. Just remember that just because 10000 can be written as "10^4" doesn't mean that everything can be written as 10^n where n is an integer. Different base systems are the simplest way to write numbers, and the higher the base the fewer digits the number requires. Plus most framework libraries contain algorithms for Base64 encoding. (What language are you using?)
One way to further pack the urls is the one you mentioned but in Base64.
int[] IDs;
IDs.sort() // So IDs[i] is always greater than or equal to IDs[i-1].
string url = Base64Encode(IDs[0]);
for (int i = 1; i < IDs.length; i++) {
url += "," + Base64Encode(IDs[i-1] - IDs[i]);
}
Note that you require some separator as the initial ID can be arbitrarily large and the difference between two IDs CAN be more than 63 in which case one Base64 digit is not enough.
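For illustration, here is a rough Python sketch of that scheme (sort, delta-encode, Base64 each delta, join with a separator); the helper names are mine and not part of the answer above:
MAPPING = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"

def num_to_b64(num):
    # Minimal positional base-64 encoder (same alphabet as longTo64 earlier).
    if num == 0:
        return MAPPING[0]
    out = ""
    while num:
        out = MAPPING[num % 64] + out
        num //= 64
    return out

def b64_to_num(s):
    out = 0
    for ch in s:
        out = out * 64 + MAPPING.index(ch)
    return out

def encode_ids(ids):
    # Sort, delta-encode, then join base-64 deltas with "," (the separator is
    # needed because a delta can span more than one base-64 digit).
    ids = sorted(ids)
    deltas = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
    return ",".join(num_to_b64(d) for d in deltas)

def decode_ids(url_part):
    total, out = 0, []
    for token in url_part.split(","):
        total += b64_to_num(token)
        out.append(total)
    return out

# encode_ids([100, 5, 37]) -> "F,g,_" ; decode_ids("F,g,_") -> [5, 37, 100]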
Updated
Just restating that the problem is unsolvable in general. For Y = 64 you can't write 87681 as multipliers + remainder where each of these is below 64. In other words, you cannot write any of the numbers 87617..87681 with multipliers that are below 64: each of these numbers has a prime factor greater than 64. 87616 can be written with factors below 64, but then you'd need that plus 65, so the remainder would be over 64.
So if this was just a brainteaser, it's unsolvable. Was there some practical purpose for this which could be achieved in some way other than using multiplication and a remainder?
And yes, this really should be a comment but I lost my ability to comment at some point. :p
I believe the solution that comes closest is Yevgeny's. It is also easy to extend Yevgeny's solution to remove the limit on the remainder, in which case it would be able to find a solution where the multipliers are smaller than Y and the remainder is as small as possible, even if greater than Y.
Old answer:
If you require that every number in the array be below y, then there is no solution in general. Given a large enough x and a small enough y, you'll end up in an impossible situation. For example, with y of 2 and x of 12 you'd get 2 * 2 * 2 + 4, since 2 * 2 * 2 * 2 would be 16. Even if you allow negative numbers with abs(n) below y, that wouldn't work, as you'd need 2 * 2 * 2 * 2 - 4 in the above example.
And I think the problem is NP-complete even if you limit the inputs to ones known to have an answer where the last term is less than y. It sounds quite a lot like the Knapsack problem. Of course I could be wrong there.
Edit:
Without a more accurate problem description it is hard to solve, but one variant could work in the following way:
1. Set current = x.
2. Break current into its terms (factors).
3. If one of the terms is greater than y, the current number cannot be described in terms no greater than y. Subtract one from current and repeat from step 2.
4. The current number can now be expressed in terms less than y.
5. Calculate the remainder (x - current).
6. Combine as many of the terms as possible.
(Yevgeny Doctor has a more concise (and working) implementation of this, so to prevent confusion I've skipped the implementation.)
OP Wrote:
My goal originally was to come up with a specific method to pack 1..n large integers (aka longs) together so that their String representation is notably shorter than writing the actual number. Think multiples of ten: 10^6 and 1 000 000 are the same, however the representation's length in characters isn't.
I have been down that path before, and as fun as it was to learn all the math, to save you time I will just point you to: http://en.wikipedia.org/wiki/Kolmogorov_complexity
In a nutshell some strings can be easily compressed by changing your notation:
10^9 (4 characters) = 1000000000 (10 characters)
Others cannot:
7829203478 = some random number...
This is a very rough simplification of the article I linked above, so I recommend reading it rather than taking my explanation at face value.
Edit:
If you are trying to make RESTful urls for some set of unique data, why wouldn't you use a hash, such as MD5? Then include the hash as part of the URL, then look up the data based on the hash. Or am I missing something obvious?
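If that fits your case, here is a minimal sketch of the hash-token idea (the function name is mine; it assumes you keep a server-side table mapping token -> data):
import hashlib

def url_token(ids):
    # Hash the canonical (sorted) ID list; the token goes into the URL and the
    # server looks the real list up in a token -> data table.
    key = ",".join(str(i) for i in sorted(ids))
    return hashlib.md5(key.encode("utf-8")).hexdigest()

# e.g. /resource/<url_token([854183415, 1270335149])>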
The original method you chose (a * b + c * d + e) would be very difficult to find optimal solutions for simply due to the large search space of possibilities. You could factorize the number but it's that "+ e" that complicates things since you need to factorize not just that number but quite a few immediately below it.
Two methods for compression spring immediately to mind, both of which give you a much-better-than-10% saving on space from the numeric representation.
A 64-bit number ranges from (unsigned):
0 to
18,446,744,073,709,551,615
or (signed):
-9,223,372,036,854,775,808 to
9,223,372,036,854,775,807
In both cases, you need to reduce the 20-characters taken (without commas) to something a little smaller.
The first is to simply BCD-ify the number, then base64 encode it (actually a slightly modified base64, since "/" would not be kosher in a URL - you should use one of the acceptable characters such as "_").
Converting it to BCD will store two digits (or a sign and a digit) into one byte, giving you an immediate 50% reduction in space (10 bytes). Encoding it base 64 (which turns every 3 bytes into 4 base64 characters) will turn the first 9 bytes into 12 characters and that tenth byte into 2 characters, for a total of 14 characters - that's a 30% saving.
The only better method is to just base64 encode the binary representation. This is better because BCD has a small amount of wastage (each digit only needs about 3.32 bits to store [log2(10)], but BCD uses 4).
Working on the binary representation, we only need to base64 encode the 64-bit number (8 bytes). That needs 8 characters for the first 6 bytes and 3 characters for the final 2 bytes. That's 11 characters of base64 for a saving of 45%.
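That figure is easy to check; here is a small sketch of my own using Python's struct and base64 modules:
import base64
import struct

def encode_u64(n):
    # 8 raw bytes -> 12 base64 chars, or 11 once the "=" padding is stripped.
    raw = struct.pack(">Q", n)  # big-endian unsigned 64-bit
    return base64.urlsafe_b64encode(raw).rstrip(b"=")

print(len(encode_u64(2**64 - 1)))  # 11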
If you wanted maximum compression, there are 73 characters available for URL encoding:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789$-_.+!*'(),
so technically you could probably encode base-73 which, from rough calculations, would still take up 11 characters, but with more complex code which isn't worth it in my opinion.
Of course, that's the maximum compression due to the maximum values. At the other end of the scale (1-digit) this encoding actually results in more data (expansion rather than compression). You can see the improvements only start for numbers over 999, where 4 digits can be turned into 3 base64 characters:
Range (bytes) Chars Base64 chars Compression ratio
------------- ----- ------------ -----------------
< 10 (1) 1 2 -100%
< 100 (1) 2 2 0%
< 1000 (2) 3 3 0%
< 10^4 (2) 4 3 25%
< 10^5 (3) 5 4 20%
< 10^6 (3) 6 4 33%
< 10^7 (3) 7 4 42%
< 10^8 (4) 8 6 25%
< 10^9 (4) 9 6 33%
< 10^10 (5) 10 7 30%
< 10^11 (5) 11 7 36%
< 10^12 (5) 12 7 41%
< 10^13 (6) 13 8 38%
< 10^14 (6) 14 8 42%
< 10^15 (7) 15 10 33%
< 10^16 (7) 16 10 37%
< 10^17 (8) 17 11 35%
< 10^18 (8) 18 11 38%
< 10^19 (8) 19 11 42%
< 2^64 (8) 20 11 45%
Update: I didn't get everything right, so I rewrote the whole thing in a more Java-style fashion. I hadn't thought of the case of a prime factor that is bigger than the divisor; this is fixed now. I leave the original code in to show the idea.
Update 2: I now handle the case of a big prime factor differently, so a result is obtained either way.
public final class PrimeNumberException extends Exception {
private final long primeNumber;
public PrimeNumberException(long x) {
primeNumber = x;
}
public long getPrimeNumber() {
return primeNumber;
}
}
public static Long[] decompose(long x, long y) {
    final ArrayList<Long> operands = new ArrayList<Long>(1000);
    final long rest = x % y;
    try {
        // Extract the rest so the remainder is divisible by y
        final long newX = x - rest;
        // Go into recursion, actually it's a tail recursion
        recDivide(newX, y, operands);
    } catch (PrimeNumberException e) {
        // return new Long[0];
        // or do whatever you like, for example
        operands.add(e.getPrimeNumber());
    }
    // Add the remainder to the array
    operands.add(rest);
    return operands.toArray(new Long[operands.size()]);
}
// The recursive method
private static void recDivide(long x, long y, ArrayList<Long> operands)
throws PrimeNumberException {
while ((x > y) && (y != 1)) {
if (x % y == 0) {
final long rest = x / y;
// Since y is a divisor add it to the list of operands
operands.add(y);
if (rest <= y) {
// the rest is smaller than y, we're finished
operands.add(rest);
}
// go in recursion
x = rest;
} else {
// if the value x isn't divisible by y decrement y so you'll find a
// divisor eventually
if (--y == 1) {
throw new PrimeNumberException(x);
}
}
}
}
Original: Here is some recursive code I came up with. I would have preferred to code it in some functional language, but it was required in Java. I didn't bother converting the numbers to integer but that shouldn't be that hard (yes, I'm lazy ;)
public static Long[] decompose(long x, long y) {
final ArrayList<Long> operands = new ArrayList<Long>();
final long rest = x % y;
// Extract the rest so the remainder is divisible by y
final long newX = x - rest;
// Go into recursion, actually it's a tail recursion
recDivide(newX, y, operands);
// Add the remainder to the array
operands.add(rest);
return operands.toArray(new Long[operands.size()]);
}
// The recursive method
private static void recDivide(long newX, long y, ArrayList<Long> operands) {
long x = newX;
if (x % y == 0) {
final long rest = x / y;
// Since y is a divisor add it to the list of operands
operands.add(y);
if (rest <= y) {
// the rest is smaller than y, we're finished
operands.add(rest);
} else {
// the rest can still be divided, go one level deeper in recursion
recDivide(rest, y, operands);
}
} else {
// if the value x isn't divisible by y decrement y so you'll find a divisor
// eventually
recDivide(x, y-1, operands);
}
}
Are you married to using Java? Python has an entire package dedicated just for this exact purpose. It'll even sanitize the encoding for you to be URL-safe.
Native Python solution
The standard module I'm recommending is base64, which converts arbitrary strings of characters into sanitized base64 format. You can use it in conjunction with the pickle module, which handles conversion from lists of longs (of arbitrary size) to a compressed string representation.
The following code should work on any vanilla installation of Python:
import base64
import pickle
# get some long list of numbers
a = (854183415,1270335149,228790978,1610119503,1785730631,2084495271,
1180819741,1200564070,1594464081,1312769708,491733762,243961400,
655643948,1950847733,492757139,1373886707,336679529,591953597,
2007045617,1653638786)
# this gets you the url-safe string
str64 = base64.urlsafe_b64encode(pickle.dumps(a,-1))
print str64
>>> gAIoSvfN6TJKrca3S0rCEqMNSk95-F9KRxZwakqn3z58Sh3hYUZKZiePR0pRlwlfSqxGP05KAkNPHUo4jooOSixVFCdK9ZJHdEqT4F4dSvPY41FKaVIRFEq9fkgjSvEVoXdKgoaQYnRxAC4=
# this unwinds it
a64 = pickle.loads(base64.urlsafe_b64decode(str64))
print a64
>>> (854183415, 1270335149, 228790978, 1610119503, 1785730631, 2084495271, 1180819741, 1200564070, 1594464081, 1312769708, 491733762, 243961400, 655643948, 1950847733, 492757139, 1373886707, 336679529, 591953597, 2007045617, 1653638786)
Hope that helps. Using Python is probably the closest you'll get to a 1-line solution.
Wrt the original algorithm request: Is there a limit on the size of the last number (beyond that it must be stored in a 32b int)?
(The original request is all I'm able to tackle lol.)
The one that produces the shortest list is:
bool negative = (n < 0);
int j=n%y;
if(n==0 || n==1)
{
list.append(n);
return;
}
while((long64)(n-j*y)>MAX_INT && y>1) //R has to be stored in int32
{
y--;
j=n%y;
}
if(y<=1)
fail //Number has no suitable candidate factors. This shouldn't happen
int i=0;
for(;i<j;i++)
{
list.append(y);
}
list.append(n-y*j);
if(negative)
list[0]*=-1;
return;
A little simplistic compared to most answers given so far but it achieves the desired functionality of the original post... It's a little dirty but hopefully useful :)
Isn't this modulus?
Let / be integer division (whole numbers) and % be modulo.
int result[3];
result[0] = y;
result[1] = x / y;
result[2] = x % y;
Just set x:=x/n where n is the largest number that is less than both x and y. When you end up with x<=y, this is your last number in the sequence.
Like in my comment above, I'm not sure I understand exactly the question. But assuming integers (n and a given y), this should work for the cases you stated:
multipliers[0] = n / y;
multipliers[1] = y;
addedNumber = n % y;
How to calculate the entropy of a file? (Or let's just say a bunch of bytes)
I have an idea, but I'm not sure that it's mathematically correct.
My idea is the following:
Create an array of 256 integers (all zeros).
Traverse through the file and for each of its bytes,
increment the corresponding position in the array.
At the end: Calculate the "average" value for the array.
Initialize a counter with zero,
and for each of the array's entries:
add the entry's difference
to "average" to the counter.
Well, now I'm stuck. How to "project" the counter result in such a way
that all results would lie between 0.0 and 1.0? But I'm sure,
the idea is inconsistent anyway...
I hope someone has better and simpler solutions?
Note: I need the whole thing to make assumptions on the file's contents:
(plaintext, markup, compressed or some binary, ...)
At the end: Calculate the "average" value for the array.
Initialize a counter with zero,
and for each of the array's entries:
add the entry's difference to "average" to the counter.
With some modifications you can get Shannon's entropy:
rename "average" to "entropy"
(float) entropy = 0
for i in the 256-entry array Counts do
    (float) p = Counts[i] / filesize
    if (p > 0) entropy = entropy - p*lg(p) // lg is the logarithm with base 2
Edit:
As Wesley mentioned, we must divide the entropy by 8 in order to adjust it to the range 0..1 (or, alternatively, we can use logarithm base 256).
A simpler solution: gzip the file. Use the ratio of file sizes: (size-of-gzipped)/(size-of-original) as measure of randomness (i.e. entropy).
This method doesn't give you the exact absolute value of entropy (because gzip is not an "ideal" compressor), but it's good enough if you need to compare entropy of different sources.
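If you want to try that, a minimal sketch using Python's zlib (the algorithm behind gzip) could look like this; as noted above, the ratio is only a relative measure:
import zlib

def compression_ratio(path):
    # Rough entropy proxy: compressed size / original size. Closer to 1.0
    # means closer to random; closer to 0.0 means highly redundant.
    with open(path, "rb") as f:
        data = f.read()
    if not data:
        return 0.0
    return float(len(zlib.compress(data, 9))) / len(data)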
To calculate the information entropy of a collection of bytes, you'll need to do something similar to tydok's answer. (tydok's answer works on a collection of bits.)
The following variables are assumed to already exist:
byte_counts is 256-element list of the number of bytes with each value in your file. For example, byte_counts[2] is the number of bytes that have the value 2.
total is the total number of bytes in your file.
I'll write the following code in Python, but it should be obvious what's going on.
import math
entropy = 0
for count in byte_counts:
# If no bytes with this value were seen in the file, it doesn't affect
# the entropy of the file.
if count == 0:
continue
# p is the probability of seeing this byte in the file, as a floating-
# point number
p = 1.0 * count / total
entropy -= p * math.log(p, 256)
There are several things that are important to note.
The check for count == 0 is not just an optimization. If count == 0, then p == 0, and log(p) will be undefined ("negative infinity"), causing an error.
The 256 in the call to math.log represents the number of discrete values that are possible. A byte composed of eight bits will have 256 possible values.
The resulting value will be between 0 (every single byte in the file is the same) up to 1 (the bytes are evenly divided among every possible value of a byte).
An explanation for the use of log base 256
It is true that this algorithm is usually applied using log base 2. This gives the resulting answer in bits. In such a case, you have a maximum of 8 bits of entropy for any given file. Try it yourself: maximize the entropy of the input by making byte_counts a list of all 1 or 2 or 100. When the bytes of a file are evenly distributed, you'll find that there is an entropy of 8 bits.
It is possible to use other logarithm bases. Using b=2 allows a result in bits, as each bit can have 2 values. Using b=10 puts the result in dits, or decimal bits, as there are 10 possible values for each dit. Using b=256 will give the result in bytes, as each byte can have one of 256 discrete values.
Interestingly, using log identities, you can work out how to convert the resulting entropy between units. Any result obtained in units of bits can be converted to units of bytes by dividing by 8. As an interesting, intentional side-effect, this gives the entropy as a value between 0 and 1.
In summary:
You can use various units to express entropy
Most people express entropy in bits (b=2)
For a collection of bytes, this gives a maximum entropy of 8 bits
Since the asker wants a result between 0 and 1, divide this result by 8 for a meaningful value
The algorithm above calculates entropy in bytes (b=256)
This is equivalent to (entropy in bits) / 8
This already gives a value between 0 and 1 (a quick numeric check follows below)
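To make the base-2 versus base-256 relationship concrete, here is a quick check of my own using a perfectly uniform distribution of byte values:
import math

def entropy_in_base(byte_counts, total, base):
    # Same loop as above, parameterized by the logarithm base.
    h = 0.0
    for count in byte_counts:
        if count == 0:
            continue
        p = float(count) / total
        h -= p * math.log(p, base)
    return h

counts = [1] * 256  # every byte value equally likely
print(entropy_in_base(counts, 256, 2))        # 8.0 bits per byte
print(entropy_in_base(counts, 256, 256))      # 1.0
print(entropy_in_base(counts, 256, 2) / 8.0)  # 1.0 -- dividing by 8 agrees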
I'm two years late in answering, so please consider this despite only a few up-votes.
Short answer: use my 1st and 3rd equations below to get what most people mean when they say "entropy" of a file in bits. Use just the 1st equation if you want Shannon's H, which is actually entropy per symbol, as he stated 13 times in his paper (something most people are not aware of). Some online entropy calculators use this one, but Shannon's H is "specific entropy", not "total entropy", which has caused a lot of confusion. Use the 1st and 2nd equations if you want the answer between 0 and 1, which is normalized entropy per symbol (it's not bits/symbol, but a true statistical measure of the "entropic nature" of the data, letting the data choose its own log base instead of arbitrarily assigning 2, e, or 10).
There are 4 types of entropy for a file (data) N symbols long with n unique types of symbols. But keep in mind that by knowing the contents of a file, you know the state it is in and therefore S=0. To be precise, if you have a source that generates a lot of data that you have access to, then you can calculate the expected future entropy/character of that source. If you use the following on a file, it is more accurate to say it is estimating the expected entropy of other files from that source.
Shannon (specific) entropy H = -1*sum(count_i / N * log(count_i / N))
where count_i is the number of times symbol i occured in N.
Units are bits/symbol if log is base 2, nats/symbol if natural log.
Normalized specific entropy: H / log(n)
Units are entropy/symbol. Ranges from 0 to 1. 1 means each symbol occurred equally often and near 0 is where all symbols except 1 occurred only once, and the rest of a very long file was the other symbol. The log is in the same base as the H.
Absolute entropy S = N * H
Units are bits if the log is base 2, nats if it is the natural log.
Normalized absolute entropy S = N * H / log(n)
Unit is "entropy", varies from 0 to N. The log is in the same base as the H.
Although the last one is the truest "entropy", the first one (Shannon entropy H) is what all books call "entropy" without (the needed IMHO) qualification. Most do not clarify (like Shannon did) that it is bits/symbol or entropy per symbol. Calling H "entropy" is speaking too loosely.
For files with equal frequency of each symbol: S = N * H = N. This is the case for most large files of bits. Entropy does not do any compression on the data and is thereby completely ignorant of any patterns, so 000000111111 has the same H and S as 010111101000 (6 1's and 6 0's in both cases).
Like others have said, using a standard compression routine like gzip and dividing before and after will give a better measure of the amount of pre-existing "order" in the file, but that is biased against data that fits the compression scheme better. There's no general purpose perfectly optimized compressor that we can use to define an absolute "order".
Another thing to consider: H changes if you change how you express the data. H will be different if you select different groupings of bits (bits, nibbles, bytes, or hex). So you divide by log(n) where n is the number of unique symbols in the data (2 for binary, 256 for bytes) and H will range from 0 to 1 (this is normalized intensive Shannon entropy in units of entropy per symbol). But technically if only 100 of the 256 types of bytes occur, then n=100, not 256.
H is an "intensive" entropy, i.e. it is per symbol which is analogous to specific entropy in physics which is entropy per kg or per mole. Regular "extensive" entropy of a file analogous to physics' S is S=N*H where N is the number of symbols in the file. H would be exactly analogous to a portion of an ideal gas volume. Information entropy can't simply be made exactly equal to physical entropy in a deeper sense because physical entropy allows for "ordered" as well disordered arrangements: physical entropy comes out more than a completely random entropy (such as a compressed file). One aspect of the different For an ideal gas there is a additional 5/2 factor to account for this: S = k * N * (H+5/2) where H = possible quantum states per molecule = (xp)^3/hbar * 2 * sigma^2 where x=width of the box, p=total non-directional momentum in the system (calculated from kinetic energy and mass per molecule), and sigma=0.341 in keeping with uncertainty principle giving only the number of possible states within 1 std dev.
A little math gives a shorter form of normalized extensive entropy for a file:
S=N * H / log(n) = sum(count_i*log(N/count_i))/log(n)
Units of this are "entropy" (which is not really a unit). It is normalized to be a better universal measure than the "entropy" units of N * H. But it also should not be called "entropy" without clarification because the normal historical convention is to erringly call H "entropy" (which is contrary to the clarifications made in Shannon's text).
For what it's worth, here's the traditional (bits of entropy) calculation represented in C#:
/// <summary>
/// returns bits of entropy represented in a given string, per
/// http://en.wikipedia.org/wiki/Entropy_(information_theory)
/// </summary>
public static double ShannonEntropy(string s)
{
var map = new Dictionary<char, int>();
foreach (char c in s)
{
if (!map.ContainsKey(c))
map.Add(c, 1);
else
map[c] += 1;
}
double result = 0.0;
int len = s.Length;
foreach (var item in map)
{
var frequency = (double)item.Value / len;
result -= frequency * (Math.Log(frequency) / Math.Log(2));
}
return result;
}
Is this something that ent could handle? (Or perhaps it's not available on your platform.)
$ dd if=/dev/urandom of=file bs=1024 count=10
$ ent file
Entropy = 7.983185 bits per byte.
...
As a counter example, here is a file with no entropy.
$ dd if=/dev/zero of=file bs=1024 count=10
$ ent file
Entropy = 0.000000 bits per byte.
...
There's no such thing as the entropy of a file. In information theory, the entropy is a function of a random variable, not of a fixed data set (well, technically a fixed data set does have an entropy, but that entropy would be 0 — we can regard the data as a random distribution that has only one possible outcome with probability 1).
In order to calculate the entropy, you need a random variable with which to model your file. The entropy will then be the entropy of the distribution of that random variable. This entropy will equal the number of bits of information contained in that random variable.
If you use information theory entropy, mind that it might make sense not to use it on bytes. Say, if your data consists of floats you should instead fit a probability distribution to those floats and calculate the entropy of that distribution.
Or, if the contents of the file is unicode characters, you should use those, etc.
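For example, for a list of floats you might bin the values into a histogram and take the entropy of the bin frequencies; a rough sketch (the bin count of 32 is an arbitrary choice of mine):
import math

def float_entropy(values, bins=32):
    # Bin the floats, then compute Shannon entropy (bits) of the bin frequencies.
    lo, hi = min(values), max(values)
    width = (hi - lo) / float(bins) or 1.0  # avoid zero width if all values are equal
    counts = [0] * bins
    for v in values:
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    total = float(len(values))
    return -sum((c / total) * math.log(c / total, 2) for c in counts if c)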
Calculates the entropy of any string of unsigned chars of size "length". This is basically a refactoring of the code found at http://rosettacode.org/wiki/Entropy. I use this for a 64 bit IV generator that creates a container of 100000000 IVs with no dupes and an average entropy of 3.9. http://www.quantifiedtechnologies.com/Programming.html
#include <string>
#include <map>
#include <algorithm>
#include <cmath>
typedef unsigned char uint8;
double Calculate(uint8 * input, int length)
{
std::map<char, int> frequencies;
for (int i = 0; i < length; ++i)
frequencies[input[i]] ++;
double infocontent = 0;
for (std::pair<char, int> p : frequencies)
{
double freq = static_cast<double>(p.second) / length;
infocontent += freq * log2(freq);
}
infocontent *= -1;
return infocontent;
}
Re: I need the whole thing to make assumptions on the file's contents:
(plaintext, markup, compressed or some binary, ...)
As others have pointed out (or been confused/distracted by), I think you're actually talking about metric entropy (entropy divided by length of message). See more at Entropy (information theory) - Wikipedia.
jitter's comment linking to Scanning data for entropy anomalies is very relevant to your underlying goal. That links eventually to libdisorder (C library for measuring byte entropy). That approach would seem to give you lots more information to work with, since it shows how the metric entropy varies in different parts of the file. See e.g. this graph of how the entropy of a block of 256 consecutive bytes from a 4 MB jpg image (y axis) changes for different offsets (x axis). At the beginning and end the entropy is lower, as it also is part-way in, but it is about 7 bits per byte for most of the file.
Source: https://github.com/cyphunk/entropy_examples. [Note that this and other graphs are available via the novel http://nonwhiteheterosexualmalelicense.org license....]
More interesting is the analysis and similar graphs at Analysing the byte entropy of a FAT formatted disk | GL.IB.LY
Statistics like the max, min, mode, and standard deviation of the metric entropy for the whole file and/or the first and last blocks of it might be very helpful as a signature.
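If you want those per-block numbers without pulling in libdisorder, a rough Python sketch (the 256-byte block size mirrors the graph above; the function name is mine) could be:
import math

def block_entropies(path, block_size=256):
    # Entropy (bits per byte) of each consecutive block_size-byte block of a file.
    results = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            counts = [0] * 256
            for b in bytearray(block):
                counts[b] += 1
            total = float(len(block))
            results.append(-sum((c / total) * math.log(c / total, 2)
                                for c in counts if c))
    return results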
This book also seems relevant: Detection and Recognition of File Masquerading for E-mail and Data Security - Springer
Here's a Java algo based on this snippet and the invasion that took place during the infinity war
// requires: java.io.File, java.io.IOException, java.nio.file.Files
public static double shannon_entropy(File file) throws IOException {
byte[] bytes= Files.readAllBytes(file.toPath());//byte sequence
int max_byte = 255;//max byte value
int no_bytes = bytes.length;//file length
int[] freq = new int[256];//byte frequencies
for (int j = 0; j < no_bytes; j++) {
int value = bytes[j] & 0xFF;//integer value of byte
freq[value]++;
}
double entropy = 0.0;
for (int i = 0; i <= max_byte; i++) {
double p = 1.0 * freq[i] / no_bytes;
if (freq[i] > 0)
entropy -= p * Math.log(p) / Math.log(2);
}
return entropy;
}
usage-example:
File file=new File("C:\\Users\\Somewhere\\In\\The\\Omniverse\\Thanos Invasion.Log");
int file_length=(int)file.length();
double shannon_entropy=shannon_entropy(file);
System.out.println("file length: "+file_length+" bytes");
System.out.println("shannon entropy: "+shannon_entropy+" nats i.e. a minimum of "+shannon_entropy+" bits can be used to encode each byte transfer" +
"\nfrom the file so that in total we transfer atleast "+(file_length*shannon_entropy)+" bits ("+((file_length*shannon_entropy)/8D)+
" bytes instead of "+file_length+" bytes).");
output-example:
file length: 5412 bytes
shannon entropy: 4.537883805240875 bits per byte, i.e. a minimum of 4.537883805240875 bits can be used to encode each byte transferred
from the file, so that in total we transfer at least 24559.027153963616 bits (3069.878394245452 bytes instead of 5412 bytes).
Without any additional information, the entropy of a file is (by definition) equal to its size*8 bits. The entropy of a text file is roughly size*6.6 bits, given that:
each character is equally probable
there are 95 printable ASCII characters in a byte
log(95)/log(2) = 6.6
Entropy of text file in English is estimated to be around 0.6 to 1.3 bits per character (as explained here).
In general you cannot talk about entropy of a given file. Entropy is a property of a set of files.
If you need an entropy (or entropy per byte, to be exact) the best way is to compress it using gzip, bz2, rar or any other strong compression, and then divide compressed size by uncompressed size. It would be a great estimate of entropy.
Calculating entropy byte by byte, as Nick Dandoulakis suggested, gives a very poor estimate, because it assumes every byte is independent. In text files, for example, it is much more probable to have a small letter after a letter than a whitespace or punctuation after a letter, since words are typically longer than 2 characters. So the probability of the next character being in the a-z range is correlated with the value of the previous character. Don't use Nick's rough estimate for any real data; use the gzip compression ratio instead.