Stacked MPI derived data types in Fortran - matrix

MPI2 allows us to create derived data types and send them by writing
call mpi_type_create_indexed_block(size,1,dspl_send,rtype,DerType,ierr)
call mpi_send(data,1,DerType,jRank,20,comm,ierr)
By doing this the position dspl_send of data(N) are sent by the MPI library.
Now, for a matrix data(M,N) we can send its position via the following code:
call mpi_type_create_indexed_block(size,M,dspl_send,rtype,DerTypeM,ierr)
call mpi_send(data,1,DerTypeM,jRank,20,comm,ierr)
That is the entries data(i, dspl_send(j)) are sent.
My question concern the role of the 1 in the subsequent mpi_send. Does it has always to be 1? Is another size possible? MPI derived data types are explained nicely in many documents on the internet, but always the size in send/recv is 1 without mention if another size is allowed and then how it could be used.
If we want to work with matrices data(M,N) with a size M that varies between calls, do we need to always create a derived data type whenever we call it? Is it impossible to use DerType for sending a matrix data(M,N) or data(N,M)?

Each MPI datatype has two properties: size and extent. The size is the actual number of bytes that the datatype represent while the extent is the number of bytes that the datatype covers in memory. Some datatypes are not contiguous, which means that their size might be less than their extent, e.g. (shown here in pseudocode)
MPI_TYPE_VECTOR(count = 1,
blocklength = 10,
stride = 20,
oldtype = MPI_INTEGER,
newtype = newtype)
creates a datatype that takes the first 10 (blocklength) elements from a total of 20 (stride). This datatype has a size of 10 times the size of MPI_INTEGER which counts to 40 bytes on most systems. Its extent is two times larger or 80 bytes on most systems. If count was 2, then it would take 10 elements, then skip the next 10, then take another 10 elements and once again skip the next 10. Consequently its size and its extend would be twice as larger.
When you specify a certain element count in any MPI routine, e.g. MPI_SEND, MPI does something like this:
It initialises the internal data buffer with the address of the source buffer argument.
It consults the datatype type map to decide how many bytes and from where to take and appends them to the message being constructed. The number of bytes added equals the size of the datatype.
It increments the internal data pointer by the extent of the datatype.
It decrements the internal count and if it is still non-zero, repeats the previous two steps.
One nifty feature of MPI is that the extent of the datatype is not required to match its size (as shown in the vector example) and one can even bestow whatever value of the extent that he wants on the datatype using MPI_TYPE_CREATE_RESIZED. This allows for very complex data access patterns to be created. For example, using MPI_SCATTERV to scatter a matrix by blocks that do not span entire rows (C) or columns (Fortran) requires the use of such resized types.
Back to the vector example. Whether you create a vector type with count = 1 and then call MPI_SEND with count = 2 or you create a vector type with count = 2 and then call MPI_SEND with count = 1, the end result is the same. Often one constructs a datatype that fully describes the object that one wants to send. In this case one gives count = 1 in the call to MPI_SEND. But there are cases when it might be more beneficial to create a datatype that describes only a portion of the object, for example a single part, and then call MPI_SEND with count set to the number of parts that one wants to send. Sometimes it is a matter of personal preferences, sometimes it is a matter of algorithmic requirements.
As to your last question, Fortran stores matrices in column-major order, which means that data(i,j) is next to data(i±1,j) in memory and not to data(i,j±1). Consequently, data(M,N) consists of N consecutive column-vectors of M elements each. The distance between two elements, for example data(1,1) and data(1,2) depends on M. That's why you supply M in the type constructor. Matrices with different number of rows (e.g. different M) would not "fit" the type map of the created type and the wrong elements would be used to construct the message.

The description about extent in https://stackoverflow.com/a/13802243/7784768 is not entirely correct, as the extent does not take into account the padding in the end of datatype. MPI datatypes are defined by typemap:
typemap = ((type_0, disp_0 ), ..., (type_n−1, disp_n−1 ))
Extent is then defined according to
lb = min(disp_j)
ub = max(disp_j + sizeof(type_j)) + e)
extent = ub - lb,
where e can be non-zero due alignment requirements.
This means that in the example
MPI_TYPE_VECTOR(count = 1,
blocklength = 10,
stride = 20,
oldtype = MPI_INTEGER,
newtype = newtype)
with count=1, typemap is
((int, 0), (int, 4), ... (int, 36))
and extent is in most systems 40 and not 80 (i.e. stride has no effect for the typemap in this case). For count=2, typemap would be
((int, 0), (int, 4), ... (int, 36), (int, 80), (int, 84), ... (int, 116))
and extent 120 (40 bytes for the first block of 10 integers, 40 bytes for the stride, and 40 bytes for the second block of 10 integers, but the remaining stride is neglected in the extent). One can easily find out the extent with the MPI_Type_get_extent function.
Extent is quite tricky concept, and it is easy to make mistakes when trying to communicate multiple elements of derived datatype.

Related

Algorithm to convert offset pagination to page number pagination

I have a service that must receive pagnation queries in offset format, receiving offset and limit parameters. For example, if I receive offset=5&limit=10, I would expect to receive items 5-14 back. I am able to enforce some validation against these parameters, for example to set a maximum value for limit.
My data source must receive pagination requests in page number format, receiving page_number and page_size parameters. For example, if I send page_number=0&page_size=20, I would receive items 0-19. The data source has a maximum page_size of 100.
I need to be able to take the offset pagination parameters I receive and use them to determine appropriate values for the page_number and page_size parameters, in order to return a range from the data source that includes all of the items I need. Additional items may be returned padding out the start and/or end of the range, which can then be filtered out to produce the requested range.
If possible, I should only make a single request to the data source. Optionally, performance can be improved by minimising the size of the range to be requested from the datasource (i.e. fetching 10 items to satisfy a request for 8 items is more efficient than requesting 100 items for the same).
This feels like it should be relatively simple to achieve, but my attempts at simple mathematical solutions don't address all of the edge cases, and my attempts at more robust solutions have started to head into the more complex space of calculating and iterating over factors etc.
Is there a simple way to calculate the appropriate values?
I've put together a test harness REPL with a set of example test cases that makes it easy to trial different implementations here.
An appropriate implementation is to start at page_size=limit, and then increment page_size until there's a single page that contains the whole range from offset to offset+limit.
If you're thinking that you don't want to waste time iterating, then consider that the time taken by this method is at most proportional to the size of the result set, and it's completely insignificant comparted to the time you'll take reading, marshaling, unmarshaling, and processing the results themselves.
I've tested all combinations with offset+limit <= 100000. In all cases, page_size <= 2*limit+20. For large limits, the worst case overhead always occurs when limit=offset+1. At some point it will become more efficient to make 2 requests. You should check for that.
How about this ?
if (limit > 100) {
// return error for exceeding limit...
}
mod_offset = (offset % limit)
page_number = (offset / limit) ;
page_size = limit;
// Sample test cases ..
// offset=25, limit=20 .. mod_offset = 5
(a) page_number = 1, page_size = 20 // skip first 'n' values equal to 'mod_offset'
(b) page_number = 1+1 = 2, page_size = 20 // include only first 'n' values equal to 'mod_offset'
// offset=50, limit=25 .. mod_offset = 0
(a) page_number = 2, page_size = 25 // if offset is multiple of limit, no need to fetch twice...
// offset=125, limit=20 .. mod_offset = 5
(a) page_number = 6, page_size = 20 // skip first 'n' values equal to 'mod_offset'
(b) page_number = 6+1 = 7, page_size = 20 // include only first 'n' values equal to 'mod_offset'

Making a 'hash' of two floats

I have two 32-bit floating point numbers. I want to keep a count of how often any combination of the two occurs. I could technically do this by concatenating them into a string and use a regular hash map to keep track of the count, but the overhead of that is considerable in my application, so I was thinking if there would be a better way. I don't need to keep the full precision of a 32 bit float, and I know that one number is never > 10, and the other never > 100. So I could technically multiple the first by 10000, the second by 1000, cast the result to int to chop off anything after the comma, bit-shift the first nr 16 bits and & them together into an integer. I could then allocate an array of MAX_INT elements and use the integer I just created as an index into that array.
However, that would leave me with a 2GB array, most of which would be empty, so I'd like to avoid that. I was wondering if there are any hashing algorithms that go about this in a more sophisticated way, or any data structures that work in a 'tiered' way, like a tree where a lookup is first done on a combination of the first digits of each number, then on the second numbers and so on, so that no room needs to be allocated for any combinations that aren't known yet. (There is probably a problem with this exact approach, it's just an example of the direction I'm thinking in). Or any other way is fine too - like maybe a more sophisticated way of hashing two floats together, in such a way that the result is scaled between 0 and some number, where the chose of that 'some number' would give me a way to tune the max size of the lookup table in memory.
Any ideas?
You could copy the floats to one buffer and than hash that buffer, like this (this is c/c++):
int hashNumbers(float a, float b){
char bytes[2 * sizeof(float)];
memcpy(bytes, &a, sizeof(float));
memcpy(bytes + sizeof(float), &b, sizeof(float));
//I don't know your implementation of the actual hash function
return hash(bytes, 2 * sizeof(float)); //assuming the hash function would take in an array of bytes with its size.
//I don't know it's output type here. I'm assuming it's an int.
}
As for your has function you could use the modulo, if you for example just take the int which has 2^32 possibilities and that do x = x % 100, that would leave you with 100 possibilities as to what x could be.
If instead of 100 you would take a number which is a power of 2(2, 4, 8, 16, etc). You could accelerate this by instead of using modulo, you use bitwise operations. By doing: x = x >> 24. You now only have 256 possibilities for x.

Sorting 1 million 8-decimal-digit numbers with 1 MB of RAM

I have a computer with 1 MB of RAM and no other local storage. I must use it to accept 1 million 8-digit decimal numbers over a TCP connection, sort them, and then send the sorted list out over another TCP connection.
The list of numbers may contain duplicates, which I must not discard. The code will be placed in ROM, so I need not subtract the size of my code from the 1 MB. I already have code to drive the Ethernet port and handle TCP/IP connections, and it requires 2 KB for its state data, including a 1 KB buffer via which the code will read and write data. Is there a solution to this problem?
Sources Of Question And Answer:
slashdot.org
cleaton.net
There is one rather sneaky trick not mentioned here so far. We assume that you have no extra way to store data, but that is not strictly true.
One way around your problem is to do the following horrible thing, which should not be attempted by anyone under any circumstances: Use the network traffic to store data. And no, I don't mean NAS.
You can sort the numbers with only a few bytes of RAM in the following way:
First take 2 variables: COUNTER and VALUE.
First set all registers to 0;
Every time you receive an integer I, increment COUNTER and set VALUE to max(VALUE, I);
Then send an ICMP echo request packet with data set to I to the router. Erase I and repeat.
Every time you receive the returned ICMP packet, you simply extract the integer and send it back out again in another echo request. This produces a huge number of ICMP requests scuttling backward and forward containing the integers.
Once COUNTER reaches 1000000, you have all of the values stored in the incessant stream of ICMP requests, and VALUE now contains the maximum integer. Pick some threshold T >> 1000000. Set COUNTER to zero. Every time you receive an ICMP packet, increment COUNTER and send the contained integer I back out in another echo request, unless I=VALUE, in which case transmit it to the destination for the sorted integers. Once COUNTER=T, decrement VALUE by 1, reset COUNTER to zero and repeat. Once VALUE reaches zero you should have transmitted all integers in order from largest to smallest to the destination, and have only used about 47 bits of RAM for the two persistent variables (and whatever small amount you need for the temporary values).
I know this is horrible, and I know there can be all sorts of practical issues, but I thought it might give some of you a laugh or at least horrify you.
Here's some working C++ code which solves the problem.
Proof that the memory constraints are satisfied:
Editor: There is no proof of the maximum memory requirements offered by the author either in this post or in his blogs. Since the number of bits necessary to encode a value depends on the values previously encoded, such a proof is likely non-trivial. The author notes that the largest encoded size he could stumble upon empirically was 1011732, and chose the buffer size 1013000 arbitrarily.
typedef unsigned int u32;
namespace WorkArea
{
static const u32 circularSize = 253250;
u32 circular[circularSize] = { 0 }; // consumes 1013000 bytes
static const u32 stageSize = 8000;
u32 stage[stageSize]; // consumes 32000 bytes
...
Together, these two arrays take 1045000 bytes of storage. That leaves 1048576 - 1045000 - 2×1024 = 1528 bytes for remaining variables and stack space.
It runs in about 23 seconds on my Xeon W3520. You can verify that the program works using the following Python script, assuming a program name of sort1mb.exe.
from subprocess import *
import random
sequence = [random.randint(0, 99999999) for i in xrange(1000000)]
sorter = Popen('sort1mb.exe', stdin=PIPE, stdout=PIPE)
for value in sequence:
sorter.stdin.write('%08d\n' % value)
sorter.stdin.close()
result = [int(line) for line in sorter.stdout]
print('OK!' if result == sorted(sequence) else 'Error!')
A detailed explanation of the algorithm can be found in the following series of posts:
1MB Sorting Explained
Arithmetic Coding and the 1MB Sorting Problem
Arithmetic Encoding Using Fixed-Point Math
Please see the first correct answer or the later answer with arithmetic encoding. Below you may find some fun, but not a 100% bullet-proof solution.
This is quite an interesting task and here is an another solution. I hope somebody would find the result useful (or at least interesting).
Stage 1: Initial data structure, rough compression approach, basic results
Let's do some simple math: we have 1M (1048576 bytes) of RAM initially available to store 10^6 8 digit decimal numbers. [0;99999999]. So to store one number 27 bits are needed (taking the assumption that unsigned numbers will be used). Thus, to store a raw stream ~3.5M of RAM will be needed. Somebody already said it doesn't seem to be feasible, but I would say the task can be solved if the input is "good enough". Basically, the idea is to compress the input data with compression factor 0.29 or higher and do sorting in a proper manner.
Let's solve the compression issue first. There are some relevant tests already available:
http://www.theeggeadventure.com/wikimedia/index.php/Java_Data_Compression
"I ran a test to compress one million consecutive integers using
various forms of compression. The results are as follows:"
None 4000027
Deflate 2006803
Filtered 1391833
BZip2 427067
Lzma 255040
It looks like LZMA (Lempel–Ziv–Markov chain algorithm) is a good choice to continue with. I've prepared a simple PoC, but there are still some details to be highlighted:
Memory is limited so the idea is to presort numbers and use
compressed buckets (dynamic size) as temporary storage
It is easier to achieve a better compression factor with presorted
data, so there is a static buffer for each bucket (numbers from the buffer are to be sorted before LZMA)
Each bucket holds a specific range, so the final sort can be done for
each bucket separately
Bucket's size can be properly set, so there will be enough memory to
decompress stored data and do the final sort for each bucket separately
Please note, attached code is a POC, it can't be used as a final solution, it just demonstrates the idea of using several smaller buffers to store presorted numbers in some optimal way (possibly compressed). LZMA is not proposed as a final solution. It is used as a fastest possible way to introduce a compression to this PoC.
See the PoC code below (please note it just a demo, to compile it LZMA-Java will be needed):
public class MemorySortDemo {
static final int NUM_COUNT = 1000000;
static final int NUM_MAX = 100000000;
static final int BUCKETS = 5;
static final int DICT_SIZE = 16 * 1024; // LZMA dictionary size
static final int BUCKET_SIZE = 1024;
static final int BUFFER_SIZE = 10 * 1024;
static final int BUCKET_RANGE = NUM_MAX / BUCKETS;
static class Producer {
private Random random = new Random();
public int produce() { return random.nextInt(NUM_MAX); }
}
static class Bucket {
public int size, pointer;
public int[] buffer = new int[BUFFER_SIZE];
public ByteArrayOutputStream tempOut = new ByteArrayOutputStream();
public DataOutputStream tempDataOut = new DataOutputStream(tempOut);
public ByteArrayOutputStream compressedOut = new ByteArrayOutputStream();
public void submitBuffer() throws IOException {
Arrays.sort(buffer, 0, pointer);
for (int j = 0; j < pointer; j++) {
tempDataOut.writeInt(buffer[j]);
size++;
}
pointer = 0;
}
public void write(int value) throws IOException {
if (isBufferFull()) {
submitBuffer();
}
buffer[pointer++] = value;
}
public boolean isBufferFull() {
return pointer == BUFFER_SIZE;
}
public byte[] compressData() throws IOException {
tempDataOut.close();
return compress(tempOut.toByteArray());
}
private byte[] compress(byte[] input) throws IOException {
final BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(input));
final DataOutputStream out = new DataOutputStream(new BufferedOutputStream(compressedOut));
final Encoder encoder = new Encoder();
encoder.setEndMarkerMode(true);
encoder.setNumFastBytes(0x20);
encoder.setDictionarySize(DICT_SIZE);
encoder.setMatchFinder(Encoder.EMatchFinderTypeBT4);
ByteArrayOutputStream encoderPrperties = new ByteArrayOutputStream();
encoder.writeCoderProperties(encoderPrperties);
encoderPrperties.flush();
encoderPrperties.close();
encoder.code(in, out, -1, -1, null);
out.flush();
out.close();
in.close();
return encoderPrperties.toByteArray();
}
public int[] decompress(byte[] properties) throws IOException {
InputStream in = new ByteArrayInputStream(compressedOut.toByteArray());
ByteArrayOutputStream data = new ByteArrayOutputStream(10 * 1024);
BufferedOutputStream out = new BufferedOutputStream(data);
Decoder decoder = new Decoder();
decoder.setDecoderProperties(properties);
decoder.code(in, out, 4 * size);
out.flush();
out.close();
in.close();
DataInputStream input = new DataInputStream(new ByteArrayInputStream(data.toByteArray()));
int[] array = new int[size];
for (int k = 0; k < size; k++) {
array[k] = input.readInt();
}
return array;
}
}
static class Sorter {
private Bucket[] bucket = new Bucket[BUCKETS];
public void doSort(Producer p, Consumer c) throws IOException {
for (int i = 0; i < bucket.length; i++) { // allocate buckets
bucket[i] = new Bucket();
}
for(int i=0; i< NUM_COUNT; i++) { // produce some data
int value = p.produce();
int bucketId = value/BUCKET_RANGE;
bucket[bucketId].write(value);
c.register(value);
}
for (int i = 0; i < bucket.length; i++) { // submit non-empty buffers
bucket[i].submitBuffer();
}
byte[] compressProperties = null;
for (int i = 0; i < bucket.length; i++) { // compress the data
compressProperties = bucket[i].compressData();
}
printStatistics();
for (int i = 0; i < bucket.length; i++) { // decode & sort buckets one by one
int[] array = bucket[i].decompress(compressProperties);
Arrays.sort(array);
for(int v : array) {
c.consume(v);
}
}
c.finalCheck();
}
public void printStatistics() {
int size = 0;
int sizeCompressed = 0;
for (int i = 0; i < BUCKETS; i++) {
int bucketSize = 4*bucket[i].size;
size += bucketSize;
sizeCompressed += bucket[i].compressedOut.size();
System.out.println(" bucket[" + i
+ "] contains: " + bucket[i].size
+ " numbers, compressed size: " + bucket[i].compressedOut.size()
+ String.format(" compression factor: %.2f", ((double)bucket[i].compressedOut.size())/bucketSize));
}
System.out.println(String.format("Data size: %.2fM",(double)size/(1014*1024))
+ String.format(" compressed %.2fM",(double)sizeCompressed/(1014*1024))
+ String.format(" compression factor %.2f",(double)sizeCompressed/size));
}
}
static class Consumer {
private Set<Integer> values = new HashSet<>();
int v = -1;
public void consume(int value) {
if(v < 0) v = value;
if(v > value) {
throw new IllegalArgumentException("Current value is greater than previous: " + v + " > " + value);
}else{
v = value;
values.remove(value);
}
}
public void register(int value) {
values.add(value);
}
public void finalCheck() {
System.out.println(values.size() > 0 ? "NOT OK: " + values.size() : "OK!");
}
}
public static void main(String[] args) throws IOException {
Producer p = new Producer();
Consumer c = new Consumer();
Sorter sorter = new Sorter();
sorter.doSort(p, c);
}
}
With random numbers it produces the following:
bucket[0] contains: 200357 numbers, compressed size: 353679 compression factor: 0.44
bucket[1] contains: 199465 numbers, compressed size: 352127 compression factor: 0.44
bucket[2] contains: 199682 numbers, compressed size: 352464 compression factor: 0.44
bucket[3] contains: 199949 numbers, compressed size: 352947 compression factor: 0.44
bucket[4] contains: 200547 numbers, compressed size: 353914 compression factor: 0.44
Data size: 3.85M compressed 1.70M compression factor 0.44
For a simple ascending sequence (one bucket is used) it produces:
bucket[0] contains: 1000000 numbers, compressed size: 256700 compression factor: 0.06
Data size: 3.85M compressed 0.25M compression factor 0.06
EDIT
Conclusion:
Don't try to fool the Nature
Use simpler compression with lower memory footprint
Some additional clues are really needed. Common bullet-proof solution does not seem to be feasible.
Stage 2: Enhanced compression, final conclusion
As was already mentioned in the previous section, any suitable compression technique can be used. So let's get rid of LZMA in favor of simpler and better (if possible) approach. There are a lot of good solutions including Arithmetic coding, Radix tree etc.
Anyway, simple but useful encoding scheme will be more illustrative than yet another external library, providing some nifty algorithm. The actual solution is pretty straightforward: since there are buckets with partially sorted data, deltas can be used instead of numbers.
Random input test shows slightly better results:
bucket[0] contains: 10103 numbers, compressed size: 13683 compression factor: 0.34
bucket[1] contains: 9885 numbers, compressed size: 13479 compression factor: 0.34
...
bucket[98] contains: 10026 numbers, compressed size: 13612 compression factor: 0.34
bucket[99] contains: 10058 numbers, compressed size: 13701 compression factor: 0.34
Data size: 3.85M compressed 1.31M compression factor 0.34
Sample code
public static void encode(int[] buffer, int length, BinaryOut output) {
short size = (short)(length & 0x7FFF);
output.write(size);
output.write(buffer[0]);
for(int i=1; i< size; i++) {
int next = buffer[i] - buffer[i-1];
int bits = getBinarySize(next);
int len = bits;
if(bits > 24) {
output.write(3, 2);
len = bits - 24;
}else if(bits > 16) {
output.write(2, 2);
len = bits-16;
}else if(bits > 8) {
output.write(1, 2);
len = bits - 8;
}else{
output.write(0, 2);
}
if (len > 0) {
if ((len % 2) > 0) {
len = len / 2;
output.write(len, 2);
output.write(false);
} else {
len = len / 2 - 1;
output.write(len, 2);
}
output.write(next, bits);
}
}
}
public static short decode(BinaryIn input, int[] buffer, int offset) {
short length = input.readShort();
int value = input.readInt();
buffer[offset] = value;
for (int i = 1; i < length; i++) {
int flag = input.readInt(2);
int bits;
int next = 0;
switch (flag) {
case 0:
bits = 2 * input.readInt(2) + 2;
next = input.readInt(bits);
break;
case 1:
bits = 8 + 2 * input.readInt(2) +2;
next = input.readInt(bits);
break;
case 2:
bits = 16 + 2 * input.readInt(2) +2;
next = input.readInt(bits);
break;
case 3:
bits = 24 + 2 * input.readInt(2) +2;
next = input.readInt(bits);
break;
}
buffer[offset + i] = buffer[offset + i - 1] + next;
}
return length;
}
Please note, this approach:
does not consume a lot of memory
works with streams
provides not so bad results
Full code can be found here, BinaryInput and BinaryOutput implementations can be found here
Final conclusion
No final conclusion :) Sometimes it is really good idea to move one level up and review the task from a meta-level point of view.
It was fun to spend some time with this task. BTW, there are a lot of interesting answers below. Thank you for your attention and happy codding.
A solution is possible only because of the difference between 1 megabyte and 1 million bytes. There are about 2 to the power 8093729.5 different ways to choose 1 million 8-digit numbers with duplicates allowed and order unimportant, so a machine with only 1 million bytes of RAM doesn't have enough states to represent all the possibilities. But 1M (less 2k for TCP/IP) is 1022*1024*8 = 8372224 bits, so a solution is possible.
Part 1, initial solution
This approach needs a little more than 1M, I'll refine it to fit into 1M later.
I'll store a compact sorted list of numbers in the range 0 to 99999999 as a sequence of sublists of 7-bit numbers. The first sublist holds numbers from 0 to 127, the second sublist holds numbers from 128 to 255, etc. 100000000/128 is exactly 781250, so 781250 such sublists will be needed.
Each sublist consists of a 2-bit sublist header followed by a sublist body. The sublist body takes up 7 bits per sublist entry. The sublists are all concatenated together, and the format makes it possible to tell where one sublist ends and the next begins. The total storage required for a fully populated list is 2*781250 + 7*1000000 = 8562500 bits, which is about 1.021 M-bytes.
The 4 possible sublist header values are:
00 Empty sublist, nothing follows.
01 Singleton, there is only one entry in the sublist and and next 7 bits hold it.
10 The sublist holds at least 2 distinct numbers. The entries are stored in non-decreasing order, except that the last entry is less than or equal to the first. This allows the end of the sublist to be identified. For example, the numbers 2,4,6 would be stored as (4,6,2). The numbers 2,2,3,4,4 would be stored as (2,3,4,4,2).
11 The sublist holds 2 or more repetitions of a single number. The next 7 bits give the number. Then come zero or more 7-bit entries with the value 1, followed by a 7-bit entry with the value 0. The length of the sublist body dictates the number of repetitions. For example, the numbers 12,12 would be stored as (12,0), the numbers 12,12,12 would be stored as (12,1,0), 12,12,12,12 would be (12,1,1,0) and so on.
I start off with an empty list, read a bunch of numbers in and store them as 32 bit integers, sort the new numbers in place (using heapsort, probably) and then merge them into a new compact sorted list. Repeat until there are no more numbers to read, then walk the compact list once more to generate the output.
The line below represents memory just before the start of the list merge operation. The "O"s are the region that hold the sorted 32-bit integers. The "X"s are the region that hold the old compact list. The "=" signs are the expansion room for the compact list, 7 bits for each integer in the "O"s. The "Z"s are other random overhead.
ZZZOOOOOOOOOOOOOOOOOOOOOOOOOO==========XXXXXXXXXXXXXXXXXXXXXXXXXX
The merge routine starts reading at the leftmost "O" and at the leftmost "X", and starts writing at the leftmost "=". The write pointer doesn't catch the compact list read pointer until all of the new integers are merged, because both pointers advance 2 bits for each sublist and 7 bits for each entry in the old compact list, and there is enough extra room for the 7-bit entries for the new numbers.
Part 2, cramming it into 1M
To Squeeze the solution above into 1M, I need to make the compact list format a bit more compact. I'll get rid of one of the sublist types, so that there will be just 3 different possible sublist header values. Then I can use "00", "01" and "1" as the sublist header values and save a few bits. The sublist types are:
A Empty sublist, nothing follows.
B Singleton, there is only one entry in the sublist and and next 7 bits hold it.
C The sublist holds at least 2 distinct numbers. The entries are stored in non-decreasing order, except that the last entry is less than or equal to the first. This allows the end of the sublist to be identified. For example, the numbers 2,4,6 would be stored as (4,6,2). The numbers 2,2,3,4,4 would be stored as (2,3,4,4,2).
D The sublist consists of 2 or more repetitions of a single number.
My 3 sublist header values will be "A", "B" and "C", so I need a way to represent D-type sublists.
Suppose I have the C-type sublist header followed by 3 entries, such as "C[17][101][58]". This can't be part of a valid C-type sublist as described above, since the third entry is less than the second but more than the first. I can use this type of construct to represent a D-type sublist. In bit terms, anywhere I have "C{00?????}{1??????}{01?????}" is an impossible C-type sublist. I'll use this to represent a sublist consisting of 3 or more repetitions of a single number. The first two 7-bit words encode the number (the "N" bits below) and are followed by zero or more {0100001} words followed by a {0100000} word.
For example, 3 repetitions: "C{00NNNNN}{1NN0000}{0100000}", 4 repetitions: "C{00NNNNN}{1NN0000}{0100001}{0100000}", and so on.
That just leaves lists that hold exactly 2 repetitions of a single number. I'll represent those with another impossible C-type sublist pattern: "C{0??????}{11?????}{10?????}". There's plenty of room for the 7 bits of the number in the first 2 words, but this pattern is longer than the sublist that it represents, which makes things a bit more complex. The five question-marks at the end can be considered not part of the pattern, so I have: "C{0NNNNNN}{11N????}10" as my pattern, with the number to be repeated stored in the "N"s. That's 2 bits too long.
I'll have to borrow 2 bits and pay them back from the 4 unused bits in this pattern. When reading, on encountering "C{0NNNNNN}{11N00AB}10", output 2 instances of the number in the "N"s, overwrite the "10" at the end with bits A and B, and rewind the read pointer by 2 bits. Destructive reads are ok for this algorithm, since each compact list gets walked only once.
When writing a sublist of 2 repetitions of a single number, write "C{0NNNNNN}11N00" and set the borrowed bits counter to 2. At every write where the borrowed bits counter is non-zero, it is decremented for each bit written and "10" is written when the counter hits zero. So the next 2 bits written will go into slots A and B, and then the "10" will get dropped onto the end.
With 3 sublist header values represented by "00", "01" and "1", I can assign "1" to the most popular sublist type. I'll need a small table to map sublist header values to sublist types, and I'll need an occurrence counter for each sublist type so that I know what the best sublist header mapping is.
The worst case minimal representation of a fully populated compact list occurs when all the sublist types are equally popular. In that case I save 1 bit for every 3 sublist headers, so the list size is 2*781250 + 7*1000000 - 781250/3 = 8302083.3 bits. Rounding up to a 32 bit word boundary, thats 8302112 bits, or 1037764 bytes.
1M minus the 2k for TCP/IP state and buffers is 1022*1024 = 1046528 bytes, leaving me 8764 bytes to play with.
But what about the process of changing the sublist header mapping ? In the memory map below, "Z" is random overhead, "=" is free space, "X" is the compact list.
ZZZ=====XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Start reading at the leftmost "X" and start writing at the leftmost "=" and work right. When it's done the compact list will be a little shorter and it will be at the wrong end of memory:
ZZZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX=======
So then I'll need to shunt it to the right:
ZZZ=======XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
In the header mapping change process, up to 1/3 of the sublist headers will be changing from 1-bit to 2-bit. In the worst case these will all be at the head of the list, so I'll need at least 781250/3 bits of free storage before I start, which takes me back to the memory requirements of the previous version of the compact list :(
To get around that, I'll split the 781250 sublists into 10 sublist groups of 78125 sublists each. Each group has its own independent sublist header mapping. Using the letters A to J for the groups:
ZZZ=====AAAAAABBCCCCDDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
Each sublist group shrinks or stays the same during a sublist header mapping change:
ZZZ=====AAAAAABBCCCCDDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAA=====BBCCCCDDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABB=====CCCCDDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCC======DDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDD======EEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDDEEE======FFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDDEEEFFF======GGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDDEEEFFFGGGGGGGGGG=======HHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDDEEEFFFGGGGGGGGGGHH=======IJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDDEEEFFFGGGGGGGGGGHHI=======JJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDDEEEFFFGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ=======
ZZZ=======AAAAAABBCCCDDDDDEEEFFFGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
The worst case temporary expansion of a sublist group during a mapping change is 78125/3 = 26042 bits, under 4k. If I allow 4k plus the 1037764 bytes for a fully populated compact list, that leaves me 8764 - 4096 = 4668 bytes for the "Z"s in the memory map.
That should be plenty for the 10 sublist header mapping tables, 30 sublist header occurrence counts and the other few counters, pointers and small buffers I'll need, and space I've used without noticing, like stack space for function call return addresses and local variables.
Part 3, how long would it take to run?
With an empty compact list the 1-bit list header will be used for an empty sublist, and the starting size of the list will be 781250 bits. In the worst case the list grows 8 bits for each number added, so 32 + 8 = 40 bits of free space are needed for each of the 32-bit numbers to be placed at the top of the list buffer and then sorted and merged. In the worst case, changing the sublist header mapping results in a space usage of 2*781250 + 7*entries - 781250/3 bits.
With a policy of changing the sublist header mapping after every fifth merge once there are at least 800000 numbers in the list, a worst case run would involve a total of about 30M of compact list reading and writing activity.
Source:
http://nick.cleaton.net/ramsortsol.html
Gilmanov's answer is very wrong in its assumptions. It starts speculating based in a pointless measure of a million consecutive integers. That means no gaps. Those random gaps, however small, really makes it a poor idea.
Try it yourself. Get 1 million random 27-bit integers, sort them, compress with 7-Zip, xz, whatever LZMA you want. The result is over 1.5 MB. The premise on top is the compression of sequential numbers. Even delta encoding of that is over 1.1 MB. And never mind, this is using over 100 MB of RAM for compression. So even the compressed integers don't fit the problem and never mind run time RAM usage.
It's saddens me how people just upvote pretty graphics and rationalization.
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
int32_t ints[1000000]; // Random 27-bit integers
int cmpi32(const void *a, const void *b) {
return ( *(int32_t *)a - *(int32_t *)b );
}
int main() {
int32_t *pi = ints; // Pointer to input ints (REPLACE W/ read from net)
// Fill pseudo-random integers of 27 bits
srand(time(NULL));
for (int i = 0; i < 1000000; i++)
ints[i] = rand() & ((1<<27) - 1); // Random 32 bits masked to 27 bits
qsort(ints, 1000000, sizeof (ints[0]), cmpi32); // Sort 1000000 int32s
// Now delta encode, optional, store differences to previous int
for (int i = 1, prev = ints[0]; i < 1000000; i++) {
ints[i] -= prev;
prev += ints[i];
}
FILE *f = fopen("ints.bin", "w");
fwrite(ints, 4, 1000000, f);
fclose(f);
exit(0);
}
Now compress ints.bin with LZMA...
$ xz -f --keep ints.bin # 100 MB RAM
$ 7z a ints.bin.7z ints.bin # 130 MB RAM
$ ls -lh ints.bin*
3.8M ints.bin
1.1M ints.bin.7z
1.2M ints.bin.xz
I think one way to think about this is from a combinatorics viewpoint: how many possible combinations of sorted number orderings are there? If we give the combination 0,0,0,....,0 the code 0, and 0,0,0,...,1 the code 1, and 99999999, 99999999, ... 99999999 the code N, what is N? In other words, how big is the result space?
Well, one way to think about this is noticing that this is a bijection of the problem of finding the number of monotonic paths in an N x M grid, where N = 1,000,000 and M = 100,000,000. In other words, if you have a grid that is 1,000,000 wide and 100,000,000 tall, how many shortest paths from the bottom left to the top right are there? Shortest paths of course require you only ever either move right or up (if you were to move down or left you would be undoing previously accomplished progress). To see how this is a bijection of our number sorting problem, observe the following:
You can imagine any horizontal leg in our path as a number in our ordering, where the Y location of the leg represents the value.
So if the path simply moves to the right all the way to the end, then jumps all the way to the top, that is equivalent to the ordering 0,0,0,...,0. if it instead begins by jumping all the way to the top and then moves to the right 1,000,000 times, that is equivalent to 99999999,99999999,..., 99999999. A path where it moves right once, then up once, then right one, then up once, etc to the very end (then necessarily jumps all the way to the top), is equivalent to 0,1,2,3,...,999999.
Luckily for us this problem has already been solved, such a grid has (N + M) Choose (M) paths:
(1,000,000 + 100,000,000) Choose (100,000,000) ~= 2.27 * 10^2436455
N thus equals 2.27 * 10^2436455, and so the code 0 represents 0,0,0,...,0 and the code 2.27 * 10^2436455 and some change represents 99999999,99999999,..., 99999999.
In order to store all the numbers from 0 to 2.27 * 10^2436455 you need lg2 (2.27 * 10^2436455) = 8.0937 * 10^6 bits.
1 megabyte = 8388608 bits > 8093700 bits
So it appears that we at least actually have enough room to store the result! Now of course the interesting bit is doing the sorting as the numbers stream in. Not sure the best approach to this is given we have 294908 bits remaining. I imagine an interesting technique would be to at each point assume that that is is the entire ordering, finding the code for that ordering, and then as you receive a new number going back and updating the previous code. Hand wave hand wave.
My suggestions here owe a lot to Dan's solution
First off I assume the solution must handle all possible input lists. I think the popular answers do not make this assumption (which IMO is a huge mistake).
It is known that no form of lossless compression will reduce the size of all inputs.
All the popular answers assume they will be able to apply compression effective enough to allow them extra space. In fact, a chunk of extra space large enough to hold some portion of their partially completed list in an uncompressed form and allow them to perform their sorting operations. This is just a bad assumption.
For such a solution, anyone with knowledge of how they do their compression will be able to design some input data that does not compress well for this scheme, and the "solution" will most likely then break due to running out of space.
Instead I take a mathematical approach. Our possible outputs are all the lists of length LEN consisting of elements in the range 0..MAX. Here the LEN is 1,000,000 and our MAX is 100,000,000.
For arbitrary LEN and MAX, the amount of bits needed to encode this state is:
Log2(MAX Multichoose LEN)
So for our numbers, once we have completed recieving and sorting, we will need at least Log2(100,000,000 MC 1,000,000) bits to store our result in a way that can uniquely distinguish all possible outputs.
This is ~= 988kb. So we actually have enough space to hold our result. From this point of view, it is possible.
[Deleted pointless rambling now that better examples exist...]
Best answer is here.
Another good answer is here and basically uses insertion sort as the function to expand the list by one element (buffers a few elements and pre-sorts, to allow insertion of more than one at a time, saves a bit of time). uses a nice compact state encoding too, buckets of seven bit deltas
Suppose this task is possible. Just prior to output, there will be an in-memory representation of the million sorted numbers. How many different such representations are there? Since there may be repeated numbers we can't use nCr (choose), but there is an operation called multichoose that works on multisets.
There are 2.2e2436455 ways to choose a million numbers in range 0..99,999,999.
That requires 8,093,730 bits to represent every possible combination, or 1,011,717 bytes.
So theoretically it may be possible, if you can come up with a sane (enough) representation of the sorted list of numbers. For example, an insane representation might require a 10MB lookup table or thousands of lines of code.
However, if "1M RAM" means one million bytes, then clearly there is not enough space. The fact that 5% more memory makes it theoretically possible suggests to me that the representation will have to be VERY efficient and probably not sane.
(My original answer was wrong, sorry for the bad math, see below the break.)
How about this?
The first 27 bits store the lowest number you have seen, then the difference to the next number seen, encoded as follows: 5 bits to store the number of bits used in storing the difference, then the difference. Use 00000 to indicate that you saw that number again.
This works because as more numbers are inserted, the average difference between numbers goes down, so you use less bits to store the difference as you add more numbers. I believe this is called a delta list.
The worst case I can think of is all numbers evenly spaced (by 100), e.g. Assuming 0 is the first number:
000000000000000000000000000 00111 1100100
^^^^^^^^^^^^^
a million times
27 + 1,000,000 * (5+7) bits = ~ 427k
Reddit to the rescue!
If all you had to do was sort them, this problem would be easy. It takes 122k (1 million bits) to store which numbers you have seen (0th bit on if 0 was seen, 2300th bit on if 2300 was seen, etc.
You read the numbers, store them in the bit field, and then shift the bits out while keeping a count.
BUT, you have to remember how many you have seen. I was inspired by the sublist answer above to come up with this scheme:
Instead of using one bit, use either 2 or 27 bits:
00 means you did not see the number.
01 means you saw it once
1 means you saw it, and the next 26 bits are the count of how many times.
I think this works: if there are no duplicates, you have a 244k list.
In the worst case you see each number twice (if you see one number three times, it shortens the rest of the list for you), that means you have seen 50,000 more than once, and you have seen 950,000 items 0 or 1 times.
50,000 * 27 + 950,000 * 2 = 396.7k.
You can make further improvements if you use the following encoding:
0 means you did not see the number
10 means you saw it once
11 is how you keep count
Which will, on average, result in 280.7k of storage.
EDIT: my Sunday morning math was wrong.
The worst case is we see 500,000 numbers twice, so the math becomes:
500,000 *27 + 500,000 *2 = 1.77M
The alternate encoding results in an average storage of
500,000 * 27 + 500,000 = 1.70M
: (
There is one solution to this problem across all possible inputs. Cheat.
Read m values over TCP, where m is near the max that can be sorted in memory, maybe n/4.
Sort the 250,000 (or so) numbers and output them.
Repeat for the other 3 quarters.
Let the receiver merge the 4 lists of numbers it has received as it processes them. (It's not much slower than using a single list.)
What kind of computer are you using? It may not have any other "normal" local storage, but does it have video RAM, for example? 1 megapixel x 32 bits per pixel (say) is pretty close to your required data input size.
(I largely ask in memory of the old Acorn RISC PC, which could 'borrow' VRAM to expand the available system RAM, if you chose a low resolution or low colour-depth screen mode!). This was rather useful on a machine with only a few MB of normal RAM.
I would try a Radix Tree. If you could store the data in a tree, you could then do an in-order traverse to transmit the data.
I'm not sure you could fit this into 1MB, but I think it's worth a try.
A radix tree representation would come close to handling this problem, since the radix tree takes advantage of "prefix compression". But it's hard to conceive of a radix tree representation that could represent a single node in one byte -- two is probably about the limit.
But, regardless of how the data is represented, once it is sorted it can be stored in prefix-compressed form, where the numbers 10, 11, and 12 would be represented by, say 001b, 001b, 001b, indicating an increment of 1 from the previous number. Perhaps, then, 10101b would represent an increment of 5, 1101001b an increment of 9, etc.
There are 10^6 values in a range of 10^8, so there's one value per hundred code points on average. Store the distance from the Nth point to the (N+1)th. Duplicate values have a skip of 0. This means that the skip needs an average of just under 7 bits to store, so a million of them will happily fit into our 8 million bits of storage.
These skips need to be encoded into a bitstream, say by Huffman encoding. Insertion is by iterating through the bitstream and rewriting after the new value. Output by iterating through and writing out the implied values. For practicality, it probably wants to be done as, say, 10^4 lists covering 10^4 code points (and an average of 100 values) each.
A good Huffman tree for random data can be built a priori by assuming a Poisson distribution (mean=variance=100) on the length of the skips, but real statistics can be kept on the input and used to generate an optimal tree to deal with pathological cases.
I have a computer with 1M of RAM and no other local storage
Another way to cheat: you could use non-local (networked) storage instead (your question does not preclude this) and call a networked service that could use straightforward disk-based mergesort (or just enough RAM to sort in-memory, since you only need to accept 1M numbers), without needing the (admittedly extremely ingenious) solutions already given.
This might be cheating, but it's not clear whether you are looking for a solution to a real-world problem, or a puzzle that invites bending of the rules... if the latter, then a simple cheat may get better results than a complex but "genuine" solution (which as others have pointed out, can only work for compressible inputs).
Google's (bad) approach, from HN thread. Store RLE-style counts.
Your initial data structure is '99999999:0' (all zeros, haven't seen any numbers) and then lets say you see the number 3,866,344 so your data structure becomes '3866343:0,1:1,96133654:0' as you can see the numbers will always alternate between number of zero bits and number of '1' bits so you can just assume the odd numbers represent 0 bits and the even numbers 1 bits. This becomes (3866343,1,96133654)
Their problem doesn't seem to cover duplicates, but let's say they use "0:1" for duplicates.
Big problem #1: insertions for 1M integers would take ages.
Big problem #2: like all plain delta encoding solutions, some distributions can't be covered this way. For example, 1m integers with distances 0:99 (e.g. +99 each one). Now think the same but with random distance in the range of 0:99. (Note: 99999999/1000000 = 99.99)
Google's approach is both unworthy (slow) and incorrect. But to their defense, their problem might have been slightly different.
I think the solution is to combine techniques from video encoding, namely the discrete cosine transformation. In digital video, rather recording the changing the brightness or colour of video as regular values such as 110 112 115 116, each is subtracted from the last (similar to run length encoding). 110 112 115 116 becomes 110 2 3 1. The values, 2 3 1 require less bits than the originals.
So lets say we create a list of the input values as they arrive on the socket. We are storing in each element, not the value, but the offset of the one before it. We sort as we go, so the offsets are only going to be positive. But the offset could be 8 decimal digits wide which this fits in 3 bytes. Each element can't be 3 bytes, so we need to pack these. We could use the top bit of each byte as a "continue bit", indicating that the next byte is part of the number and the lower 7 bits of each byte need to be combined. zero is valid for duplicates.
As the list fills up, the numbers should be get closer together, meaning on average only 1 byte is used to determine the distance to the next value. 7 bits of value and 1 bit of offset if convenient, but there may be a sweet spot that requires less than 8 bits for a "continue" value.
Anyway, I did some experiment. I use a random number generator and I can fit a million sorted 8 digit decimal numbers into about 1279000 bytes. The average space between each number is consistently 99...
public class Test {
public static void main(String[] args) throws IOException {
// 1 million values
int[] values = new int[1000000];
// create random values up to 8 digits lrong
Random random = new Random();
for (int x=0;x<values.length;x++) {
values[x] = random.nextInt(100000000);
}
Arrays.sort(values);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
int av = 0;
writeCompact(baos, values[0]); // first value
for (int x=1;x<values.length;x++) {
int v = values[x] - values[x-1]; // difference
av += v;
System.out.println(values[x] + " diff " + v);
writeCompact(baos, v);
}
System.out.println("Average offset " + (av/values.length));
System.out.println("Fits in " + baos.toByteArray().length);
}
public static void writeCompact(OutputStream os, long value) throws IOException {
do {
int b = (int) value & 0x7f;
value = (value & 0x7fffffffffffffffl) >> 7;
os.write(value == 0 ? b : (b | 0x80));
} while (value != 0);
}
}
We could play with the networking stack to send the numbers in sorted order before we have all the numbers. If you send 1M of data, TCP/IP will break it into 1500 byte packets and stream them in order to the target. Each packet will be given a sequence number.
We can do this by hand. Just before we fill our RAM we can sort what we have and send the list to our target but leave holes in our sequence around each number. Then process the 2nd 1/2 of the numbers the same way using those holes in the sequence.
The networking stack on the far end will assemble the resulting data stream in order of sequence before handing it up to the application.
It's using the network to perform a merge sort. This is a total hack, but I was inspired by the other networking hack listed previously.
I would exploit the retransmission behaviour of TCP.
Make the TCP component create a large receive window.
Receive some amount of packets without sending an ACK for them.
Process those in passes creating some (prefix) compressed data structure
Send duplicate ack for last packet that is not needed anymore/wait for retransmission timeout
Goto 2
All packets were accepted
This assumes some kind of benefit of buckets or multiple passes.
Probably by sorting the batches/buckets and merging them. -> radix trees
Use this technique to accept and sort the first 80% then read the last 20%, verify that the last 20% do not contain numbers that would land in the first 20% of the lowest numbers. Then send the 20% lowest numbers, remove from memory, accept the remaining 20% of new numbers and merge.**
To represent the sorted array one can just store the first element and the difference between adjacent elements. In this way we are concerned with encoding 10^6 elements that can sum up to at most 10^8. Let's call this D. To encode the elements of D one can use a Huffman code. The dictionary for the Huffman code can be created on the go and the array updated every time a new item is inserted in the sorted array (insertion sort). Note that when the dictionary changes because of a new item the whole array should be updated to match the new encoding.
The average number of bits for encoding each element of D is maximized if we have equal number of each unique element. Say elements d1, d2, ..., dN in D each appear F times. In that case (in worst case we have both 0 and 10^8 in input sequence) we have
sum(1<=i<=N) F. di = 10^8
where
sum(1<=i<=N) F = 10^6, or F=10^6/N and the normalized frequency will be p= F/10^=1/N
The average number of bits will be -log2(1/P) = log2(N). Under these circumstances we should find a case that maximizes N. This happens if we have consecutive numbers for di starting from 0, or, di= i-1, therefore
10^8=sum(1<=i<=N) F. di = sum(1<=i<=N) (10^6/N) (i-1) = (10^6/N) N (N-1)/2
i.e.
N <= 201. And for this case average number of bits is log2(201)=7.6511 which means we will need around 1 byte per input element for saving the sorted array. Note that this doesn't mean D in general cannot have more than 201 elements. It just sows that if elements of D are uniformly distributed, it cannot have more than 201 unique values.
Here is a generalized solution to this kind of problem:
General procedure
The taken approach is as follows. The algorithm operates on a single buffer of 32-bit words. It performs the following procedure in a loop:
We start with a buffer filled with compressed data from the last iteration. The buffer looks like this
|compressed sorted|empty|
Calculate the maximum amount of numbers that can be stored in this buffer, both compressed and uncompressed. Split the buffer into these two sections, beginning with the space for compressed data, ending with the uncompressed data. The buffer looks like
|compressed sorted|empty|empty|
Fill the uncompressed section with numbers to be sorted. The buffer looks like
|compressed sorted|empty|uncompressed unsorted|
Sort the new numbers with an in-place sort. The buffer looks like
|compressed sorted|empty|uncompressed sorted|
Right-align any already compressed data from the previous iteration in the compressed section. At this point the buffer is partitioned
|empty|compressed sorted|uncompressed sorted|
Perform a streaming decompression-recompression on the compressed section, merging in the sorted data in the uncompressed section. The old compressed section is consumed as the new compressed section grows. The buffer looks like
|compressed sorted|empty|
This procedure is performed until all numbers have been sorted.
Compression
This algorithm of course only works when it's possible to calculate the final compressed size of the new sorting buffer before actually knowing what will actually be compressed. Next to that, the compression algorithm needs to be good enough to solve the actual problem.
The used approach uses three steps. First, the algorithm will always store sorted sequences, therefore we can instead store purely the differences between consecutive entries. Each difference is in the range [0, 99999999].
These differences are then encoded as a unary bitstream. A 1 in this stream means "Add 1 to the accumulator, A 0 means "Emit the accumulator as an entry, and reset". So difference N will be represented by N 1's and one 0.
The sum of all differences will approach the maximum value that the algorithm supports, and the count of all differences will approach the amount of values inserted in the algorithm. This means we expect the stream to, at the end, contain max value 1's and count 0's. This allows us to calculate the expected probability of a 0 and 1 in the stream. Namely, the probability of a 0 is count/(count+maxval) and the probability of a 1 is maxval/(count+maxval).
We use these probabilities to define an arithmetic coding model over this bitstream. This arithmetic code will encode exactly this amounts of 1's and 0's in optimal space. We can calculate the space used by this model for any intermediate bitstream as: bits = encoded * log2(1 + amount / maxval) + maxval * log2(1 + maxval / amount). To calculate the total required space for the algorithm, set encoded equal to amount.
To not require a ridiculous amount of iterations, a small overhead can be added to the buffer. This will ensure that the algorithm will at least operate on the amount of numbers that fit in this overhead, as by far the largest time cost of the algorithm is the arithmetic coding compression and decompression each cycle.
Next to that, some overhead is necessary to store bookkeeping data and to handle slight inaccuracies in the fixed-point approximation of the arithmetic coding algorithm, but in total the algorithm is able to fit in 1MiB of space even with an extra buffer that can contain 8000 numbers, for a total of 1043916 bytes of space.
Optimality
Outside of reducing the (small) overhead of the algorithm it should be theoretically impossible to get a smaller result. To just contain the entropy of the final result, 1011717 bytes would be necessary. If we subtract the extra buffer added for efficiency this algorithm used 1011916 bytes to store the final result + overhead.
If the input stream could be received few times this would be much
easier (no information about that, idea and time-performance problem).
Then, we could count the decimal values. With counted values it would be
easy to make the output stream. Compress by counting the values. It
depends what would be in the input stream.
If the input stream could be received few times this would be much easier (no info about that, idea and time-performance problem). Then, we could count the decimal values. With counted values it would be easy to make the output stream. Compress by counting the values.
It depends what would be in the input stream.
Sorting is a secondary problem here. As other said, just storing the integers is hard, and cannot work on all inputs, since 27 bits per number would be necessary.
My take on this is: store only the differences between the consecutive (sorted) integers, as they will be most likely small. Then use a compression scheme, e.g. with 2 additional bits per input number, to encode how many bits the number is stored on.
Something like:
00 -> 5 bits
01 -> 11 bits
10 -> 19 bits
11 -> 27 bits
It should be possible to store a fair number of possible input lists within the given memory constraint. The maths of how to pick the compression scheme to have it work on the maximum number of inputs, are beyond me.
I hope you may be able to exploit domain-specific knowledge of your input to find a good enough integer compression scheme based on this.
Oh and then, you do an insertion sort on that sorted list as you receive data.
Now aiming to an actual solution, covering all possible cases of input in the 8 digit range with only 1MB of RAM. NOTE: work in progress, tomorrow will continue. Using arithmetic coding of deltas of the sorted ints, worst case for 1M sorted ints would cost about 7bits per entry (since 99999999/1000000 is 99, and log2(99) is almost 7 bits).
But you need the 1m integers sorted to get to 7 or 8 bits! Shorter series would have bigger deltas, therefore more bits per element.
I'm working on taking as many as possible and compressing (almost) in-place. First batch of close to 250K ints would need about 9 bits each at best. So result would take about 275KB. Repeat with remaining free memory a few times. Then decompress-merge-in-place-compress those compressed chunks. This is quite hard, but possible. I think.
The merged lists would get closer and closer to the 7bit per integer target. But I don't know how many iterations it would take of the merge loop. Perhaps 3.
But the imprecision of the arithmetic coding implementation might make it impossible. If this problem is possible at all, it would be extremely tight.
Any volunteers?
You just need to store the differences between the numbers in sequence, and use an encoding to compress these sequence numbers. We have 2^23 bits. We shall divide it into 6bit chunks, and let the last bit indicate whether the number extends to another 6 bits (5bits plus extending chunk).
Thus, 000010 is 1, and 000100 is 2. 000001100000 is 128. Now, we consider the worst cast in representing differences in sequence of a numbers up to 10,000,000. There can be 10,000,000/2^5 differences greater than 2^5, 10,000,000/2^10 differences greater than 2^10, and 10,000,000/2^15 differences greater than 2^15, etc.
So, we add how many bits it will take to represent our the sequence. We have 1,000,000*6 + roundup(10,000,000/2^5)*6+roundup(10,000,000/2^10)*6+roundup(10,000,000/2^15)*6+roundup(10,000,000/2^20)*4=7935479.
2^24 = 8388608. Since 8388608 > 7935479, we should easily have enough memory. We will probably need another little bit of memory to store the sum of where are when we insert new numbers. We then go through the sequence, and find where to insert our new number, decrease the next difference if necessary, and shift everything after it right.
If we don't know anything about those numbers, we are limited by the following constraints:
we need to load all numbers before we can sort them them,
the set of numbers is not compressible.
If these assumptions hold, there is no way to carry out your task, as you will need at least 26,575,425 bits of storage (3,321,929 bytes).
What can you tell us about your data ?
The trick is to represent the algorithms state, which is an integer multi-set, as a compressed stream of "increment counter"="+" and "output counter"="!" characters. For example, the set {0,3,3,4} would be represented as "!+++!!+!", followed by any number of "+" characters. To modify the multi-set you stream out the characters, keeping only a constant amount decompressed at a time, and make changes inplace before streaming them back in compressed form.
Details
We know there are exactly 10^6 numbers in the final set, so there are at most 10^6 "!" characters. We also know that our range has size 10^8, meaning there are at most 10^8 "+" characters. The number of ways we can arrange 10^6 "!"s amongst 10^8 "+"s is (10^8 + 10^6) choose 10^6, and so specifying some particular arrangement takes ~0.965 MiB` of data. That'll be a tight fit.
We can treat each character as independent without exceeding our quota. There are exactly 100 times more "+" characters than "!" characters, which simplifies to 100:1 odds of each character being a "+" if we forget that they are dependent. Odds of 100:101 corresponds to ~0.08 bits per character, for an almost identical total of ~0.965 MiB (ignoring the dependency has a cost of only ~12 bits in this case!).
The simplest technique for storing independent characters with known prior probability is Huffman coding. Note that we need an impractically large tree (A huffman tree for blocks of 10 characters has an average cost per block of about 2.4 bits, for a total of ~2.9 Mib. A huffman tree for blocks of 20 characters has an average cost per block of about 3 bits, which is a total of ~1.8 MiB. We're probably going to need a block of size on the order of a hundred, implying more nodes in our tree than all the computer equipment that has ever existed can store.). However, ROM is technically "free" according to the problem and practical solutions that take advantage of the regularity in the tree will look essentially the same.
Pseudo-code
Have a sufficiently large huffman tree (or similar block-by-block compression data) stored in ROM
Start with a compressed string of 10^8 "+" characters.
To insert the number N, stream out the compressed string until N "+" characters have gone past then insert a "!". Stream the recompressed string back over the previous one as you go, keeping a constant amount of buffered blocks to avoid over/under-runs.
Repeat one million times: [input, stream decompress>insert>compress], then decompress to output
We have 1 MB - 3 KB RAM = 2^23 - 3*2^13 bits = 8388608 - 24576 = 8364032 bits available.
We are given 10^6 numbers in a 10^8 range. This gives an average gap of ~100 < 2^7 = 128
Let's first consider the simpler problem of fairly evenly spaced numbers when all gaps are < 128. This is easy. Just store the first number and the 7-bit gaps:
(27 bits) + 10^6 7-bit gap numbers = 7000027 bits required
Note repeated numbers have gaps of 0.
But what if we have gaps larger than 127?
OK, let's say a gap size < 127 is represented directly, but a gap size of 127 is followed by a continuous 8-bit encoding for the actual gap length:
10xxxxxx xxxxxxxx = 127 .. 16,383
110xxxxx xxxxxxxx xxxxxxxx = 16384 .. 2,097,151
etc.
Note this number representation describes its own length so we know when the next gap number starts.
With just small gaps < 127, this still requires 7000027 bits.
There can be up to (10^8)/(2^7) = 781250 23-bit gap number, requiring an extra 16*781,250 = 12,500,000 bits which is too much. We need a more compact and slowly increasing representation of gaps.
The average gap size is 100 so if we reorder them as
[100, 99, 101, 98, 102, ..., 2, 198, 1, 199, 0, 200, 201, 202, ...]
and index this with a dense binary Fibonacci base encoding with no pairs of zeros (for example, 11011=8+5+2+1=16) with numbers delimited by '00' then I think we can keep the gap representation short enough, but it needs more analysis.
While receiving the stream do these steps.
1st set some reasonable chunk size
Pseudo Code idea:
The first step would be to find all the duplicates and stick them in a dictionary with its count and remove them.
The third step would be to place number that exist in sequence of their algorithmic steps and place them in counters special dictionaries with the first number and their step like n, n+1..., n+2, 2n, 2n+1, 2n+2...
Begin to compress in chunks some reasonable ranges of number like every 1000 or ever 10000 the remaining numbers that appear less often to repeat.
Uncompress that range if a number is found and add it to the range and leave it uncompressed for a while longer.
Otherwise just add that number to a byte[chunkSize]
Continue the first 4 steps while receiving the stream. The final step would be to either fail if you exceeded memory or start outputting the result once all the data is collected by beginning to sort the ranges and spit out the results in order and uncompressing those in order that need to be uncompressed and sort them when you get to them.

How does Google Protocol Buffers compare to ASN.1

What are the most noticable differences between Google Protocol Buffers and ASN.1 (with PER-encoding)? For my project the most imporant issue is the size of the serialized data. Has anyone done any data-size comparisons between the two?
If you use ASN.1 with Unaligned PER, and define your data types using the appropriate constraints (e.g., specifying lower/upper bounds for integers, upper bounds for the length of lists, etc.), your encodings will be very compact. There will be no bits wasted for things like alignment or padding between the fields, and each field will be encoded in the minimum number of bits necessary to hold its permitted range of values. For example, a field of type INTEGER (1..8) will be encoded in 3 bits (1='000', 2='001', ..., 8='111'); and a CHOICE with four alternatives will occupy 2 bits (indicating the chosen alternative) plus the bits occupied by the chosen alternative. ASN.1 has many other interesting features that have been successfully used in many published standards. An example is the extension marker ("..."), which when applied to SEQUENCE, CHOICE, ENUMERATED, and other types, enables backward- and forward compatibility between endpoints implementing different versions of the specification.
It's a long time since I've done any ASN.1 work, but the size is very likely to depend on the details of your types and actual data.
I would strongly recommend that you prototype both and put some real data in to compare.
If your protocol buffer would contain repeated primitive types, you should look at the latest source in Subversion for Protocol Buffers - they can be represented in a "packed" format now which is much more space-efficient. (My C# port has just caught up with this feature, some time last week.)
When size of the packed/encoded message is important you should also note the fact that protobuf is not able to pack repeated fields that are not of a primitive numeric type, read this for more information.
This is an issue e.g. if you have messages of that type: (comment defines actual range of values)
message P{
required sint32 x = 1; // -0x1ffff to 0x20000
required sint32 y = 2; // -0x1ffff to 0x20000
required sint32 z = 3; // -0x319c to 0x3200
}
message Array{
repeated P ps = 1;
optional uint32 somemoredata = 2;
}
In case you have an array length of, e.g., 32 than you would result in a packed message size of approximately 250 to 450 bytes with protobuf, depending on what values the array actually contains. This can even increase to over 1000 bytes in case you use the full 32bit range or in case you use int32 instead of sint32 and have negative values.
The raw data blob (assuming that z can be defined as int16 value) would only consume 320 bytes and thus the ASN.1 message is always smaller than 320 bytes since the max values are actually not 32bit but 19bit (x,y) and 15bit (z).
The protobuf message size can be optimized with this message definition:
message Ps{
repeated sint32 xs = 1 [packed=true];
repeated sint32 ys = 2 [packed=true];
repeated sint32 zs = 3 [packed=true];
}
message Array{
required Ps ps = 1;
optional uint32 somemoredata = 2;
}
which results in message sizes between approximately 100 byte (all values are zeros), 300 byte (values at range max), and 500 byte (all values are high 32bit values).
Protocol Buffers does not guarantee preservation of the order of fields in the binary encoding but ASN.1 does. It is not related to size so might not be the most noticeable in your use case but it is an important difference for comparison, for digital signatures, for simplified parsing, and possibly other applications.

How to calculate the entropy of a file?

How to calculate the entropy of a file? (Or let's just say a bunch of bytes)
I have an idea, but I'm not sure that it's mathematically correct.
My idea is the following:
Create an array of 256 integers (all zeros).
Traverse through the file and for each of its bytes,
increment the corresponding position in the array.
At the end: Calculate the "average" value for the array.
Initialize a counter with zero,
and for each of the array's entries:
add the entry's difference
to "average" to the counter.
Well, now I'm stuck. How to "project" the counter result in such a way
that all results would lie between 0.0 and 1.0? But I'm sure,
the idea is inconsistent anyway...
I hope someone has better and simpler solutions?
Note: I need the whole thing to make assumptions on the file's contents:
(plaintext, markup, compressed or some binary, ...)
At the end: Calculate the "average" value for the array.
Initialize a counter with zero,
and for each of the array's entries:
add the entry's difference to "average" to the counter.
With some modifications you can get Shannon's entropy:
rename "average" to "entropy"
(float) entropy = 0
for i in the array[256]:Counts do
(float)p = Counts[i] / filesize
if (p > 0) entropy = entropy - p*lg(p) // lgN is the logarithm with base 2
Edit:
As Wesley mentioned, we must divide entropy by 8 in order to adjust it in the range 0 . . 1 (or alternatively, we can use the logarithmic base 256).
A simpler solution: gzip the file. Use the ratio of file sizes: (size-of-gzipped)/(size-of-original) as measure of randomness (i.e. entropy).
This method doesn't give you the exact absolute value of entropy (because gzip is not an "ideal" compressor), but it's good enough if you need to compare entropy of different sources.
To calculate the information entropy of a collection of bytes, you'll need to do something similar to tydok's answer. (tydok's answer works on a collection of bits.)
The following variables are assumed to already exist:
byte_counts is 256-element list of the number of bytes with each value in your file. For example, byte_counts[2] is the number of bytes that have the value 2.
total is the total number of bytes in your file.
I'll write the following code in Python, but it should be obvious what's going on.
import math
entropy = 0
for count in byte_counts:
# If no bytes of this value were seen in the value, it doesn't affect
# the entropy of the file.
if count == 0:
continue
# p is the probability of seeing this byte in the file, as a floating-
# point number
p = 1.0 * count / total
entropy -= p * math.log(p, 256)
There are several things that are important to note.
The check for count == 0 is not just an optimization. If count == 0, then p == 0, and log(p) will be undefined ("negative infinity"), causing an error.
The 256 in the call to math.log represents the number of discrete values that are possible. A byte composed of eight bits will have 256 possible values.
The resulting value will be between 0 (every single byte in the file is the same) up to 1 (the bytes are evenly divided among every possible value of a byte).
An explanation for the use of log base 256
It is true that this algorithm is usually applied using log base 2. This gives the resulting answer in bits. In such a case, you have a maximum of 8 bits of entropy for any given file. Try it yourself: maximize the entropy of the input by making byte_counts a list of all 1 or 2 or 100. When the bytes of a file are evenly distributed, you'll find that there is an entropy of 8 bits.
It is possible to use other logarithm bases. Using b=2 allows a result in bits, as each bit can have 2 values. Using b=10 puts the result in dits, or decimal bits, as there are 10 possible values for each dit. Using b=256 will give the result in bytes, as each byte can have one of 256 discrete values.
Interestingly, using log identities, you can work out how to convert the resulting entropy between units. Any result obtained in units of bits can be converted to units of bytes by dividing by 8. As an interesting, intentional side-effect, this gives the entropy as a value between 0 and 1.
In summary:
You can use various units to express entropy
Most people express entropy in bits (b=2)
For a collection of bytes, this gives a maximum entropy of 8 bits
Since the asker wants a result between 0 and 1, divide this result by 8 for a meaningful value
The algorithm above calculates entropy in bytes (b=256)
This is equivalent to (entropy in bits) / 8
This already gives a value between 0 and 1
I'm two years late in answering, so please consider this despite only a few up-votes.
Short answer: use my 1st and 3rd bold equations below to get what most people are thinking about when they say "entropy" of a file in bits. Use just 1st equation if you want Shannon's H entropy which is actually entropy/symbol as he stated 13 times in his paper which most people are not aware of. Some online entropy calculators use this one, but Shannon's H is "specific entropy", not "total entropy" which has caused so much confusion. Use 1st and 2nd equation if you want the answer between 0 and 1 which is normalized entropy/symbol (it's not bits/symbol, but a true statistical measure of the "entropic nature" of the data by letting the data choose its own log base instead of arbitrarily assigning 2, e, or 10).
There 4 types of entropy of files (data) of N symbols long with n unique types of symbols. But keep in mind that by knowing the contents of a file, you know the state it is in and therefore S=0. To be precise, if you have a source that generates a lot of data that you have access to, then you can calculate the expected future entropy/character of that source. If you use the following on a file, it is more accurate to say it is estimating the expected entropy of other files from that source.
Shannon (specific) entropy H = -1*sum(count_i / N * log(count_i / N))
where count_i is the number of times symbol i occured in N.
Units are bits/symbol if log is base 2, nats/symbol if natural log.
Normalized specific entropy: H / log(n)
Units are entropy/symbol. Ranges from 0 to 1. 1 means each symbol occurred equally often and near 0 is where all symbols except 1 occurred only once, and the rest of a very long file was the other symbol. The log is in the same base as the H.
Absolute entropy S = N * H
Units are bits if log is base 2, nats if ln()).
Normalized absolute entropy S = N * H / log(n)
Unit is "entropy", varies from 0 to N. The log is in the same base as the H.
Although the last one is the truest "entropy", the first one (Shannon entropy H) is what all books call "entropy" without (the needed IMHO) qualification. Most do not clarify (like Shannon did) that it is bits/symbol or entropy per symbol. Calling H "entropy" is speaking too loosely.
For files with equal frequency of each symbol: S = N * H = N. This is the case for most large files of bits. Entropy does not do any compression on the data and is thereby completely ignorant of any patterns, so 000000111111 has the same H and S as 010111101000 (6 1's and 6 0's in both cases).
Like others have said, using a standard compression routine like gzip and dividing before and after will give a better measure of the amount of pre-existing "order" in the file, but that is biased against data that fits the compression scheme better. There's no general purpose perfectly optimized compressor that we can use to define an absolute "order".
Another thing to consider: H changes if you change how you express the data. H will be different if you select different groupings of bits (bits, nibbles, bytes, or hex). So you divide by log(n) where n is the number of unique symbols in the data (2 for binary, 256 for bytes) and H will range from 0 to 1 (this is normalized intensive Shannon entropy in units of entropy per symbol). But technically if only 100 of the 256 types of bytes occur, then n=100, not 256.
H is an "intensive" entropy, i.e. it is per symbol which is analogous to specific entropy in physics which is entropy per kg or per mole. Regular "extensive" entropy of a file analogous to physics' S is S=N*H where N is the number of symbols in the file. H would be exactly analogous to a portion of an ideal gas volume. Information entropy can't simply be made exactly equal to physical entropy in a deeper sense because physical entropy allows for "ordered" as well disordered arrangements: physical entropy comes out more than a completely random entropy (such as a compressed file). One aspect of the different For an ideal gas there is a additional 5/2 factor to account for this: S = k * N * (H+5/2) where H = possible quantum states per molecule = (xp)^3/hbar * 2 * sigma^2 where x=width of the box, p=total non-directional momentum in the system (calculated from kinetic energy and mass per molecule), and sigma=0.341 in keeping with uncertainty principle giving only the number of possible states within 1 std dev.
A little math gives a shorter form of normalized extensive entropy for a file:
S=N * H / log(n) = sum(count_i*log(N/count_i))/log(n)
Units of this are "entropy" (which is not really a unit). It is normalized to be a better universal measure than the "entropy" units of N * H. But it also should not be called "entropy" without clarification because the normal historical convention is to erringly call H "entropy" (which is contrary to the clarifications made in Shannon's text).
For what it's worth, here's the traditional (bits of entropy) calculation represented in C#:
/// <summary>
/// returns bits of entropy represented in a given string, per
/// http://en.wikipedia.org/wiki/Entropy_(information_theory)
/// </summary>
public static double ShannonEntropy(string s)
{
var map = new Dictionary<char, int>();
foreach (char c in s)
{
if (!map.ContainsKey(c))
map.Add(c, 1);
else
map[c] += 1;
}
double result = 0.0;
int len = s.Length;
foreach (var item in map)
{
var frequency = (double)item.Value / len;
result -= frequency * (Math.Log(frequency) / Math.Log(2));
}
return result;
}
Is this something that ent could handle? (Or perhaps its not available on your platform.)
$ dd if=/dev/urandom of=file bs=1024 count=10
$ ent file
Entropy = 7.983185 bits per byte.
...
As a counter example, here is a file with no entropy.
$ dd if=/dev/zero of=file bs=1024 count=10
$ ent file
Entropy = 0.000000 bits per byte.
...
There's no such thing as the entropy of a file. In information theory, the entropy is a function of a random variable, not of a fixed data set (well, technically a fixed data set does have an entropy, but that entropy would be 0 — we can regard the data as a random distribution that has only one possible outcome with probability 1).
In order to calculate the entropy, you need a random variable with which to model your file. The entropy will then be the entropy of the distribution of that random variable. This entropy will equal the number of bits of information contained in that random variable.
If you use information theory entropy, mind that it might make sense not to use it on bytes. Say, if your data consists of floats you should instead fit a probability distribution to those floats and calculate the entropy of that distribution.
Or, if the contents of the file is unicode characters, you should use those, etc.
Calculates entropy of any string of unsigned chars of size "length". This is basically a refactoring of the code found at http://rosettacode.org/wiki/Entropy. I use this for a 64 bit IV generator that creates a container of 100000000 IV's with no dupes and a average entropy of 3.9. http://www.quantifiedtechnologies.com/Programming.html
#include <string>
#include <map>
#include <algorithm>
#include <cmath>
typedef unsigned char uint8;
double Calculate(uint8 * input, int length)
{
std::map<char, int> frequencies;
for (int i = 0; i < length; ++i)
frequencies[input[i]] ++;
double infocontent = 0;
for (std::pair<char, int> p : frequencies)
{
double freq = static_cast<double>(p.second) / length;
infocontent += freq * log2(freq);
}
infocontent *= -1;
return infocontent;
}
Re: I need the whole thing to make assumptions on the file's contents:
(plaintext, markup, compressed or some binary, ...)
As others have pointed out (or been confused/distracted by), I think you're actually talking about metric entropy (entropy divided by length of message). See more at Entropy (information theory) - Wikipedia.
jitter's comment linking to Scanning data for entropy anomalies is very relevant to your underlying goal. That links eventually to libdisorder (C library for measuring byte entropy). That approach would seem to give you lots more information to work with, since it shows how the metric entropy varies in different parts of the file. See e.g. this graph of how the entropy of a block of 256 consecutive bytes from a 4 MB jpg image (y axis) changes for different offsets (x axis). At the beginning and end the entropy is lower, as it part-way in, but it is about 7 bits per byte for most of the file.
Source: https://github.com/cyphunk/entropy_examples. [Note that this and other graphs are available via the novel http://nonwhiteheterosexualmalelicense.org license....]
More interesting is the analysis and similar graphs at Analysing the byte entropy of a FAT formatted disk | GL.IB.LY
Statistics like the max, min, mode, and standard deviation of the metric entropy for the whole file and/or the first and last blocks of it might be very helpful as a signature.
This book also seems relevant: Detection and Recognition of File Masquerading for E-mail and Data Security - Springer
Here's a Java algo based on this snippet and the invasion that took place during the infinity war
public static double shannon_entropy(File file) throws IOException {
byte[] bytes= Files.readAllBytes(file.toPath());//byte sequence
int max_byte = 255;//max byte value
int no_bytes = bytes.length;//file length
int[] freq = new int[256];//byte frequencies
for (int j = 0; j < no_bytes; j++) {
int value = bytes[j] & 0xFF;//integer value of byte
freq[value]++;
}
double entropy = 0.0;
for (int i = 0; i <= max_byte; i++) {
double p = 1.0 * freq[i] / no_bytes;
if (freq[i] > 0)
entropy -= p * Math.log(p) / Math.log(2);
}
return entropy;
}
usage-example:
File file=new File("C:\\Users\\Somewhere\\In\\The\\Omniverse\\Thanos Invasion.Log");
int file_length=(int)file.length();
double shannon_entropy=shannon_entropy(file);
System.out.println("file length: "+file_length+" bytes");
System.out.println("shannon entropy: "+shannon_entropy+" nats i.e. a minimum of "+shannon_entropy+" bits can be used to encode each byte transfer" +
"\nfrom the file so that in total we transfer atleast "+(file_length*shannon_entropy)+" bits ("+((file_length*shannon_entropy)/8D)+
" bytes instead of "+file_length+" bytes).");
output-example:
file length: 5412 bytes
shannon entropy: 4.537883805240875 nats i.e. a minimum of 4.537883805240875 bits can be used to encode each byte transfer
from the file so that in total we transfer atleast 24559.027153963616 bits (3069.878394245452 bytes instead of 5412 bytes).
Without any additional information entropy of a file is (by definition) equal to its size*8 bits. Entropy of text file is roughly size*6.6 bits, given that:
each character is equally probable
there are 95 printable characters in byte
log(95)/log(2) = 6.6
Entropy of text file in English is estimated to be around 0.6 to 1.3 bits per character (as explained here).
In general you cannot talk about entropy of a given file. Entropy is a property of a set of files.
If you need an entropy (or entropy per byte, to be exact) the best way is to compress it using gzip, bz2, rar or any other strong compression, and then divide compressed size by uncompressed size. It would be a great estimate of entropy.
Calculating entropy byte by byte as Nick Dandoulakis suggested gives a very poor estimate, because it assumes every byte is independent. In text files, for example, it is much more probable to have a small letter after a letter than a whitespace or punctuation after a letter, since words typically are longer than 2 characters. So probability of next character being in a-z range is correlated with value of previous character. Don't use Nick's rough estimate for any real data, use gzip compression ratio instead.

Resources