Compressing a vector of positive integers (int32) that have a specific order - algorithm

I'm trying to compress long vectors (their size ranges from 1 to 100 million elements). The vectors have positive integers with values ranging from 0 to 1 or 100 million (depending on the vector size). Hence, I'm using 32 bit integers to encompass the large numbers but that consumes too much storage.
The vectors have the following characteristic features:
1. All values are positive integers. Their range grows as the vector size grows.
2. Values are increasing but smaller numbers do intervene frequently (see the figure below).
3. None of the values before a specific index are larger than that index (Index starts at zero). For instance, none of the values that occur before the index of 6 are larger than 6. However, smaller values may repeat after that index. This holds true for the entire array.
4. I'm usually dealing with very long arrays. Hence, as the array length passes 1 million elements, the upcoming numbers are mostly large numbers mixed with previous reoccurring numbers. Shorter numbers usually re-occur more than larger numbers. New Larger numbers are added to the array as you pass through it.
Here is a sample of the values in the array: {initial padding..., 0, 1, 2, 3, 4, 5, 6, 4, 7, 4, 8, 9, 1, 10, ... later..., 1110, 11, 1597, 1545, 1392, 326, 1371, 1788, 541,...}
Here is a plot of a part of the vector:
What do I want? :
Using 32-bit integers wastes a lot of memory here, because small numbers that fit in fewer than 32 bits repeat frequently. I want to compress this vector as much as possible to save memory (ideally by a factor of 3, because only a reduction of that amount or more will meet our needs!). What is the best compression algorithm to achieve that? Or is there a way to take advantage of the array's characteristic features described above to reversibly convert the numbers in that array to 8-bit integers?
Things that I have tried or considered:
Delta encoding: This doesn't work here because the vector is not always increasing.
Huffman coding: Does not seem to help here since the range of unique numbers in the array is quite large, hence, the encoding table will be a large overhead.
Using variable Int encoding. i.e using 8 bit integers for smaller numbers and 16 bit for larger ones...etc. This has reduced the vector size to size*0.7 (not satisfactory since it doesn't take advantage of the specific characteristics described above)
I'm not quite sure if this method described in the following link is applicable to my data: http://ygdes.com/ddj-3r/ddj-3r_compact.html
I don't quite understand the method but it gives me the encouragement to try similar things because I think there is some order in the data that can be taken to its advantage.
For example, I tried to reassign any number(n) larger than 255 to n-255 so that I can keep the integers in 8 bit realm because I know that no number is larger than 255 before that index. However, I'm not able to distinguish the reassigned numbers with the repeated numbers... so this idea doesn't work unless doing some more tricks to reverse the re-assignments...
Here is the link to the first 24000 elements of the data for those interested:
data
Any advice or suggestions are deeply appreciated. Thanks a lot in advance.
Edit1:
Here is a plot of the data after delta encoding. As you can see, it doesn't reduce the range!
Edit2:
I was hoping that I could find a pattern in the data that allows me to reversibly change the 32-bit vector to a single 8-bit vector but this seems very unlikely.
I have tried to decompose the 32-bit vector to 4 x 8-bit vectors, hoping that the decomposed vectors lend themselves to compression better.
Below are plots for the 4 vectors. Now their ranges are from 0-255.
What I did was to recursively divide each element in the vectors by 255 and store the remainder into another vector. To reconstruct the original array all I need to do is: ( ( (vec4*255) + vec3 )*255 + vec2 ) *255 + vec1...
As you can see, the last vector is all zeros for the current shown length of the data.. in fact, this should be zeros all the way to 2^24th element. This will be a 25% reduction if my total vector length was less than 16 million elements but since I'm dealing with much longer vectors this has a much smaller impact.
More importantly, the third vector seems also to have some compressible features as its values do increase by 1 after each 65,535 steps.
It does seem that now I can benefit from Huffman coding or variable bit encoding as suggested. Any suggestions that allows me to maximally compress this data are deeply appreciated.
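A minimal sketch of the decomposition described above, under one assumption that differs from the text: it splits on byte boundaries (base 256, using shifts) rather than dividing by 255. With base 256 the third plane changes every 65,536 values and the fourth plane stays zero below 2^24. The names `Planes`, `split_planes` and `merge_planes` are illustrative only.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Four byte planes; planes[0] holds the lowest 8 bits of every value.
using Planes = std::array<std::vector<uint8_t>, 4>;

// Split each 32-bit value into its four bytes, one byte per plane.
Planes split_planes(const std::vector<uint32_t>& v) {
    Planes planes;
    for (auto& p : planes) p.reserve(v.size());
    for (uint32_t x : v)
        for (int b = 0; b < 4; b++)
            planes[b].push_back(static_cast<uint8_t>(x >> (8 * b)));
    return planes;
}

// Rebuild the original 32-bit values from the four planes.
std::vector<uint32_t> merge_planes(const Planes& planes) {
    std::vector<uint32_t> v(planes[0].size(), 0);
    for (std::size_t i = 0; i < v.size(); i++)
        for (int b = 0; b < 4; b++)
            v[i] |= static_cast<uint32_t>(planes[b][i]) << (8 * b);
    return v;
}
```

Each plane can then be handed to a separate byte-oriented compressor, and `merge_planes(split_planes(v))` returns `v` unchanged.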
Here I attached a bigger sample of the data if anyone is interested:
https://drive.google.com/file/d/10wO3-1j3NkQbaKTcr0nl55bOH9P-G1Uu/view?usp=sharing
Edit3:
I'm really thankful for all the given answers. I've learnt a lot from them. For those of you who are interested to tinker with a larger set of the data the following link has 11 million elements of a similar dataset (zipped 33MB)
https://drive.google.com/file/d/1Aohfu6II6OdN-CqnDll7DeHPgEDLMPjP/view
Once you unzip the data, you can use the following C++ snippet to read the data into a vector<int32_t>
const char* path = "path_to/compression_int32.txt";
std::vector<int32_t> newVector{};
std::ifstream ifs(path, std::ios::in | std::ifstream::binary);
std::istream_iterator<int32_t> iter{ ifs };
std::istream_iterator<int32_t> end{};
std::copy(iter, end, std::back_inserter(newVector));

It's easy to get better than a factor of two compression on your example data by using property 3, where I have taken property 3 to mean that every value must be less than its index, with the indices starting at 1. Simply use ceiling(log2(i)) bits to store the number at index i (where i starts at 1). For your first example with 24,977 values, that compresses it to 43% of the size of the vector using 32-bit integers.
The number of bits required depends only on the length of the vector, n. The number of bits is:
1 - 2^ceiling(log2(n)) + n * ceiling(log2(n))
As noted by Falk Hüffner, a simpler approach would be to use a fixed number of bits, ceiling(log2(n)), for all values. A variable number of bits will always be less than that, but not much less than that for large n.
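As a quick check of the bit counts above, here is a small sketch (not part of the original answer; all function names are illustrative) that sums ceiling(log2(i)) directly and compares it with the closed form and with the fixed-width alternative.

```cpp
#include <cstdint>

// Bits needed to hold any value in 0..i-1, i.e. ceiling(log2(i)) for i >= 1.
int ceil_log2(uint64_t i) {
    int b = 0;
    while ((uint64_t(1) << b) < i) b++;
    return b;
}

// Total bits when the value at 1-based index i is stored in ceil_log2(i) bits.
uint64_t variable_bits(uint64_t n) {
    uint64_t total = 0;
    for (uint64_t i = 1; i <= n; i++) total += ceil_log2(i);
    return total;
}

// Closed form of the same sum: 1 - 2^c + n*c, with c = ceiling(log2(n)).
uint64_t variable_bits_closed(uint64_t n) {
    uint64_t c = ceil_log2(n);
    return 1 + n * c - (uint64_t(1) << c);
}

// Total bits when every value uses the same width, ceiling(log2(n)).
uint64_t fixed_bits(uint64_t n) {
    return n * ceil_log2(n);
}
```

For n = 24,977 the variable-width total is 341,888 bits against 799,264 bits for 32-bit storage, which is the roughly 43% quoted above; the fixed-width total is 374,655 bits, about 47%.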
If it is common to have a run of zeros at the start, then compress those with a count. There are only a handful of runs of two or three numbers in the remainder, so run-length encoding won't help except for that initial run of zeros.
Another 2% or so (for large sets) could be shaved off using an arithmetic coding approach, considering each value at index k (indices starting at zero) to be a base k+1 digit of a very large integer. That would take ceiling(log2(n!)) bits.
Here is a plot of the compression ratios of the arithmetic coding, variable bits per sample coding, and fixed bits per sample coding, all ratioed to a representation with 32 bits for every sample (the sequence length is on a log scale):
The arithmetic approach requires multiplication and division on integers the length of the compressed data, which is monumentally slow for large vectors. The code below limits the size of the integers to 64 bits, at some cost to the compression ratio, in exchange for it being very fast. This code will give compression ratios about 0.2% to 0.7% more than arithmetic in the plot above, well below variable bits. The data vector must have the property that each value is non-negative and that each value is less than its position (positions starting at one). The compression effectiveness depends only on that property, plus a small reduction if there is an initial run of zeros. There appears to be a bit more redundancy in the provided examples that this compression approach does not exploit.
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>
// Append val, as a variable-length integer, to comp. val must be non-negative.
template <typename T>
void write_varint(T val, std::vector<uint8_t>& comp) {
    while (val > 0x7f) {
        comp.push_back(val & 0x7f);
        val >>= 7;
    }
    comp.push_back(val | 0x80);
}
// Return the variable-length integer at offset off in comp, updating off to
// point after the integer.
template <typename T>
T read_varint(std::vector<uint8_t> const& comp, size_t& off) {
    T val = 0, next;
    int shift = 0;
    for (;;) {
        next = comp.at(off++);
        if (next > 0x7f)
            break;
        val |= next << shift;
        shift += 7;
    }
    val |= (next & 0x7f) << shift;
    return val;
}
// Given the starting index i >= 1, find the optimal number of values to code
// into 64 bits or less, or up through index n-1, whichever comes first.
// Optimal is defined as the least amount of entropy lost by representing the
// group in an integral number of bits, divided by the number of bits. Return
// the optimal number of values in num, and the number of bits needed to hold
// an integer representing that group in len.
static void group_ar64(size_t i, size_t n, size_t& num, int& len) {
    // Analyze all of the permitted groups, starting at index i.
    double min = 1.;
    uint64_t k = 1;                 // integer range is 0..k-1
    auto j = i + 1;
    do {
        k *= j;
        auto e = log2(k);           // entropy of k possible integers
        int b = ceil(e);            // number of bits to hold 0..k-1
        auto loss = (b - e) / b;    // unused entropy per bit
        if (loss < min) {
            num = j - i;            // best number of values so far
            len = b;                // bit length for that number
            if (loss == 0.)
                break;              // not going to get any better
            min = loss;
        }
    } while (j < n && k <= (uint64_t)-1 / ++j);
}
// Compress the data arithmetically coded as an incrementing base integer, but
// with a 64-bit limit on each integer. This puts values into groups that each
// fit in 64 bits, with the least amount of wasted entropy. Also compress the
// initial run of zeros into a count.
template <typename T>
std::vector<uint8_t> compress_ar64(std::vector<T> const& data) {
    // Resulting compressed data vector.
    std::vector<uint8_t> comp;
    // Start with number of values to make the stream self-terminating.
    write_varint(data.size(), comp);
    if (data.size() == 0)
        return comp;
    // Run-length code the initial run of zeros. Write the number of contiguous
    // zeros after the first one.
    size_t i = 1;
    while (i < data.size() && data[i] == 0)
        i++;
    write_varint(i - 1, comp);
    // Compress the data into variable-base integers starting at index i, where
    // each integer fits into 64 bits.
    unsigned buf = 0;               // output bit buffer
    int bits = 0;                   // number of bits in buf (0..7)
    while (i < data.size()) {
        // Find the optimal number of values to code, starting at index i.
        size_t num; int len;
        group_ar64(i, data.size(), num, len);
        // Code num values.
        uint64_t code = 0;
        size_t k = 1;
        do {
            code += k * data[i++];
            k *= i;
        } while (--num);
        // Write code using len bits.
        if (bits) {
            comp.push_back(buf | (code << bits));
            code >>= 8 - bits;
            len -= 8 - bits;
        }
        while (len > 7) {
            comp.push_back(code);
            code >>= 8;
            len -= 8;
        }
        buf = code;
        bits = len;
    }
    if (bits)
        comp.push_back(buf);
    return comp;
}
// Decompress the result of compress_ar64(), returning the original values.
// Start decompression at offset off in comp. When done, off is updated to
// point just after the compressed data.
template <typename T>
std::vector<T> expand_ar64(std::vector<uint8_t> const& comp, size_t& off) {
    // Will contain the uncompressed data to return.
    std::vector<T> data;
    // Get the number of values.
    auto vals = read_varint<size_t>(comp, off);
    if (vals == 0)
        return data;
    // Get the number of zeros after the first one, and write all of them.
    auto run = read_varint<size_t>(comp, off) + 1;
    auto i = run;
    do {
        data.push_back(0);
    } while (--run);
    // Extract the values from the compressed data starting at index i.
    unsigned buf = 0;               // input bit buffer
    int bits = 0;                   // number of bits in buf (0..7)
    while (i < vals) {
        // Find the optimal number of values to code, starting at index i. This
        // simply repeats the same calculation that was done when compressing.
        size_t num; int len;
        group_ar64(i, vals, num, len);
        // Read len bits into code.
        uint64_t code = buf;
        while (bits + 8 < len) {
            code |= (uint64_t)comp.at(off++) << bits;
            bits += 8;
        }
        len -= bits;                    // bits to pull from last byte (1..8)
        uint64_t last = comp.at(off++); // last byte
        code |= (last & ((1 << len) - 1)) << bits;
        buf = last >> len;              // save remaining bits in buffer
        bits = 8 - len;
        // Extract num values from code.
        do {
            i++;
            data.push_back(code % i);
            code /= i;
        } while (--num);
    }
    // Return the uncompressed data.
    return data;
}

Solving every compression problem should begin with an analysis.
I looked at the raw data file containing the first 24976 values. The smallest value is 0 and the largest is 24950. The "slope" of the data is then around 1. However, It should decrease over time, if the maximum is, as told, only 33M#100M values. Assumption of slope=1 is then a bit pessimistic.
As for the distribution,
tr '[,]' '[\n]' <compression.txt | sort -n | uniq -c | sort -nr | head -n256
produces
164 0
131 8
111 1648
108 1342
104 725
103 11
91 1475
90 1446
82 21
82 1355
78 69
76 2
75 12
72 328
71 24
70 614
70 416
70 1608
70 1266
69 22
67 356
67 3
66 1444
65 19
65 1498
65 10
64 2056
64 16
64 1322
64 1182
63 249
63 1335
61 43
60 17
60 1469
59 33
59 3116
58 20
58 1201
57 303
55 5
55 4
55 2559
55 1324
54 1110
53 1984
53 1357
52 807
52 56
52 4321
52 2892
52 1
50 39
50 2475
49 1580
48 664
48 266
47 317
47 1255
46 981
46 37
46 3531
46 23
43 1923
43 1248
41 396
41 2349
40 7
39 6
39 54
39 4699
39 32
38 815
38 2006
38 194
38 1298
38 1284
37 44
37 1550
37 1369
37 1273
36 1343
35 61
35 3991
35 3606
35 1818
35 1701
34 836
34 27
34 264
34 241
34 1306
33 821
33 28
33 248
33 18
33 15
33 1017
32 9
32 68
32 53
32 240
32 1516
32 1474
32 1390
32 1312
32 1269
31 667
31 326
31 263
31 25
31 160
31 1253
30 3365
30 2082
30 18550
30 1185
30 1049
30 1018
29 73
29 487
29 48
29 4283
29 34
29 243
29 1605
29 1515
29 1470
29 1297
29 1183
28 980
28 60
28 302
28 242
28 1959
28 1779
28 161
27 811
27 51
27 36
27 201
27 1270
27 1267
26 979
26 50
26 40
26 3111
26 26
26 2425
26 1807
25 825
25 823
25 812
25 77
25 46
25 217
25 1842
25 1831
25 1534
25 1464
25 1321
24 730
24 66
24 59
24 427
24 355
24 1465
24 1299
24 1164
24 1111
23 941
23 892
23 7896
23 663
23 607
23 556
23 47
23 2887
23 251
23 1776
23 1583
23 1488
23 1349
23 1244
22 82
22 818
22 661
22 42
22 411
22 3337
22 3190
22 3028
22 30
22 2226
22 1861
22 1363
22 1301
22 1262
22 1158
21 74
21 49
21 41
21 376
21 354
21 2156
21 1688
21 162
21 1453
21 1067
21 1053
20 711
20 413
20 412
20 38
20 337
20 2020
20 1897
20 1814
20 17342
20 173
20 1256
20 1160
19 9169
19 83
19 679
19 4120
19 399
19 2306
19 2042
19 1885
19 163
19 1623
19 1380
18 805
18 79
18 70
18 6320
18 616
18 455
18 4381
18 4165
18 3761
18 35
18 2560
18 2004
18 1900
18 1670
18 1546
18 1291
18 1264
18 1181
17 824
17 8129
17 63
17 52
17 5138
as the most frequent 256 values.
It seems some values are inherently more common. When examined, those common values also seem to be distributed all over the data.
I propose the following:
Divide the data into blocks. For each block, send the actual value of the slope, so when coding each symbol we know its maximum value.
Code the common values in a block with statistical coding (Huffman etc.). In this case, the cutoff with an alphabet of 256 would be around 17 occurrences.
For less common values, we reserve a small part of the alphabet for sending the amount of bits in the value.
When we encounter a rare value, its bits are coded without statistical modeling. The topmost bit can be omitted, since we know it's always 1 (unless value is '0').
Usually the range of values to be coded is not a power of 2. For example, if we have 10 choices, a fixed-length code needs 4 bits, but that leaves 6 unused bit patterns, so sometimes 3 bits are enough. The first 6 choices are coded directly with 3 bits. The remaining two 3-bit patterns (the 7th and 8th) are followed by an extra bit that selects among the last 4 choices (7 to 10).
Additionally, we could exclude any value that is directly coded from the list of possible values. Otherwise we have two ways to code the same value, which is redundant.
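The trick for ranges that are not a power of 2 is commonly called truncated binary (or phase-in) coding. Below is a minimal sketch of it, assuming values are numbered 0..n-1; the function name `truncated_binary` is illustrative, and it returns the codeword as a string of '0'/'1' characters purely for readability.

```cpp
#include <cstdint>
#include <string>

// Encode x (0 <= x < n) as a truncated binary codeword.
// With k = floor(log2(n)) and u = 2^(k+1) - n, the first u values get
// k bits and the remaining n - u values get k + 1 bits.
std::string truncated_binary(uint32_t x, uint32_t n) {
    int k = 0;
    while ((uint64_t(1) << (k + 1)) <= n) k++;      // k = floor(log2(n))
    uint32_t u = static_cast<uint32_t>((uint64_t(1) << (k + 1)) - n);
    uint32_t code = x;
    int len = k;
    if (x >= u) {                                   // long codeword
        code = x + u;
        len = k + 1;
    }
    std::string s;
    for (int b = len - 1; b >= 0; b--)
        s += ((code >> b) & 1) ? '1' : '0';
    return s;
}
```

For n = 10 this gives 3-bit codewords 000 to 101 for the first six values and the 4-bit codewords 1100, 1101, 1110, 1111 for the last four, so no codeword is a prefix of another.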

As I suggested in my comment, you can represent your data as 8-bit values. There are simple ways to do it efficiently, with no need for modular arithmetic.
You can use a union or pointers for this, so for example in C++ if you have:
unsigned int data32[]={0,0,0,...};
unsigned char *data08=(unsigned char*)data32;
Or you can copy it to 4 BYTE arrays, but that will be slower.
If you have to use modular arithmetic for any reason, then you might want to do it like this:
x &255
(x>> 8)&255
(x>>16)&255
(x>>24)&255
Now I have tried LZW on your new data, and the compression ratio without any data reordering (single LZW) was 81-82% (depending on dictionary size; I suggest using a 10-bit LZW dictionary), which is not as good as expected. So I reordered the data into 4 arrays (just like you did), so the first array has the lowest 8 bits and the last the highest. The results with a 12-bit dictionary were:
ratio08: 144%
ratio08: 132%
ratio08: 26%
ratio08: 0%
total: 75%
The results with a 10-bit dictionary were:
ratio08: 123%
ratio08: 117%
ratio08: 28%
ratio08: 0%
total: 67%
This shows that LZW is bad for the lowest bytes (and with increasing size it will get worse for the higher bytes too). So use it only for the higher BYTEs, which would improve the compression ratio further.
However, I expect Huffman should lead to much better results, so I computed the entropy of your data:
H32 = 5.371071 , H/8 = 0.671384
H08 = 7.983666 , H/8 = 0.997958
H08 = 7.602564 , H/8 = 0.950321
H08 = 1.902525 , H/8 = 0.237816
H08 = 0.000000 , H/8 = 0.000000
total: 54%
This means a naive single Huffman encoding would have a compression ratio of 67%, while the 4 separate arrays would lead to 54%, which is much better. So in your case I would go for Huffman encoding. After I implemented it, here is the result:
[Huffman]
ratio08 = 99.992%
ratio08 = 95.400%
ratio08 = 24.706%
ratio08 = 0.000%
total08 = 55.025%
ratio32 = 67.592%
This closely matches the estimate from the Shannon entropy, as expected (not accounting for the decoding table).
However, with very big datasets I expect naive Huffman will start to get slightly better than the 4 separate Huffman encodings.
Also note that the results were truncated, so those 0% are not zero but something less than 1%.
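For reference, here is a minimal sketch of how a per-byte Shannon entropy like the H08 figures above can be computed. This is not the answerer's code, and the function name `byte_entropy` is illustrative. Dividing the result by 8 gives the estimated compressed/original ratio for an ideal order-0 coder on that byte stream.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Shannon entropy, in bits per symbol, of a sequence of bytes.
double byte_entropy(const std::vector<uint8_t>& bytes) {
    if (bytes.empty()) return 0.0;
    std::vector<std::size_t> count(256, 0);
    for (uint8_t b : bytes) count[b]++;             // histogram of byte values
    double h = 0.0;
    for (std::size_t c : count) {
        if (c == 0) continue;                       // 0 * log2(0) is taken as 0
        double p = static_cast<double>(c) / bytes.size();
        h -= p * std::log2(p);
    }
    return h;
}
```

Applied to each of the four byte arrays separately, the average of the four `byte_entropy(...) / 8` values corresponds to the "total" line in the entropy table.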
[Edit1] 300 000 000 entries estimation
To simulate the conditions for your 300M 32-bit numbers, I used a 16-bit sub-part of your data with similar "empty space" properties.
log2(300 000 000) = ~28
28/32 * 16 = 14
So I used only 2^14 16-bit numbers, which should have similar properties to your 300M 32-bit numbers. The 8-bit Huffman encoding leads to:
ratio08 = 97.980%
ratio08 = 59.534%
total08 = 78.757%
So I estimate 80% ratio between encoded/decoded sizes ~1.25 size reduction.
(Hope I did not screw something up with my assumptions).

The data you are dealing with is "nearly" sorted, so you can use that to great effect with delta encoding.
A simple approach is as follows:
Look for runs of data, denoted by R_i = (v,l,N) where l is the length of the run, N is the bit-depth needed to do delta encoding on the sorted run, and v is the value of the first element of the (sorted) run (needed for delta encoding.) The run itself then just needs to store 2 pieces of information for each entry in the run: the idx of each sorted element in the run and the delta. Note, to store the idx of each sorted element, only log_2(l) bits are needed per idx, where l is the length of the run.
The encoding works by attempting to find the least number of bits to fully encode the run when compared to the number of bytes used in its uncompressed form. In practice, this can be implemented by finding the longest run that is encoded for a fixed number of bytes per element.
To decode, simply decode run-by-run (in order) first decoding the delta coding/compression, then undoing the sort.
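To make the encode/decode steps concrete, here is a simplified round-trip sketch for a single run. It is an illustration under assumptions: the names `Run`, `encode_run` and `decode_run` are hypothetical, and the indices and deltas are kept in full-width integers rather than being bit-packed, so it shows the transformation but not the space saving.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

struct Run {
    int32_t first = 0;              // smallest value in the run (v)
    std::vector<uint32_t> idx;      // original position of each sorted element
    std::vector<uint32_t> delta;    // gap to the previous sorted value
};

// Sort the block, remember where each element came from, and delta-code it.
Run encode_run(const std::vector<int32_t>& block) {
    Run r;
    if (block.empty()) return r;
    r.idx.resize(block.size());
    std::iota(r.idx.begin(), r.idx.end(), 0u);
    std::stable_sort(r.idx.begin(), r.idx.end(),
                     [&](uint32_t a, uint32_t b) { return block[a] < block[b]; });
    r.first = block[r.idx[0]];
    r.delta.push_back(0);
    for (std::size_t i = 1; i < r.idx.size(); i++)
        r.delta.push_back(
            static_cast<uint32_t>(block[r.idx[i]] - block[r.idx[i - 1]]));
    return r;
}

// Undo the delta coding, then undo the sort by writing to the stored positions.
std::vector<int32_t> decode_run(const Run& r) {
    std::vector<int32_t> out(r.idx.size());
    int32_t v = r.first;
    for (std::size_t i = 0; i < r.idx.size(); i++) {
        v += static_cast<int32_t>(r.delta[i]);
        out[r.idx[i]] = v;
    }
    return out;
}
```

In a real encoder each `idx` entry would be packed into log_2(l) bits and each `delta` into the bit depth chosen for the run.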
Here is some C++ code that computes the compression ratio that can be obtained using this scheme on the data sample you posted. The implementation takes a greedy approach in selecting the runs, it is possible slightly better results are available if a smarter approach is used.
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <map>
#include <queue>
#include <utility>
#include <vector>
#include "data.h"
template <int _N, int _M> struct T {
    constexpr static int N = _N;
    constexpr static int M = _M;
    uint16_t idx : N;
    uint16_t delta : M;
};
template <int _N, int _M>
std::pair<int32_t, int32_t> best_packed_run_stats(size_t idx) {
    const int N = 1 << _N;
    const int M = 1 << _M;
    static std::vector<int32_t> buffer(N);
    if (idx + N >= data.size())
        return {-1, 0};
    std::copy(&data[idx], &data[idx + N], buffer.data());
    std::sort(buffer.begin(), buffer.end());
    int32_t run_len = 0;
    for (size_t i = 1; i < N; ++i, ++run_len) {
        auto delta = buffer[i] - buffer[i - 1];
        assert(delta >= 0);
        if (delta >= M) {
            break;
        }
    }
    int32_t savings = run_len * (sizeof(int32_t) - sizeof(T<_N, _M>)) -
                      1 // 1 byte to store bit-depth
                      - 2; // 2 bytes to store run length
    return {savings, run_len};
}
template <class... Args>
std::vector<std::pair<int32_t, int32_t>> all_runs_stats(size_t idx) {
    return {best_packed_run_stats<Args::N, Args::M>(idx)...};
}
int main() {
    size_t total_savings = 0;
    for (size_t i = 0; i < data.size(); ++i) {
        auto runs =
            all_runs_stats<T<2, 14>, T<4, 12>, T<8, 8>, T<12, 4>, T<14, 2>>(i);
        auto best_value = *std::max_element(runs.begin(), runs.end());
        total_savings += best_value.first;
        i += best_value.second;
    }
    size_t uncomp_size = data.size() * sizeof(int32_t);
    double comp_ratio =
        (uncomp_size - (double)total_savings) / (double)uncomp_size;
    printf("uncomp_size: %zu\n", uncomp_size);
    printf("compression: %lf\n", comp_ratio);
    printf("size: %zu\n", data.size());
}
Note, only certain fixed configurations of 16-bit representations of elements in a run are attempted. Because of this we should expect the best possible compression we can achieve is 50% (i.e. 4 bytes -> 2 bytes.) In reality, there is overhead.
This code, when run on the data sample you supplied, reports this compression ratio:
uncomp_size: 99908
compression: 0.505785
size: 24977
which is very close to the theoretical limit of .5 for this compression algorithm.
Also, note, that this slightly beats out the Shannon entropy estimate reported in another answer.
Edit to address Mark Adler's comment below.
Re-running this compression on the larger data-set provided (compression2.txt) along with comparing to Mark Adler's approach here are the results:
uncomp_size: 2602628
compression: 0.507544
size: 650657
bit compression: 0.574639
Where bit compression is the compression ratio of Mark Adler's approach. As noted by others, compressing the bits of each entry will not scale well for large data, we should expect the ratio to get worse with n.
Meanwhile the delta + sorting compression described above maintains close to its theoretical best of .5.

Related

Algorithm for visiting all grid cells in pseudo-random order that has a guaranteed uniformity at any stage

Context:
I have a hydraulic erosion algorithm that needs to receive an array of droplet starting positions. I also already have a pattern replicating algorithm, so I only need a good pattern to replicate.
The Requirements:
I need an algorithm that produces a set of n^2 entries, in the format (x,y) or [index], that describe cells in an n×n grid (where n = 2^i and i is any positive integer).
(as a set it means that every cell is mentioned in exactly one entry)
The pattern [created by the algorithm] should show little to no clustering of "visited" cells at any stage.
The cell (0,0) is as close to (n-1,n-1) as it is to (1,1); this relates to the definition of clustering
Note
I was/am trying to find solutions through fractal-like patterns built through recursion, but at the time of writing this, my solution is a lookup table of a checkerboard pattern(list of black cells + list of white cells) (which is bad, but yields fewer artifacts than an ordered list)
C, C++, C#, Java implementations (if any) are preferred
You can use a linear congruential generator to create an even distribution across your n×n space. For example, if you have a 64×64 grid, using a stride of 47 will create the pattern on the left below. (Run on jsbin) The cells are visited from light to dark.
That pattern does not cluster, but it is rather uniform. It uses a simple row-wide transformation where
k = (k + 47) mod (n * n)
x = k mod n
y = k div n
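The stride transformation above can be sketched in a few lines. This is an illustration rather than the code behind the jsbin links, and the function name `stride_order` is hypothetical. It assumes the stride is coprime with n * n (any odd stride works when n is a power of two), which is what guarantees that every cell is visited exactly once.

```cpp
#include <utility>
#include <vector>

// Visit all cells of an n x n grid by stepping a linear index k by a fixed
// stride modulo n * n. Returns the (x, y) cells in visiting order.
// Assumption: stride and n * n are coprime, so the walk is a full cycle.
std::vector<std::pair<int, int>> stride_order(int n, int stride) {
    std::vector<std::pair<int, int>> order;
    const long long cells = 1LL * n * n;
    order.reserve(static_cast<std::size_t>(cells));
    long long k = 0;
    for (long long step = 0; step < cells; step++) {
        order.emplace_back(static_cast<int>(k % n),   // x = k mod n
                           static_cast<int>(k / n));  // y = k div n
        k = (k + stride) % cells;
    }
    return order;
}
```

For example, `stride_order(64, 47)` starts at (0,0), then (47,0), then (30,1), and covers all 4096 cells before repeating.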
You can add a bit of randomness by making k the index of a space-filling curve such as the Hilbert curve. This will yield the pattern on the right. (Run on jsbin)
     
     
You can see the code in the jsbin links.
I have solved the problem myself and am just sharing my solution.
Here are my outputs for i between 0 and 3:
power: 0
ordering:
0
matrix visit order:
0
power: 1
ordering:
0 3 2 1
matrix visit order:
0 3
2 1
power: 2
ordering:
0 10 8 2 5 15 13 7 4 14 12 6 1 11 9 3
matrix visit order:
0 12 3 15
8 4 11 7
2 14 1 13
10 6 9 5
power: 3
ordering:
0 36 32 4 18 54 50 22 16 52 48 20 2 38 34 6
9 45 41 13 27 63 59 31 25 61 57 29 11 47 43 15
8 44 40 12 26 62 58 30 24 60 56 28 10 46 42 14
1 37 33 5 19 55 51 23 17 53 49 21 3 39 35 7
matrix visit order:
0 48 12 60 3 51 15 63
32 16 44 28 35 19 47 31
8 56 4 52 11 59 7 55
40 24 36 20 43 27 39 23
2 50 14 62 1 49 13 61
34 18 46 30 33 17 45 29
10 58 6 54 9 57 5 53
42 26 38 22 41 25 37 21
the code:
public static int[] GetPattern(int power, int maxReturnSize = int.MaxValue)
{
    int sideLength = 1 << power;
    int cellsNumber = sideLength * sideLength;
    int[] ret = new int[cellsNumber];
    for ( int i = 0 ; i < cellsNumber && i < maxReturnSize ; i++ ) {
        // this loop's body can be used for per-request computation
        int x = 0;
        int y = 0;
        for ( int p = power - 1 ; p >= 0 ; p-- ) {
            int temp = (i >> (p * 2)) % 4; // 2 bits of the index, starting from the beginning
            int a = temp % 2; // the first bit
            int b = temp >> 1; // the second bit
            x += a << power - 1 - p;
            y += (a ^ b) << power - 1 - p; // ^ is XOR
            // 00=>(0,0), 01 =>(1,1) 10 =>(0,1) 11 =>(1,0) scaled to 2^p where 0<=p
        }
        // to index
        int index = y * sideLength + x;
        ret[i] = index;
    }
    return ret;
}
I do admit that somewhere along the way the values got transposed, but it does not matter because of how it works.
After doing some optimization I came up with this loop body:
int x = 0;
int y = 0;
for ( int p = 0 ; p < power ; p++ ) {
    int temp = ( i >> ( p * 2 ) ) & 3;
    int a = temp & 1;
    int b = temp >> 1;
    x = ( x << 1 ) | a;
    y = ( y << 1 ) | ( a ^ b );
}
int index = y * sideLength + x;
(the code assumes that c# optimizer, IL2CPP, and CPP compiler will optimize variables temp, a, b out)

Understanding branch prediction efficiency

I tried to measure the cost of branch prediction, so I created a little program.
It creates a little buffer on the stack and fills it with random 0/1 values. I can set the size of the buffer with N. The code repeatedly causes branches for the same 1<<N random numbers.
I expected that if 1<<N is sufficiently large (like >100), the branch predictor would not be effective (as it has to predict >100 random numbers). However, these are the results (on a 5820k machine); as N grows, the program becomes slower:
N time
=========
8 2.2
9 2.2
10 2.2
11 2.2
12 2.3
13 4.6
14 9.5
15 11.6
16 12.7
20 12.9
For reference, if buffer is initialized with zeros (use the commented init), time is more-or-less constant, it varies between 1.5-1.7 for N 8..16.
My question is: can the branch predictor be effective at predicting such a large amount of random numbers? If not, then what's going on here?
(Some more explanation: the code executes 2^32 branches, regardless of N. So I expected the code to run at the same speed regardless of N, because the branch cannot be predicted at all. But it seems that if the buffer size is at most 4096 (N<=12), something makes the code fast. Can branch prediction be effective for 4096 random numbers?)
Here's the code:
#include <cstdint>
#include <iostream>
volatile uint64_t init[2] = { 314159165, 27182818 };
// volatile uint64_t init[2] = { 0, 0 };
volatile uint64_t one = 1;
uint64_t next(uint64_t s[2]) {
    uint64_t s1 = s[0];
    uint64_t s0 = s[1];
    uint64_t result = s0 + s1;
    s[0] = s0;
    s1 ^= s1 << 23;
    s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
    return result;
}
int main() {
    uint64_t s[2];
    s[0] = init[0];
    s[1] = init[1];
    uint64_t sum = 0;
#if 1
    const int N = 16;
    unsigned char buffer[1<<N];
    for (int i=0; i<1<<N; i++) buffer[i] = next(s)&1;
    for (uint64_t i=0; i<uint64_t(1)<<(32-N); i++) {
        for (int j=0; j<1<<N; j++) {
            if (buffer[j]) {
                sum += one;
            }
        }
    }
#else
    for (uint64_t i=0; i<uint64_t(1)<<32; i++) {
        if (next(s)&1) {
            sum += one;
        }
    }
#endif
    std::cout<<sum<<"\n";
}
(The code contains a non-buffered version as well, use #if 0. It runs around the same speed as the buffered version with N=16)
Here's the inner loop disassembly (compiled with clang. It generates the same code for all N between 8..16, only the loop count differs. Clang unrolled the loop twice):
401270: 80 3c 0c 00 cmp BYTE PTR [rsp+rcx*1],0x0
401274: 74 07 je 40127d <main+0xad>
401276: 48 03 35 e3 2d 00 00 add rsi,QWORD PTR [rip+0x2de3] # 404060 <one>
40127d: 80 7c 0c 01 00 cmp BYTE PTR [rsp+rcx*1+0x1],0x0
401282: 74 07 je 40128b <main+0xbb>
401284: 48 03 35 d5 2d 00 00 add rsi,QWORD PTR [rip+0x2dd5] # 404060 <one>
40128b: 48 83 c1 02 add rcx,0x2
40128f: 48 81 f9 00 00 01 00 cmp rcx,0x10000
401296: 75 d8 jne 401270 <main+0xa0>
Branch prediction can be that effective. As Peter Cordes suggests, I've checked branch-misses with perf stat. Here are the results:
N time cycles branch-misses (%) approx-time
===============================================================
8 2.2 9,084,889,375 34,806 ( 0.00) 2.2
9 2.2 9,212,112,830 39,725 ( 0.00) 2.2
10 2.2 9,264,903,090 2,394,253 ( 0.06) 2.2
11 2.2 9,415,103,000 8,102,360 ( 0.19) 2.2
12 2.3 9,876,827,586 27,169,271 ( 0.63) 2.3
13 4.6 19,572,398,825 486,814,972 (11.33) 4.6
14 9.5 39,813,380,461 1,473,662,853 (34.31) 9.5
15 11.6 49,079,798,916 1,915,930,302 (44.61) 11.7
16 12.7 53,216,900,532 2,113,177,105 (49.20) 12.7
20 12.9 54,317,444,104 2,149,928,923 (50.06) 12.9
Note: branch-misses (%) is calculated for 2^32 branches
As you can see, when N<=12, the branch predictor can predict most of the branches (which is surprising: the branch predictor can memorize the outcome of 4096 consecutive random branches!). When N>12, branch-misses start to grow. At N>=16, it can only predict ~50% correctly, which means it is as effective as random coin flips.
The time taken can be approximated from the time and branch-misses (%) columns: I've added the last column, approx-time. The values in that column are reproduced by 2.2+(12.9-2.2)*branch-misses %/50.06, i.e. by interpolating between the N=8 time (2.2, almost no misses) and the N=20 time (12.9, at a 50.06% miss rate). As you can see, approx-time equals time (up to rounding error). So this effect can be explained perfectly by branch prediction.
The original intent was to calculate how many cycles a branch-miss costs (in this particular case - as for other cases this number can differ):
(54,317,444,104-9,084,889,375)/(2,149,928,923-34,806) = 21.039 = ~21 cycles.

How do I make this program work for input >10 for the USACO Training Pages Square Palindromes?

Problem Statement -
Given a number base B (2 <= B <= 20 base 10), print all the integers N (1 <= N <= 300 base 10) such that the square of N is palindromic when expressed in base B; also print the value of that palindromic square. Use the letters 'A', 'B', and so on to represent the digits 10, 11, and so on.
Print both the number and its square in base B.
INPUT FORMAT
A single line with B, the base (specified in base 10).
SAMPLE INPUT
10
OUTPUT FORMAT
Lines with two integers represented in base B. The first integer is the number whose square is palindromic; the second integer is the square itself. NOTE WELL THAT BOTH INTEGERS ARE IN BASE B!
SAMPLE OUTPUT
1 1
2 4
3 9
11 121
22 484
26 676
101 10201
111 12321
121 14641
202 40804
212 44944
264 69696
My code works for all inputs <=10, however, gives me some weird output for inputs >10.
My Code-
#include<iostream>
#include<cstdio>
#include<cmath>
using namespace std;
int baseToBase(int num, int base) //accepts a number in base 10 and the base to be converted into as arguments
{
    int result=0, temp=0, i=1;
    while(num>0)
    {
        result = result + (num%base)*pow(10, i);
        i++;
        num = num/base;
    }
    result/=10;
    return result;
}
long long int isPalin(int n, int base) //checks the palindrome
{
    long long int result=0, temp, num=n*n, x=n*n;
    num = baseToBase(num, base);
    x = baseToBase(x, base);
    while(num)
    {
        temp=num%10;
        result = result*10 + temp;
        num/=10;
    }
    if(x==result)
        return x;
    else
        return 0;
}
int main()
{
    int base, i, temp;
    long long int sq;
    cin >> base;
    for(i=1; i<=300; i++)
    {
        temp=baseToBase(i, base);
        sq=isPalin(i, base);
        if(sq!=0)
            cout << temp << " " << sq << endl;
    }
    return 0;
}
For input = 11, the answer should be
1 1
2 4
3 9
6 33
11 121
22 484
24 565
66 3993
77 5335
101 10201
111 12321
121 14641
202 40804
212 44944
234 53535
While my answer is
1 1
2 4
3 9
6 33
11 121
22 484
24 565
66 3993
77 5335
110 10901
101 10201
111 12321
121 14641
209 40304
202 40804
212 44944
227 50205
234 53535
There is a difference in my output and the required one as 202 shows under 209 and 110 shows up before 101.
Help appreciated, thanks!
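A simple example with B = 11 shows the error in your base conversion: for i = 10, temp should be A, but your code calculates temp = 10. The ten symbols 0-9 are only enough to write numbers in base 10 or lower; for larger bases you have to use other symbols, such as 'A', 'B', and so on, to represent the extra digits, as the problem description clearly states. Because your function packs the base-B digits into a decimal int, any digit of 10 or more spills into the next decimal place. That is how 110 10901 ends up in your output: it is really AA A901, which is not a palindrome. You should be able to fix your code now by modifying your int baseToBase(int num, int base) function, for example so that it returns the digits as a string.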
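Here is one way the fix could look, as a minimal sketch rather than the only possible solution. It assumes the digits are kept in a std::string instead of being packed into an int; the names toBase, isPalindrome and squarePalindromes are my own, not part of the original program.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Convert a non-negative integer to its representation in `base`,
// using '0'-'9' and then 'A', 'B', ... for digits 10 and up.
std::string toBase(int num, int base) {
    assert(num >= 0 && base >= 2 && base <= 36);
    const std::string digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    if (num == 0) return "0";
    std::string out;
    while (num > 0) {
        out.push_back(digits[num % base]);  // least significant digit first
        num /= base;
    }
    std::reverse(out.begin(), out.end());   // most significant digit first
    return out;
}

// A string is a palindrome if it reads the same in both directions.
bool isPalindrome(const std::string& s) {
    return std::equal(s.begin(), s.end(), s.rbegin());
}

// All pairs (N, N*N), both written in `base`, for 1 <= N <= 300
// whose square is palindromic in that base.
std::vector<std::pair<std::string, std::string>> squarePalindromes(int base) {
    std::vector<std::pair<std::string, std::string>> result;
    for (int n = 1; n <= 300; ++n) {
        std::string sq = toBase(n * n, base);
        if (isPalindrome(sq)) result.emplace_back(toBase(n, base), sq);
    }
    return result;
}
```

With this version toBase(10, 11) returns "A", so a digit can never leak into its neighbour. In main you would read the base with cin, loop over squarePalindromes(base), and print first and second of each pair separated by a space.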

Algorithm to find a number that meets a gt (greater than condition) the fastest

I have to check for the tipping point that a number causes a type of overflow.
If we assume for example that the overflow number is 98, then a very inefficient way of doing that would be to start at 1 and increment 1 at a time. This would take 98 comparisons.
I punched out a better way of doing this as so
What it basically does is change the check to the next power of two after a known failing condition. For example, we know that 0 fails, so we start checking at 1, then 2,4,8,...,128. 128 passes, so we check 64+1,64+2,64+4,...,64+32, which passes, but we know that 64+16 failed, so we start the next round at 1+(64+16)===1+80. Here's a visual:
1 1
2 2
3 4
4 8
5 16
6 32
7 64
8 128 ->
9 1, 64 // 1 + 64
10 2, 64
11 4, 64
12 8, 64
13 16, 64
14 32, 64 ->
15 1, 80
16 2, 80
17 4, 80
18 8, 80
19 16, 80
20 32, 80 ->
21 1, 96
22 2, 96 // done
Is there some better way of doing this?
If you do not know the max number, I think your initial approach is a good way to find the MIN=64, MAX=128 range. Doing a binary search AFTER you find a min/max will be most efficient (e.g., look at 96; if it causes overflow, then you know the range is MIN=64, MAX=96). Because you keep halving the range at each step, you will find the solution faster.
Since 98 was your answer, here is how it would pan out with a binary search. This takes 13 steps instead of 22:
// your initial approach
1 1
2 2
3 4
4 8
5 16
6 32
7 64
8 128 ->
// range found, so start binary search
9 (64,128) -> 96
10 (96,128) -> 112
11 (96,112) -> 104
12 (96,104) -> 100
13 (96,100) -> 98 // done
// you may need to do step 14 here to validate that 97 does not cause overflow
// -- depends on your exact requirement
If you know that the "overflow function" is monotonically increasing, you can keep doubling until you go over, and then apply the classic binary search algorithm. This would give you the following search sequence:
1
2
4
8
16
32
64
128 -> over - we have the ends of our range
Run the binary search in [64..128] range
64..128, mid = 96
96..128, mid = 112
96..112, mid = 104
96..104, mid = 100
96..100, mid = 98
96..98, mid = 97
97 - no overflow ==> 98 is the answer
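The doubling phase followed by the binary search can be sketched in C++ as below. This is an illustration under the same assumption as above (the overflow test is monotonic and eventually true); the name firstOverflow and the callback parameter overflows are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Returns the smallest n >= 1 for which overflows(n) is true.
// Assumes overflows is monotonic: false up to some point, true from there on.
std::uint64_t firstOverflow(const std::function<bool(std::uint64_t)>& overflows) {
    std::uint64_t lo = 0;  // largest value known not to overflow (0 is assumed safe)
    std::uint64_t hi = 1;  // next candidate to test
    // Phase 1: keep doubling until the test goes over.
    while (!overflows(hi)) {
        assert(hi <= UINT64_MAX / 2);  // guard against wrapping around
        lo = hi;
        hi *= 2;
    }
    // Phase 2: binary search; lo never overflows, hi always does.
    while (hi - lo > 1) {
        std::uint64_t mid = lo + (hi - lo) / 2;
        if (overflows(mid)) {
            hi = mid;
        } else {
            lo = mid;
        }
    }
    return hi;
}
```

For a threshold of 98 this tests 1, 2, 4, ..., 128 and then 96, 112, 104, 100, 98, 97, which is the same sequence as the trace above.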
Here's how I implemented this technique in JavaScript:
function findGreatest(shouldPassCallback) {
function findRange(knownGood, test) {
if (!shouldPassCallback(test)) {
return [knownGood, test];
} else {
return findRange(test, test * 2);
}
}
function binarySearchCompare(min, max) {
if (min > max) {
throw 'Huh?';
}
if (min === max) { return shouldPassCallback(min) ? min : min - 1; }
if (max - min === 1) { return shouldPassCallback(max) ? max : min }
var mid = ~~((min + max) / 2);
if (shouldPassCallback(mid)) {
return binarySearchCompare(mid, max);
} else {
return binarySearchCompare(min, mid);
}
}
var range = findRange(0, 1);
return binarySearchCompare(range[0], range[1]);
}

How can I fairly choose an item from a list?

Let's say that I have a list of prizes:
PrizeA
PrizeB
PrizeC
And, for each of them, I want to draw a winner from a list of my attendees.
Given that my attendee list is as follows:
user1, user2, user3, user4, user5
What is an unbiased way to choose a user from that list?
Clearly, I will be using a cryptographically secure pseudo-random number generator, but how do I avoid a bias towards the front of the list? I assume I will not be using modulus?
EDIT
So, here is what I came up with:
class SecureRandom
{
private RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
private ulong NextUlong()
{
byte[] data = new byte[8];
rng.GetBytes(data);
return BitConverter.ToUInt64(data, 0);
}
public int Next()
{
return (int)(NextUlong() % (ulong)int.MaxValue);
}
public int Next(int maxValue)
{
if (maxValue < 0)
{
throw new ArgumentOutOfRangeException("maxValue");
}
if (maxValue == 0)
{
return 0;
}
ulong chop = ulong.MaxValue - (ulong.MaxValue % (ulong)maxValue);
ulong rand;
do
{
rand = NextUlong();
} while (rand >= chop);
return (int)(rand % (ulong)maxValue);
}
}
BEWARE:
Next() Returns an int in the range [0, int.MaxValue), because the value is reduced modulo int.MaxValue
Next(int.MaxValue) Returns an int in the range [0, int.MaxValue)
Pseudocode for special random number generator:
rng is a random number generator that produces uniform integers from [0, max)
compute m = max modulo length of attendee list
do {
draw a random number r from rng
} while(r >= max - m)
return r modulo length of attendee list
This eliminates the bias to the front part of the list. Then
put the attendees in some data structure indexable by integers
for every prize in the prize list
draw a random number r using above
compute index = r modulo length of attendee list
return the attendee at index
In C#:
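public static int NextUnbiased(Random rg, int max) {
// Largest multiple of max that fits in the range of rg.Next()
int limit = Int32.MaxValue - (Int32.MaxValue % max);
int r; // declared outside the loop so the while condition can see it
do {
r = rg.Next();
} while(r >= limit);
return r % max;
}
public static Attendee SelectWinner(IList<Attendee> attendees, Random rg) {
// IList<T> exposes Count, not Length
int winningAttendeeIndex = NextUnbiased(rg, attendees.Count);
return attendees[winningAttendeeIndex];
}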
Then:
// attendees is list of attendees
// rg is Random
foreach(Prize prize in prizes) {
Attendee winner = SelectWinner(attendees, rg);
Console.WriteLine("Prize {0} won by {1}", prize.ToString(), winner.ToString());
}
Assuming a fairly distributed random number generator...
do {
i = rand();
} while (i >= RAND_MAX / 5 * 5);
i /= RAND_MAX / 5; // slot index 0..4
This gives each of 5 slots
[ 0 .. RAND_MAX / 5 )
[ RAND_MAX / 5 .. RAND_MAX / 5 * 2 )
[ RAND_MAX / 5 * 2 .. RAND_MAX / 5 * 3 )
[ RAND_MAX / 5 * 3 .. RAND_MAX / 5 * 4 )
[ RAND_MAX / 5 * 4 .. RAND_MAX / 5 * 5 )
and discards a roll which falls out of range.
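The same discard-and-retry idea can be written for an arbitrary list length. The sketch below is only illustrative: unbiasedIndex is a made-up name, and std::mt19937_64 stands in for whatever generator you actually use (it is not cryptographically secure).

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Uniform index in [0, n) drawn from a 64-bit engine by rejection sampling.
std::uint64_t unbiasedIndex(std::mt19937_64& rng, std::uint64_t n) {
    assert(n > 0);
    // Largest multiple of n not exceeding 2^64 - 1. Rolls at or above it are
    // discarded, so every residue is backed by the same number of raw values.
    const std::uint64_t limit = UINT64_MAX - (UINT64_MAX % n);
    std::uint64_t r;
    do {
        r = rng();
    } while (r >= limit);
    return r % n;
}
```

If you are happy to rely on the standard library, std::uniform_int_distribution already produces uniformly distributed integers over a range, so a hand-written loop like this is mainly useful for seeing how the rejection works.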
You have already seen several perfectly good answers that depend on knowing the length of the list in advance.
To fairly select a single item from a list without needing to know the length of the list in the first place do this:
if (list.empty()) error_out_somehow
r=list.first() // r is a reference or pointer
s=list.first() // so is s
i = 2
while (r.next() is not NULL)
r=r.next()
if (random(i)==0) s=r // random(i) returns a uniformly
// drawn integer in [0, i), i.e. 0 .. i-1
i++
return s
(Useful if your list is stored as a linked list)
To distribute prizes in this scenario, just walk down the list of prizes selecting a random winner for each one. (If you want to prevent double winning you then remove the winner from the participant list.)
Why does it work?
You start with the first item at 1/1
On the next pass, you select the second item half the time (1/2), which means that the first item has probability 1 * (2-1)/2 = 1/2
on further iteration, you select the nth item with probability 1/n, and the chance for each previous item is reduced by a factor of (n-1)/n
which means that when you come to the end, the chance of having the mth item in the list (of n items) is
1/m * m/(m+1) * (m+1)/(m+2) * ... * (n-2)/(n-1) * (n-1)/n = 1/n
and is the same for every item.
If you are paying attention, you'll note that this means walking the whole list every time you want to select an item from the list, so this is not maximally efficient for (say) reordering the whole list (though it does that fairly).
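For reference, the single-pass selection described above might look like this in C++. It is a sketch under my own assumptions: the list is a std::forward_list, the generator is a std::mt19937, and pickOne is a hypothetical name.

```cpp
#include <cassert>
#include <forward_list>
#include <iterator>
#include <random>

// Pick one element uniformly from a singly linked list in a single pass,
// without knowing its length in advance.
template <typename T>
const T& pickOne(const std::forward_list<T>& list, std::mt19937& rng) {
    assert(!list.empty());
    auto chosen = list.begin();  // the first item starts with probability 1/1
    int seen = 1;                // number of items visited so far
    for (auto it = std::next(list.begin()); it != list.end(); ++it) {
        ++seen;
        // Replace the current choice with probability 1/seen.
        std::uniform_int_distribution<int> dist(0, seen - 1);
        if (dist(rng) == 0) chosen = it;
    }
    return *chosen;
}
```

Each call walks the whole list once, which matches the cost noted above.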
I suppose one answer would be to assign each item a random value, and take the largest or smallest, drilling down as necessary.
I'm not sure if this is the most efficient, tho...
If you're using a good number generator, even with a modulus your bias will be minuscule. If, for instance, you're using a random number generator with 64 bits of entropy and five users, your bias toward the front of the array should be on the order of 3x10^-19 (my numbers may be off, but I don't think by much). That's an extra 3-in-10-quintillion likelihood of the first user winning compared to the later users. That should be good enough to be fair in anyone's book.
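The size of that bias can be checked with a little integer arithmetic. The sketch below assumes the generator returns every 64-bit value with equal probability; countResidue and relativeBias are names I made up for the illustration.

```cpp
#include <cassert>
#include <cstdint>

// How many of the 2^64 possible 64-bit values v satisfy v % n == k.
std::uint64_t countResidue(std::uint64_t n, std::uint64_t k) {
    assert(n > 1 && k < n);
    // The values are k, k + n, k + 2n, ... up to at most 2^64 - 1.
    return (UINT64_MAX - k) / n + 1;
}

// Relative excess of the most favoured slot (0) over the least favoured (n - 1).
double relativeBias(std::uint64_t n) {
    std::uint64_t most = countResidue(n, 0);
    std::uint64_t least = countResidue(n, n - 1);
    return static_cast<double>(most - least) / static_cast<double>(least);
}
```

For five users, slot 0 is backed by exactly one more raw value than the other slots, and relativeBias(5) comes out at about 2.7x10^-19, in line with the estimate above.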
You can buy truly random bits from a provider, or use a mechanical device.
Here you will find Oleg Kiselyov's discussion of purely functional random shuffling.
A description of the linked content (quoted from the beginning of that article):
This article will give two pure functional programs that perfectly,
randomly and uniformly shuffle a sequence of arbitrary elements. We
prove that the algorithms are correct. The algorithms are implemented
in Haskell and can trivially be re-written into other (functional)
languages. We also discuss why a commonly used sort-based shuffle
algorithm falls short of perfect shuffling.
You could use that to shuffle your list and then pick the first item of the shuffled result (or, if you'd prefer not to give two prizes to the same person, use the first n positions of the result, where n is the number of prizes); or you could simplify the algorithm to just produce the first item; or you could take a look around that site, because I could have sworn there's an article on picking one random element from an arbitrary tree-like structure with uniform distribution, in a purely functional way, proof of correctness provided, but my search-fu is failing me and I can't seem to find it.
Without truly random bits, you will always have some bias. The number of ways to assign prizes to guests is much larger than any common PRNG's period for even a fairly low number of guests and prizes. As suggested by lpthnc, buy some truly random bits, or buy some random-bit-generating hardware.
As for the algorithm, just do a random shuffle of the guest list. Be careful, as naive shuffling algorithms do have a bias: http://en.wikipedia.org/wiki/Shuffling#Shuffling_algorithms
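As an illustration of a shuffle without that bias, here is a sketch that runs only the first few steps of a Fisher-Yates shuffle, which is enough to hand out each prize to a different attendee. The function name drawWinners is hypothetical, and std::mt19937 is a placeholder for whatever source of random bits you settle on.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Draw `prizes` distinct winners using the first steps of a Fisher-Yates shuffle.
// The attendee list is taken by value, so the caller's copy is left untouched.
std::vector<std::string> drawWinners(std::vector<std::string> attendees,
                                     std::size_t prizes, std::mt19937& rng) {
    assert(prizes <= attendees.size());
    for (std::size_t i = 0; i < prizes; ++i) {
        // Choose uniformly among the attendees not yet placed: positions i..size-1.
        std::uniform_int_distribution<std::size_t> dist(i, attendees.size() - 1);
        std::swap(attendees[i], attendees[dist(rng)]);
    }
    attendees.resize(prizes);  // keep only the winners, in prize order
    return attendees;
}
```

The important detail is that position i is swapped with a position chosen from i onward, not from the whole list; choosing from the whole list every time is the naive variant that is biased. For a full shuffle, std::shuffle in <algorithm> does the same job.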
You can 100% reliably pick a random item from any arbitrary list with a single pass and without knowing how many items are in the list ahead of time.
Pseudocode:
count = 0.0;
item_selected = none;
foreach item in list
count = count + 1.0;
chance = 1.0 / count;
if ( random( 1.0 ) <= chance ) then item_selected = item;
Test program comparing results of a single rand() % N vs iterating as above:
#include <stdio.h>
#include <stdlib.h>
#include <string.h> // memset
// Uniform float in [0, 1]
static inline float frand01()
{
return (float)rand() / (float)RAND_MAX;
}
int main()
{
static const int NUM_ITEMS = 50;
int resultRand[NUM_ITEMS];
int resultIterate[NUM_ITEMS];
memset( resultRand, 0, NUM_ITEMS * sizeof(int) );
memset( resultIterate, 0, NUM_ITEMS * sizeof(int) );
for ( int i = 0; i < 100000; i++ )
{
int choiceRand = rand() % NUM_ITEMS;
int choiceIterate = 0;
float count = 0.0;
for ( int item = 0; item < NUM_ITEMS; item++ )
{
count = count + 1.0f;
float chance = 1.0f / count;
if ( frand01() <= chance )
{
choiceIterate = item;
}
}
resultRand[choiceRand]++;
resultIterate[choiceIterate]++;
}
printf("Results:\n");
for ( int i = 0; i < NUM_ITEMS; i++ )
{
printf( "%02d - %5d %5d\n", i, resultRand[i], resultIterate[i] );
}
return 0;
}
Output:
Results:
00 - 2037 2050
01 - 2038 2009
02 - 2094 1986
03 - 2007 1953
04 - 1990 2142
05 - 1867 1962
06 - 1941 1997
07 - 2023 1967
08 - 1998 2070
09 - 1930 1953
10 - 1972 1900
11 - 2013 1985
12 - 1982 2001
13 - 1955 2063
14 - 1952 2022
15 - 1955 1976
16 - 2000 2044
17 - 1976 1997
18 - 2117 1887
19 - 1978 2020
20 - 1886 1934
21 - 1982 2065
22 - 1978 1948
23 - 2039 1894
24 - 1946 2010
25 - 1983 1927
26 - 1965 1927
27 - 2052 1964
28 - 2026 2021
29 - 2090 1993
30 - 2039 2016
31 - 2030 2009
32 - 1970 2094
33 - 2036 2048
34 - 2020 2046
35 - 2010 1998
36 - 2104 2041
37 - 2115 2019
38 - 1959 1986
39 - 1998 2031
40 - 2041 1977
41 - 1937 2060
42 - 1946 2048
43 - 2014 1986
44 - 1979 2072
45 - 2060 2002
46 - 2046 1913
47 - 1995 1970
48 - 1959 2020
49 - 1970 1997

Resources