How can I fairly choose an item from a list? - algorithm

Let's say that I have a list of prizes:
PrizeA
PrizeB
PrizeC
And, for each of them, I want to draw a winner from a list of my attendees.
Given that my attendee list is as follows:
user1, user2, user3, user4, user5
What is an unbiased way to choose a user from that list?
Clearly, I will be using a cryptographically secure pseudo-random number generator, but how do I avoid a bias towards the front of the list? I assume I will not be using modulus?
EDIT
So, here is what I came up with:
class SecureRandom
{
private RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
private ulong NextUlong()
{
byte[] data = new byte[8];
rng.GetBytes(data);
return BitConverter.ToUInt64(data, 0);
}
public int Next()
{
return (int)(NextUlong() % (ulong)int.MaxValue);
}
public int Next(int maxValue)
{
if (maxValue < 0)
{
throw new ArgumentOutOfRangeException("maxValue");
}
if (maxValue == 0)
{
return 0;
}
ulong chop = ulong.MaxValue - (ulong.MaxValue % (ulong)maxValue);
ulong rand;
do
{
rand = NextUlong();
} while (rand >= chop);
return (int)(rand % (ulong)maxValue);
}
}
BEWARE:
Next() Returns an int in the range [0, int.MaxValue]
Next(int.MaxValue) Returns an int in the range [0, int.MaxValue)
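For context, here is a minimal sketch of how the class above could be used to draw one winner per prize. The prize and attendee lists here are just placeholders, not part of the original question:
// Hypothetical usage of the SecureRandom class above (illustrative names only).
var attendees = new List<string> { "user1", "user2", "user3", "user4", "user5" };
var prizes = new List<string> { "PrizeA", "PrizeB", "PrizeC" };
var rng = new SecureRandom();
foreach (var prize in prizes)
{
    int index = rng.Next(attendees.Count); // unbiased index in [0, Count)
    Console.WriteLine("{0} won by {1}", prize, attendees[index]);
}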

Pseudocode for special random number generator:
rng is a random number generator that produces uniform integers from [0, max)
compute m = max modulo length of attendee list
do {
draw a random number r from rng
} while(r >= max - m)
return r modulo length of attendee list
This eliminates the bias to the front part of the list. Then
put the attendees in some data structure indexable by integers
for every prize in the prize list
draw a random number r using above
compute index = r modulo length of attendee list
return the attendee at index
In C#:
public int NextUnbiased(Random rg, int max) {
    int chop = Int32.MaxValue - (Int32.MaxValue % max);
    int r;
    do {
        r = rg.Next();
    } while (r >= chop);
    return r % max;
}
public Attendee SelectWinner(IList<Attendee> attendees, Random rg) {
    int winningAttendeeIndex = NextUnbiased(rg, attendees.Count);
    return attendees[winningAttendeeIndex];
}
Then:
// attendees is list of attendees
// rg is Random
foreach(Prize prize in prizes) {
Attendee winner = SelectWinner(attendees, rg);
Console.WriteLine("Prize {0} won by {1}", prize.ToString(), winner.ToString());
}

Assuming a fairly distributed random number generator...
do {
i = rand();
} while (i >= RAND_MAX / 5 * 5);
i /= 5;
This gives each of 5 slots
[ 0 .. RAND_MAX / 5 )
[ RAND_MAX / 5 .. RAND_MAX / 5 * 2 )
[ RAND_MAX / 5 * 2 .. RAND_MAX / 5 * 3 )
[ RAND_MAX / 5 * 3 .. RAND_MAX / 5 * 4 )
[ RAND_MAX / 5 * 4 .. RAND_MAX / 5 * 5 )
and discards a roll which falls out of range.

You have already seen several perfectly good answers that depend on knowing the length of the list in advance.
To fairly select a single item from a list without needing to know the length of the list in the first place do this:
if (list.empty()) error_out_somehow
r=list.first() // r is a reference or pointer
s=list.first() // so is s
i = 2
while (r.next() is not NULL)
r=r.next()
if (random(i)==0) s=r // random(i) returns a uniformly
// drawn integer in [0, i), i.e. 0 .. i-1
i++
return s
(Useful if your list is stored as a linked list.)
To distribute prizes in this scenario, just walk down the list of prizes selecting a random winner for each one. (If you want to prevent double winning you then remove the winner from the participant list.)
Why does it work?
You start with the first item at 1/1
On the next pass, you select the second item half the time (1/2), which means that the first item has probability 1 * (2-1)/2 = 1/2
on further iteration, you select the nth item with probability 1/n, and the chance for each previous item is reduced by a factor of (n-1)/n
which means that when you come to the end, the chance of having the mth item in the list (of n items) is
1/m * m/(m+1) * (m+1)/(m+2) * ... * (n-2)/(n-1) * (n-1)/n = 1/n
and is the same for every item.
If you are paying attention, you'll note that this means walking the whole list every time you want to select an item from the list, so this is not maximally efficient for (say) reordering the whole list (though it does that fairly).
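For readers who prefer C# to pseudocode, here is a minimal sketch of the same single-pass selection. It is an illustration under the assumption of some IEnumerable<T> source and a trusted Random-like generator, not code from the original answer:
// Single-pass selection: each item ends up chosen with probability 1/n,
// without knowing n in advance.
static T PickOne<T>(IEnumerable<T> items, Random rng)
{
    T selected = default(T);
    int seen = 0;
    foreach (var item in items)
    {
        seen++;
        if (rng.Next(seen) == 0) // true with probability 1/seen
            selected = item;
    }
    if (seen == 0)
        throw new InvalidOperationException("empty list");
    return selected;
}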

I suppose one answer would be to assign each item a random value, and take the largest or smallest, drilling down as necessary.
I'm not sure if this is the most efficient, though...

If you're using a good number generator, even with a modulus your bias will be minuscule. If, for instance, you're using a random number generator with 64 bits of entropy and five users, your bias toward the front of the array should be on the order of 3x10^-19 (my numbers may be off, but I don't think by much). That's an extra 3-in-10-quintillion likelihood of the first user winning compared to the later users. That should be good enough to be fair in anyone's book.
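As a rough check on that figure (my arithmetic, not the original poster's): 2^64 mod 5 = 1, so a plain modulus maps exactly one extra 64-bit value onto the first user. The excess probability is 1/2^64 ≈ 5.4x10^-20, a relative bias of about 5/2^64 ≈ 2.7x10^-19, which is indeed on the order of 3x10^-19.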

You can buy truly random bits from a provider, or use a mechanical device.

Here you will find Oleg Kiselyov's discussion of purely functional random shuffling.
A description of the linked content (quoted from the beginning of that article):
This article will give two pure functional programs that perfectly,
randomly and uniformly shuffle a sequence of arbitrary elements. We
prove that the algorithms are correct. The algorithms are implemented
in Haskell and can trivially be re-written into other (functional)
languages. We also discuss why a commonly used sort-based shuffle
algorithm falls short of perfect shuffling.
You could use that to shuffle your list and then pick the first item of the shuffled result (or maybe you'd prefer not to give two prizes to the same person -- then use n initial positions of the result, for n = number of prizes); or you could simplify the algorithm to just produce the first item; or you could take a look around that site, because I could have sworn there's an article on picking one random element from an arbitrary tree-like structure with uniform distribution, in a purely functional way, proof of correctness provided, but my search-fu is failing me and I can't seem to find it.

Without truly random bits, you will always have some bias. The number of ways to assign prizes to guests is much larger than any common PRNG's period for even a fairly low number of guests and prizes. As suggested by lpthnc, buy some truly random bits, or buy some random-bit-generating hardware.
As for the algorithm, just do a random shuffle of the guest list. Be careful, as naive shuffling algorithms do have a bias: http://en.wikipedia.org/wiki/Shuffling#Shuffling_algorithms
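For reference, a minimal sketch of the standard (unbiased) Fisher-Yates shuffle, as opposed to the naive variants the linked article warns about; the Random parameter is a stand-in for whatever generator you trust:
// Fisher-Yates shuffle: element i is swapped with a uniformly chosen
// element from [0, i], so every permutation is equally likely.
static void Shuffle<T>(IList<T> list, Random rng)
{
    for (int i = list.Count - 1; i > 0; i--)
    {
        int j = rng.Next(i + 1); // j in [0, i]
        T tmp = list[i];
        list[i] = list[j];
        list[j] = tmp;
    }
}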

You can 100% reliably pick a random item from any arbitrary list with a single pass and without knowing how many items are in the list ahead of time.
Pseudocode:
count = 0.0;
item_selected = none;
foreach item in list
count = count + 1.0;
chance = 1.0 / count;
if ( random( 1.0 ) <= chance ) then item_selected = item;
Test program comparing results of a single rand() % N vs iterating as above:
#include "stdafx.h"
#include <stdio.h>
#include <stdlib.h>
#include <memory.h>
static inline float frand01()
{
return (float)rand() / (float)RAND_MAX;
}
int main()
{
static const int NUM_ITEMS = 50;
int resultRand[NUM_ITEMS];
int resultIterate[NUM_ITEMS];
memset( resultRand, 0, NUM_ITEMS * sizeof(int) );
memset( resultIterate, 0, NUM_ITEMS * sizeof(int) );
for ( int i = 0; i < 100000; i++ )
{
int choiceRand = rand() % NUM_ITEMS;
int choiceIterate = 0;
float count = 0.0;
for ( int item = 0; item < NUM_ITEMS; item++ )
{
count = count + 1.0f;
float chance = 1.0f / count;
if ( frand01() <= chance )
{
choiceIterate = item;
}
}
resultRand[choiceRand]++;
resultIterate[choiceIterate]++;
}
printf("Results:\n");
for ( int i = 0; i < NUM_ITEMS; i++ )
{
printf( "%02d - %5d %5d\n", i, resultRand[i], resultIterate[i] );
}
return 0;
}
Output:
Results:
00 - 2037 2050
01 - 2038 2009
02 - 2094 1986
03 - 2007 1953
04 - 1990 2142
05 - 1867 1962
06 - 1941 1997
07 - 2023 1967
08 - 1998 2070
09 - 1930 1953
10 - 1972 1900
11 - 2013 1985
12 - 1982 2001
13 - 1955 2063
14 - 1952 2022
15 - 1955 1976
16 - 2000 2044
17 - 1976 1997
18 - 2117 1887
19 - 1978 2020
20 - 1886 1934
21 - 1982 2065
22 - 1978 1948
23 - 2039 1894
24 - 1946 2010
25 - 1983 1927
26 - 1965 1927
27 - 2052 1964
28 - 2026 2021
29 - 2090 1993
30 - 2039 2016
31 - 2030 2009
32 - 1970 2094
33 - 2036 2048
34 - 2020 2046
35 - 2010 1998
36 - 2104 2041
37 - 2115 2019
38 - 1959 1986
39 - 1998 2031
40 - 2041 1977
41 - 1937 2060
42 - 1946 2048
43 - 2014 1986
44 - 1979 2072
45 - 2060 2002
46 - 2046 1913
47 - 1995 1970
48 - 1959 2020
49 - 1970 1997


Compressing a vector of positive integers (int32) that have a specific order

I'm trying to compress long vectors (their size ranges from 1 to 100 million elements). The vectors have positive integers with values ranging from 0 to 1 or 100 million (depending on the vector size). Hence, I'm using 32 bit integers to encompass the large numbers but that consumes too much storage.
The vectors have the following characteristic features:
All values are positive integers. Their range grows as the vector size grows.
Values are increasing but smaller numbers do intervene frequently (see the figure below).
None of the values before a specific index are larger than that index (Index starts at zero). For instance, none of the values that occur before the index of 6 are larger than 6. However, smaller values may repeat after that index. This holds true for the entire array.
I'm usually dealing with very long arrays. Hence, as the array length passes 1 million elements, the upcoming numbers are mostly large numbers mixed with previously reoccurring numbers. Smaller numbers usually re-occur more often than larger numbers. New larger numbers are added to the array as you pass through it.
Here is a sample of the values in the array: {initial padding..., 0, 1, 2, 3, 4, 5, 6, 4, 7, 4, 8, 9, 1, 10, ... later..., 1110, 11, 1597, 1545, 1392, 326, 1371, 1788, 541,...}
Here is a plot of a part of the vector:
What do I want? :
Because I'm using 32 bit integers, this is wasting a lot of memory, since smaller numbers that can be represented with fewer than 32 bits repeat too. I want to compress this vector maximally to save memory (ideally by a factor of 3, because only a reduction by that amount or more will meet our needs!). What is the best compression algorithm to achieve that? Or is there a way to take advantage of the array's characteristic features described above to reversibly convert the numbers in that array to 8 bit integers?
Things that I have tried or considered:
Delta encoding: This doesn't work here because the vector is not always increasing.
Huffman coding: Does not seem to help here since the range of unique numbers in the array is quite large, hence, the encoding table will be a large overhead.
Using variable int encoding, i.e. using 8 bit integers for smaller numbers and 16 bit for larger ones, etc. This has reduced the vector size to size*0.7 (not satisfactory since it doesn't take advantage of the specific characteristics described above).
I'm not quite sure if this method described in the following link is applicable to my data: http://ygdes.com/ddj-3r/ddj-3r_compact.html
I don't quite understand the method but it gives me the encouragement to try similar things because I think there is some order in the data that can be taken to its advantage.
For example, I tried to reassign any number (n) larger than 255 to n-255 so that I can keep the integers in the 8 bit realm, because I know that no number is larger than 255 before that index. However, I'm not able to distinguish the reassigned numbers from the repeated numbers... so this idea doesn't work unless I do some more tricks to reverse the re-assignments...
Here is the link to the first 24000 elements of the data for those interested:
data
Any advice or suggestions are deeply appreciated. Thanks a lot in advance.
Edit1:
Here is a plot of the data after delta encoding. As you can see, it doesn't reduce the range!
Edit2:
I was hoping that I could find a pattern in the data that allows me to reversibly change the 32-bit vector to a single 8-bit vector but this seems very unlikely.
I have tried to decompose the 32-bit vector to 4 x 8-bit vectors, hoping that the decomposed vectors lend themselves to compression better.
Below are plots for the 4 vectors. Now their ranges are from 0-255.
What I did was to recursively divide each element in the vectors by 255 and store the remainder into another vector. To reconstruct the original array, all I need to do is: ( ( (vec4*255) + vec3 )*255 + vec2 ) *255 + vec1...
As you can see, the last vector is all zeros for the current shown length of the data.. in fact, this should be zeros all the way to 2^24th element. This will be a 25% reduction if my total vector length was less than 16 million elements but since I'm dealing with much longer vectors this has a much smaller impact.
More importantly, the third vector seems also to have some compressible features as its values do increase by 1 after each 65,535 steps.
It does seem that now I can benefit from Huffman coding or variable bit encoding as suggested. Any suggestions that allows me to maximally compress this data are deeply appreciated.
Here I attached a bigger sample of the data if anyone is interested:
https://drive.google.com/file/d/10wO3-1j3NkQbaKTcr0nl55bOH9P-G1Uu/view?usp=sharing
Edit3:
I'm really thankful for all the given answers. I've learnt a lot from them. For those of you who are interested to tinker with a larger set of the data the following link has 11 million elements of a similar dataset (zipped 33MB)
https://drive.google.com/file/d/1Aohfu6II6OdN-CqnDll7DeHPgEDLMPjP/view
Once you unzip the data, you can use the following C++ snippet to read the data into a vector<int32_t>
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <vector>

const char* path = "path_to\\compression_int32.txt";
std::vector<int32_t> newVector{};
std::ifstream ifs(path, std::ios::in | std::ifstream::binary);
std::istream_iterator<int32_t> iter{ ifs };
std::istream_iterator<int32_t> end{};
std::copy(iter, end, std::back_inserter(newVector));
It's easy to get better than a factor of two compression on your example data by using property 3, where I have taken property 3 to mean that every value must be less than its index, with the indices starting at 1. Simply use ceiling(log2(i)) bits to store the number at index i (where i starts at 1). For your first example with 24,977 values, that compresses it to 43% of the size of the vector using 32-bit integers.
The number of bits required depends only on the length of the vector, n. The number of bits is:
1 - 2^ceiling(log2(n)) + n * ceiling(log2(n))
As noted by Falk Hüffner, a simpler approach would be to use a fixed number of bits for all values, namely ceiling(log2(n)). A variable number of bits will always total less than that, but not much less for large n.
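To make that count concrete, here is a small sketch (in C#, purely illustrative) of the total bit count when the value at 1-based index i is stored in ceiling(log2(i)) bits:
// Total bits when the value at 1-based index i uses ceil(log2(i)) bits,
// which is valid because every value is less than its index.
static long VariableBitTotal(long n)
{
    long bits = 0;
    int width = 0;     // current ceil(log2(i))
    long nextPow = 1;  // smallest power of two >= i
    for (long i = 1; i <= n; i++)
    {
        if (i > nextPow) { width++; nextPow <<= 1; }
        bits += width;
    }
    return bits; // equals 1 - 2^ceil(log2(n)) + n*ceil(log2(n))
}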
If it is common to have a run of zeros at the start, then compress those with a count. There are only a handful of runs of two or three numbers in the remainder, so run-length encoding won't help except for that initial run of zeros.
Another 2% or so (for large sets) could be shaved off using an arithmetic coding approach, considering each value at index k (indices starting at zero) to be a base k+1 digit of a very large integer. That would take ceiling(log2(n!)) bits.
Here is a plot of the compression ratios of the arithmetic coding, variable bits per sample coding, and fixed bits per sample coding, all ratioed to a representation with 32 bits for every sample (the sequence length is on a log scale):
The arithmetic approach requires multiplication and division on integers the length of the compressed data, which is monumentally slow for large vectors. The code below limits the size of the integers to 64 bits, at some cost to the compression ratio, in exchange for it being very fast. This code will give compression ratios about 0.2% to 0.7% more than arithmetic in the plot above, well below variable bits. The data vector must have the property that each value is non-negative and that each value is less than its position (positions starting at one). The compression effectiveness depends only on that property, plus a small reduction if there is an initial run of zeros. There appears to be a bit more redundancy in the provided examples that this compression approach does not exploit.
#include <cstddef>
#include <cstdint>
#include <vector>
#include <cmath>
// Append val, as a variable-length integer, to comp. val must be non-negative.
template <typename T>
void write_varint(T val, std::vector<uint8_t>& comp) {
while (val > 0x7f) {
comp.push_back(val & 0x7f);
val >>= 7;
}
comp.push_back(val | 0x80);
}
// Return the variable-length integer at offset off in comp, updating off to
// point after the integer.
template <typename T>
T read_varint(std::vector<uint8_t> const& comp, size_t& off) {
T val = 0, next;
int shift = 0;
for (;;) {
next = comp.at(off++);
if (next > 0x7f)
break;
val |= next << shift;
shift += 7;
}
val |= (next & 0x7f) << shift;
return val;
}
// Given the starting index i >= 1, find the optimal number of values to code
// into 64 bits or less, or up through index n-1, whichever comes first.
// Optimal is defined as the least amount of entropy lost by representing the
// group in an integral number of bits, divided by the number of bits. Return
// the optimal number of values in num, and the number of bits needed to hold
// an integer representing that group in len.
static void group_ar64(size_t i, size_t n, size_t& num, int& len) {
// Analyze all of the permitted groups, starting at index i.
double min = 1.;
uint64_t k = 1; // integer range is 0..k-1
auto j = i + 1;
do {
k *= j;
auto e = log2(k); // entropy of k possible integers
int b = ceil(e); // number of bits to hold 0..k-1
auto loss = (b - e) / b; // unused entropy per bit
if (loss < min) {
num = j - i; // best number of values so far
len = b; // bit length for that number
if (loss == 0.)
break; // not going to get any better
min = loss;
}
} while (j < n && k <= (uint64_t)-1 / ++j);
}
// Compress the data arithmetically coded as an incrementing base integer, but
// with a 64-bit limit on each integer. This puts values into groups that each
// fit in 64 bits, with the least amount of wasted entropy. Also compress the
// initial run of zeros into a count.
template <typename T>
std::vector<uint8_t> compress_ar64(std::vector<T> const& data) {
// Resulting compressed data vector.
std::vector<uint8_t> comp;
// Start with number of values to make the stream self-terminating.
write_varint(data.size(), comp);
if (data.size() == 0)
return comp;
// Run-length code the initial run of zeros. Write the number of contiguous
// zeros after the first one.
size_t i = 1;
while (i < data.size() && data[i] == 0)
i++;
write_varint(i - 1, comp);
// Compress the data into variable-base integers starting at index i, where
// each integer fits into 64 bits.
unsigned buf = 0; // output bit buffer
int bits = 0; // number of bits in buf (0..7)
while (i < data.size()) {
// Find the optimal number of values to code, starting at index i.
size_t num; int len;
group_ar64(i, data.size(), num, len);
// Code num values.
uint64_t code = 0;
size_t k = 1;
do {
code += k * data[i++];
k *= i;
} while (--num);
// Write code using len bits.
if (bits) {
comp.push_back(buf | (code << bits));
code >>= 8 - bits;
len -= 8 - bits;
}
while (len > 7) {
comp.push_back(code);
code >>= 8;
len -= 8;
}
buf = code;
bits = len;
}
if (bits)
comp.push_back(buf);
return comp;
}
// Decompress the result of compress_ar64(), returning the original values.
// Start decompression at offset off in comp. When done, off is updated to
// point just after the compressed data.
template <typename T>
std::vector<T> expand_ar64(std::vector<uint8_t> const& comp, size_t& off) {
// Will contain the uncompressed data to return.
std::vector<T> data;
// Get the number of values.
auto vals = read_varint<size_t>(comp, off);
if (vals == 0)
return data;
// Get the number of zeros after the first one, and write all of them.
auto run = read_varint<size_t>(comp, off) + 1;
auto i = run;
do {
data.push_back(0);
} while (--run);
// Extract the values from the compressed data starting at index i.
unsigned buf = 0; // input bit buffer
int bits = 0; // number of bits in buf (0..7)
while (i < vals) {
// Find the optimal number of values to code, starting at index i. This
// simply repeats the same calculation that was done when compressing.
size_t num; int len;
group_ar64(i, vals, num, len);
// Read len bits into code.
uint64_t code = buf;
while (bits + 8 < len) {
code |= (uint64_t)comp.at(off++) << bits;
bits += 8;
}
len -= bits; // bits to pull from last byte (1..8)
uint64_t last = comp.at(off++); // last byte
code |= (last & ((1 << len) - 1)) << bits;
buf = last >> len; // save remaining bits in buffer
bits = 8 - len;
// Extract num values from code.
do {
i++;
data.push_back(code % i);
code /= i;
} while (--num);
}
// Return the uncompressed data.
return data;
}
Solving every compression problem should begin with an analysis.
I looked at the raw data file containing the first 24976 values. The smallest value is 0 and the largest is 24950. The "slope" of the data is then around 1. However, it should decrease over time if the maximum is, as told, only 33M at 100M values. The assumption of slope=1 is then a bit pessimistic.
As for the distribution,
tr '[,]' '[\n]' <compression.txt | sort -n | uniq -c | sort -nr | head -n256
produces
164 0
131 8
111 1648
108 1342
104 725
103 11
91 1475
90 1446
82 21
82 1355
78 69
76 2
75 12
72 328
71 24
70 614
70 416
70 1608
70 1266
69 22
67 356
67 3
66 1444
65 19
65 1498
65 10
64 2056
64 16
64 1322
64 1182
63 249
63 1335
61 43
60 17
60 1469
59 33
59 3116
58 20
58 1201
57 303
55 5
55 4
55 2559
55 1324
54 1110
53 1984
53 1357
52 807
52 56
52 4321
52 2892
52 1
50 39
50 2475
49 1580
48 664
48 266
47 317
47 1255
46 981
46 37
46 3531
46 23
43 1923
43 1248
41 396
41 2349
40 7
39 6
39 54
39 4699
39 32
38 815
38 2006
38 194
38 1298
38 1284
37 44
37 1550
37 1369
37 1273
36 1343
35 61
35 3991
35 3606
35 1818
35 1701
34 836
34 27
34 264
34 241
34 1306
33 821
33 28
33 248
33 18
33 15
33 1017
32 9
32 68
32 53
32 240
32 1516
32 1474
32 1390
32 1312
32 1269
31 667
31 326
31 263
31 25
31 160
31 1253
30 3365
30 2082
30 18550
30 1185
30 1049
30 1018
29 73
29 487
29 48
29 4283
29 34
29 243
29 1605
29 1515
29 1470
29 1297
29 1183
28 980
28 60
28 302
28 242
28 1959
28 1779
28 161
27 811
27 51
27 36
27 201
27 1270
27 1267
26 979
26 50
26 40
26 3111
26 26
26 2425
26 1807
25 825
25 823
25 812
25 77
25 46
25 217
25 1842
25 1831
25 1534
25 1464
25 1321
24 730
24 66
24 59
24 427
24 355
24 1465
24 1299
24 1164
24 1111
23 941
23 892
23 7896
23 663
23 607
23 556
23 47
23 2887
23 251
23 1776
23 1583
23 1488
23 1349
23 1244
22 82
22 818
22 661
22 42
22 411
22 3337
22 3190
22 3028
22 30
22 2226
22 1861
22 1363
22 1301
22 1262
22 1158
21 74
21 49
21 41
21 376
21 354
21 2156
21 1688
21 162
21 1453
21 1067
21 1053
20 711
20 413
20 412
20 38
20 337
20 2020
20 1897
20 1814
20 17342
20 173
20 1256
20 1160
19 9169
19 83
19 679
19 4120
19 399
19 2306
19 2042
19 1885
19 163
19 1623
19 1380
18 805
18 79
18 70
18 6320
18 616
18 455
18 4381
18 4165
18 3761
18 35
18 2560
18 2004
18 1900
18 1670
18 1546
18 1291
18 1264
18 1181
17 824
17 8129
17 63
17 52
17 5138
as the most frequent 256 values.
It seems some values are inherently more common. When examined, those common values also seem to be distributed all over the data.
I propose the following:
Divide the data into blocks. For each block, send the actual value of the slope, so when coding each symbol we know its maximum value.
Code the common values in a block with statistical coding (Huffman etc.). In this case, the cutoff with an alphabet of 256 would be around 17 occurrences.
For less common values, we reserve a small part of the alphabet for sending the amount of bits in the value.
When we encounter a rare value, its bits are coded without statistical modeling. The topmost bit can be omitted, since we know it's always 1 (unless value is '0').
Usually the range of values to be coded is not a power-of-2. For example, if we have 10 choices, this requires 4 bits to code, but there are 6 unused bit patterns - sometimes we only need 3 bits. The first 6 choices we code directly with 3 bits. If it's 7 or 8, we send an extra bit to indicate if we meant 9 or 10.
Additionally, we could exclude any value that is directly coded from the list of possible values. Otherwise we have two ways to code the same value, which is redundant.
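What that last coding step describes is essentially truncated binary coding. A minimal sketch of the standard form (in C#, as an illustration; not necessarily the exact variant the answer has in mind):
// Truncated binary code for a symbol in [0, n): the first u symbols get
// k bits, the rest get k+1 bits, where k = floor(log2(n)), u = 2^(k+1) - n.
static (int code, int bits) TruncatedBinary(int symbol, int n)
{
    int k = 0;
    while ((1 << (k + 1)) <= n) k++;   // k = floor(log2(n))
    int u = (1 << (k + 1)) - n;        // number of short codewords
    if (symbol < u)
        return (symbol, k);            // short codeword
    return (symbol + u, k + 1);        // long codeword
}
For n = 10 this gives 3-bit codes for the first 6 symbols and 4-bit codes for the remaining 4, matching the 10-choices example above.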
As I suggested in my comment, you can represent your data as 8 bit. There are simple ways to do it efficiently; no need for modular arithmetic.
You can use a union or pointers for this. For example, in C++, if you have:
unsigned int data32[]={0,0,0,...};
unsigned char *data08=(unsigned char*)data32;
Or you can copy it to 4 BYTE array but that will be slower.
If you have to use modular arithmetics for any reasons then you might want to do it like this:
x &255
(x>> 8)&255
(x>>16)&255
(x>>24)&255
Now I have tried LZW on your new data and the compression ratio result without any data reordering (single LZW) was 81-82% (depending on dictionary size; I suggest using a 10bit LZW dictionary), which is not as good as expected. So I reordered the data into 4 arrays (just like you did), so the first array has the lowest 8 bits and the last the highest. The results with a 12 bit dictionary were:
ratio08: 144%
ratio08: 132%
ratio08: 26%
ratio08: 0%
total: 75%
The results with a 10 bit dictionary were:
ratio08: 123%
ratio08: 117%
ratio08: 28%
ratio08: 0%
total: 67%
Showing that LZW is bad for the lowest bytes (and with increasing size it will be worse for higher bytes too). So use it only for the higher BYTEs, which would improve the compression ratio more.
However I expect huffman should lead to much better results so I computed entropy for your data:
H32 = 5.371071 , H/8 = 0.671384
H08 = 7.983666 , H/8 = 0.997958
H08 = 7.602564 , H/8 = 0.950321
H08 = 1.902525 , H/8 = 0.237816
H08 = 0.000000 , H/8 = 0.000000
total: 54%
meaning naive single huffman encoding would have a compression ratio of 67% and the separate 4 arrays would lead to 54%, which is much better, so in your case I would go for huffman encoding. After I implemented it, here is the result:
[Huffman]
ratio08 = 99.992%
ratio08 = 95.400%
ratio08 = 24.706%
ratio08 = 0.000%
total08 = 55.025%
ratio32 = 67.592%
Which closely matches the estimation by Shannon entropy, as expected (not accounting for the decoding table) ...
However with very big datasets I expect naive huffman will start to get slightly better than the separate 4x huffman ...
Also note that the results were truncated, so those 0% are not zero but something less than 1% ...
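For anyone who wants to reproduce the per-byte entropy figures above, here is a minimal sketch of the calculation (my own illustration, in C#; the byte planes correspond to the x&255, (x>>8)&255, ... split described earlier):
// Shannon entropy (bits per symbol) of one 8-bit plane of the data;
// plane selects which byte of each 32-bit value is measured (0 = lowest).
static double ByteEntropy(int[] data, int plane)
{
    var counts = new long[256];
    foreach (int v in data)
        counts[(v >> (8 * plane)) & 0xFF]++;
    double h = 0.0, n = data.Length;
    foreach (long c in counts)
    {
        if (c == 0) continue;
        double p = c / n;
        h -= p * Math.Log(p, 2);
    }
    return h; // compare against 8 to get the H/8 ratios listed above
}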
[Edit1] 300 000 000 entries estimation
So to simulate the conditions of your 300M 32bit numbers, I use a 16bit sub-part of your data with similar "empty space" properties.
log2(300 000 000) = ~28
28/32 * 16 = 14
so I use only 2^14 16bit numbers, which should have similar properties as your 300M 32 bit numbers. The 8bit Huffman encoding leads to:
ratio08 = 97.980%
ratio08 = 59.534%
total08 = 78.757%
So I estimate 80% ratio between encoded/decoded sizes ~1.25 size reduction.
(Hope I did not screw something up with my assumptions).
The data you are dealing with is "nearly" sorted, so you can use that to great effect with delta encoding.
A simple approach is as follows:
Look for runs of data, denoted by R_i = (v,l,N) where l is the length of the run, N is the bit-depth needed to do delta encoding on the sorted run, and v is the value of the first element of the (sorted) run (needed for delta encoding.) The run itself then just needs to store 2 pieces of information for each entry in the run: the idx of each sorted element in the run and the delta. Note, to store the idx of each sorted element, only log_2(l) bits are needed per idx, where l is the length of the run.
The encoding works by attempting to find the least number of bits to fully encode the run when compared to the number of bytes used in its uncompressed form. In practice, this can be implemented by finding the longest run that is encoded for a fixed number of bytes per element.
To decode, simply decode run-by-run (in order) first decoding the delta coding/compression, then undoing the sort.
Here is some C++ code that computes the compression ratio that can be obtained using this scheme on the data sample you posted. The implementation takes a greedy approach in selecting the runs; it is possible that slightly better results are available if a smarter approach is used.
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <map>
#include <queue>
#include "data.h"
template <int _N, int _M> struct T {
constexpr static int N = _N;
constexpr static int M = _M;
uint16_t idx : N;
uint16_t delta : M;
};
template <int _N, int _M>
std::pair<int32_t, int32_t> best_packed_run_stats(size_t idx) {
const int N = 1 << _N;
const int M = 1 << _M;
static std::vector<int32_t> buffer(N);
if (idx + N >= data.size())
return {-1, 0};
std::copy(&data[idx], &data[idx + N], buffer.data());
std::sort(buffer.begin(), buffer.end());
int32_t run_len = 0;
for (size_t i = 1; i < N; ++i, ++run_len) {
auto delta = buffer[i] - buffer[i - 1];
assert(delta >= 0);
if (delta >= M) {
break;
}
}
int32_t savings = run_len * (sizeof(int32_t) - sizeof(T<_N, _M>)) -
1 // 1 byte to store bit-depth
- 2; // 2 bytes to store run length
return {savings, run_len};
}
template <class... Args>
std::vector<std::pair<int32_t, int32_t>> all_runs_stats(size_t idx) {
return {best_packed_run_stats<Args::N, Args::M>(idx)...};
}
int main() {
size_t total_savings = 0;
for (size_t i = 0; i < data.size(); ++i) {
auto runs =
all_runs_stats<T<2, 14>, T<4, 12>, T<8, 8>, T<12, 4>, T<14, 2>>(i);
auto best_value = *std::max_element(runs.begin(), runs.end());
total_savings += best_value.first;
i += best_value.second;
}
size_t uncomp_size = data.size() * sizeof(int32_t);
double comp_ratio =
(uncomp_size - (double)total_savings) / (double)uncomp_size;
printf("uncomp_size: %lu\n", uncomp_size);
printf("compression: %lf\n", comp_ratio);
printf("size: %lu\n", data.size());
}
Note, only certain fixed configurations of 16-bit representations of elements in a run are attempted. Because of this we should expect the best possible compression we can achieve is 50% (i.e. 4 bytes -> 2 bytes.) In reality, there is overhead.
This code, when run on the data sample you supplied, reports this compression ratio:
uncomp_size: 99908
compression: 0.505785
size: 24977
which is very close to the theoretical limit of .5 for this compression algorithm.
Also, note, that this slightly beats out the Shannon entropy estimate reported in another answer.
Edit to address Mark Adler's comment below.
Re-running this compression on the larger data-set provided (compression2.txt) along with comparing to Mark Adler's approach here are the results:
uncomp_size: 2602628
compression: 0.507544
size: 650657
bit compression: 0.574639
Where bit compression is the compression ratio of Mark Adler's approach. As noted by others, compressing the bits of each entry will not scale well for large data, we should expect the ratio to get worse with n.
Meanwhile the delta + sorting compression described above maintains close to its theoretical best of .5.

Algorithm to give a value to a 5 card Poker hand

I am developing a poker game as college project and our current assignment is to write an algorithm to score a hand of 5 cards, so that the scores of two hands can be compared to each other to determine which is the better hand. The score of a hand has nothing to do with the probability of what hands could be made upon the draw being dealt with random cards, etc. - The score of a hand is based solely on the 5 cards in the hand, and no other cards in the deck.
The example solution we were given was to give a default score for each type of Poker hand, with the score reflecting how good the hand is - like this for instance:
//HAND TYPES:
ROYAL_FLUSH = 900000
STRAIGHT_FLUSH = 800000
...
TWO_PAIR = 200000
ONE_PAIR = 100000
Then if two hands of the same type are compared, the values of the cards in the hands should be factored into the hand's score.
So for example, the following formula could be used to score a hand:
HAND_TYPE + (each card value in the hand)^(the number of occurences of that value)
So, for a Full House of three Queens and two 7s, the score would be:
600000 + 12^3 + 7^2
This formula works for the most part, but I have determined that in some instances, two similar hands can return the exact same score, when one should actually beat the other. An example of this is:
hand1 = 4C, 6C, 6H, JS, KC
hand2 = 3H, 4H, 7C, 7D, 8H
These two hands both have one pair, so their respective scores are:
100000 + 4^1 + 6^2 + 11^1 + 13^1 = 100064
100000 + 3^1 + 4^1 + 7^2 + 8^1 = 100064
This results in a draw, when clearly a pair of 7s trumps a pair of 6s.
How can I improve this formula, or even, what is a better formula I can use?
By the way, in my code, hands are stored in an array of each card's value in ascending order, for example:
[2H, 6D, 10C, KS, AS]
EDIT:
Here is my final solution thanks to the answers below:
/**
* Sorts cards by putting the "most important" cards first, and the rest in decreasing order.
* e.g. High Hand: KS, 9S, 8C, 4D, 2H
* One Pair: 3S, 3D, AH, 7S, 2C
* Full House: 6D, 6C, 6S, JC, JH
* Flush: KH, 9H, 7H, 6H, 3H
*/
private void sort() {
Arrays.sort(hand, Collections.reverseOrder()); // Initially sorts cards in descending order of game value
if (isFourOfAKind()) { // Then adjusts for hands where the "most important" cards
sortFourOfAKind(); // must come first
} else if (isFullHouse()) {
sortFullHouse();
} else if (isThreeOfAKind()) {
sortThreeOfAKind();
} else if (isTwoPair()) {
sortTwoPair();
} else if (isOnePair()){
sortOnePair();
}
}
private void sortFourOfAKind() {
if (hand[0].getGameValue() != hand[HAND_SIZE - 4].getGameValue()) { // If the four of a kind are the last four cards
swapCardsByIndex(0, HAND_SIZE - 1); // swap the first and last cards
} // e.g. AS, 9D, 9H, 9S, 9C => 9C, 9D, 9H, 9S, AS
}
private void sortFullHouse() {
if (hand[0].getGameValue() != hand[HAND_SIZE - 3].getGameValue()) { // If the 3 of a kind cards are the last three
swapCardsByIndex(0, HAND_SIZE - 2); // swap cards 1 and 4, 2 and 5
swapCardsByIndex(HAND_SIZE - 4, HAND_SIZE - 1); // e.g. 10D, 10C, 6H, 6S, 6D => 6S, 6D, 6H, 10D, 10C
}
}
private void sortThreeOfAKind() { // If the 3 of a kind cards are the middle 3 cards
if (hand[0].getGameValue() != hand[HAND_SIZE - 3].getGameValue() && hand[HAND_SIZE - 1].getGameValue() != hand[HAND_SIZE - 3].getGameValue()) { // swap cards 1 and 4
swapCardsByIndex(0, HAND_SIZE - 2); // e.g. AH, 8D, 8S, 8C, 7D => 8C, 8D, 8S, AH, 7D
} else if (hand[0].getGameValue() != hand[HAND_SIZE - 3].getGameValue() && hand[HAND_SIZE - 4].getGameValue() != hand[HAND_SIZE - 3].getGameValue()) {
Arrays.sort(hand); // If the 3 of a kind cards are the last 3,
swapCardsByIndex(HAND_SIZE - 1, HAND_SIZE - 2); // reverse the order (smallest game value to largest)
} // then swap the last two cards (maintain the large to small ordering)
} // e.g. KS, 9D, 3C, 3S, 3H => 3H, 3S, 3C, 9D, KS => 3H, 3S, 3C, KS, 9D
private void sortTwoPair() {
if (hand[0].getGameValue() != hand[HAND_SIZE - 4].getGameValue()) { // If the two pairs are the last 4 cards
for (int i = 0; i < HAND_SIZE - 1; i++) { // "bubble" the first card to the end
swapCardsByIndex(i, i + 1); // e.g. AH, 7D, 7S, 6H, 6C => 7D, 7S, 6H, 6C, AH
}
} else if (hand[0].getGameValue() == hand[HAND_SIZE - 4].getGameValue() && hand[HAND_SIZE - 2].getGameValue() == hand[HAND_SIZE - 1].getGameValue()) { // If the two pairs are the first and last two cards
swapCardsByIndex(HAND_SIZE - 3, HAND_SIZE - 1); // swap the middle and last card
} // e.g. JS, JC, 8D, 4H, 4S => JS, JC, 4S, 4H, 8D
}
private void sortOnePair() { // If the pair are cards 2 and 3, swap cards 1 and 3
if (hand[HAND_SIZE - 4].getGameValue() == hand[HAND_SIZE - 3].getGameValue()) { // e.g QD, 8H, 8C, 6S, 4J => 8C, 8H, QD, 6S, 4J
swapCardsByIndex(0, HAND_SIZE - 3);
} else if (hand[HAND_SIZE - 3].getGameValue() == hand[HAND_SIZE - 2].getGameValue()) { // If the pair are cards 3 and 4, swap 1 and 3, 2 and 4
swapCardsByIndex(0, HAND_SIZE - 3); // e.g. 10S, 8D, 4C, 4H, 2H => 4C, 4H, 10S, 8D, 2H
swapCardsByIndex(HAND_SIZE - 4, HAND_SIZE - 2);
} else if (hand[HAND_SIZE - 2].getGameValue() == hand[HAND_SIZE - 1].getGameValue()) { // If the pair are the last 2 cards, reverse the order
Arrays.sort(hand); // and then swap cards 3 and 5
swapCardsByIndex(HAND_SIZE - 3, HAND_SIZE - 1); // e.g. 9H, 7D, 6C, 3D, 3S => 3S, 3D, 6C, 7D, 9H => 3S, 3D, 9H, 7D, 6C
}
}
/**
* Swaps the two cards of the hand at the indexes taken as parameters
* @param index1
* @param index2
*/
private void swapCardsByIndex(int index1, int index2) {
PlayingCard temp = hand[index1];
hand[index1] = hand[index2];
hand[index2] = temp;
}
/**
* Gives a unique value of any hand, based firstly on the type of hand, and then on the cards it contains
* @return The Game Value of this hand
*
* Firstly, a 24 bit binary string is created where the most significant 4 bits represent the value of the type of hand
* (defined as constants private to this class), the last 20 bits represent the values of the 5 cards in the hand, where
* the "most important" cards are at greater significant places. Finally, the binary string is converter to an integer.
*/
public int getGameValue() {
String handValue = addPaddingToBinaryString(Integer.toBinaryString(getHandValue()));
for (int i = 0; i < HAND_SIZE; i++) {
handValue += addPaddingToBinaryString(Integer.toBinaryString(getCardValue(hand[i])));
}
return Integer.parseInt(handValue, 2);
}
/**
* @param binary
* @return the same binary string padded to 4 bits long
*/
private String addPaddingToBinaryString(String binary) {
switch (binary.length()) {
case 1: return "000" + binary;
case 2: return "00" + binary;
case 3: return "0" + binary;
default: return binary;
}
}
/**
* @return Default value for the type of hand
*/
private int getHandValue() {
if (isRoyalFlush()) { return ROYAL_FLUSH_VALUE; }
if (isStraightFlush()) { return STRAIGHT_FLUSH_VALUE; }
if (isFourOfAKind()) { return FOUR_OF_A_KIND_VALUE; }
if (isFullHouse()) { return FULL_HOUSE_VALUE; }
if (isFlush()) { return FLUSH_VALUE; }
if (isStraight()) { return STRAIGHT_VALUE; }
if (isThreeOfAKind()) { return THREE_OF_A_KIND_VALUE; }
if (isTwoPair()) { return TWO_PAIR_VALUE; }
if (isOnePair()) { return ONE_PAIR_VALUE; }
return 0;
}
/**
* @param card
* @return the value for a given card type, used to calculate the Hand's Game Value
* 2H = 0, 3D = 1, 4S = 2, ... , KC = 11, AH = 12
*/
private int getCardValue(PlayingCard card) {
return card.getGameValue() - 2;
}
There are 10 recognized poker hands:
9 - Royal flush
8 - Straight flush (special case of royal flush, really)
7 - Four of a kind
6 - Full house
5 - Flush
4 - Straight
3 - Three of a kind
2 - Two pair
1 - Pair
0 - High card
If you don't count suit, there are only 13 possible card values. The card values are:
2 - 0
3 - 1
4 - 2
5 - 3
6 - 4
7 - 5
8 - 6
9 - 7
10 - 8
J - 9
Q - 10
K - 11
A - 12
It takes 4 bits to code the hand, and 4 bits each to code the cards. You can code an entire hand in 24 bits.
A royal flush would be 1001 1100 1011 1010 1001 1000 (0x9CBA98)
A 7-high straight would be 0100 0101 0100 0011 0010 0001 (0x454321)
Two pair, 10s and 5s (and an ace) would be 0010 1000 1000 0011 0011 1100 (0x28833C)
I assume you have logic that will figure out what hand you have. In that, you've probably written code to arrange the cards in left-to-right order. So a royal flush would be arranged as [A,K,Q,J,10]. You can then construct the number that represents the hand using the following logic:
int handValue = HandType; (i.e. 0 for high card, 7 for Four of a kind, etc.)
for each card
handValue = (handValue << 4) + cardValue (i.e. 0 for 2, 9 for Jack, etc.)
The result will be a unique value for each hand, and you're sure that a Flush will always beat a Straight and a king-high Full House will beat a 7-high Full House, etc.
Normalizing the hand
The above algorithm depends on the poker hand being normalized, with the most important cards first. So, for example, the hand [K,A,10,J,Q] (all of the same suit) is a royal flush. It's normalized to [A,K,Q,J,10]. If you're given the hand [10,Q,K,A,J], it also would be normalized to [A,K,Q,J,10]. The hand [7,4,3,2,4] is a pair of 4's. It will be normalized to [4,4,7,3,2].
Without normalization, it's very difficult to create a unique integer value for every hand and guarantee that a pair of 4's will always beat a pair of 3's.
Fortunately, sorting the hand is part of figuring out what the hand is. You could do that without sorting, but sorting five items takes a trivial amount of time and it makes lots of things much easier. Not only does it make determining straights easier, it groups common cards together, which makes finding pairs, triples, and quadruples easier.
For straights, flushes, and high card hands, all you need to do is sort. For the others, you have to do a second ordering pass that orders by grouping. For example a full house would be xxxyy, a pair would be xxabc, (with a, b, and c in order), etc. That work is mostly done for you anyway, by the sort. All you have to do is move the stragglers to the end.
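Putting those two steps together, a minimal sketch of the packing itself (in C#, for illustration; the Java code above does the same thing via a binary string):
// Pack a normalized 5-card hand into one comparable integer:
// 4 bits of hand type, then 4 bits per card, most important card first.
// handType: 0 (high card) .. 9 (royal flush); card values: 0 (two) .. 12 (ace).
static int HandScore(int handType, int[] normalizedCardValues)
{
    int score = handType;
    foreach (int v in normalizedCardValues)
        score = (score << 4) | v;
    return score;
}
Comparing two hands is then just an integer comparison.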
As you have found, if you add together the values of the cards in the way you have proposed then you can get ambiguities.
100000 + 4^1 + 6^2 + 11^1 + 13^1 = 100064
100000 + 3^1 + 4^1 + 7^2 + 8^1 = 100064
However, addition is not quite the right tool here. You are already using ^ which means you're partway there. Use multiplication instead and you can avoid ambiguities. Consider:
100000 + (4^1 * 6^2 * 11^1 * 13^1)
100000 + (3^1 * 4^1 * 7^2 * 8^1)
This is nearly correct, but there are still ambiguities (for example 2^4 = 4^2). So, reassign new (prime!) values to each card:
Ace => 2
3 => 3
4 => 5
5 => 7
6 => 11
...
Then, you can multiply the special prime values of each card together to produce a unique value for every possible hand. Add in your value for type of hand (pair, full house, flush, etc) and use that. You may need to increase the magnitude of your hand type values so they stay out of the way of the card value composite.
The highest value for a card will be 14, assuming you let non-face cards keep their value (2..10), then J=11, Q=12, K=13, A=14.
The purpose of the scoring would be to differentiate between hands in a tie-breaking scenario. That is, "pair" vs. "pair." If you detect a different hand configuration ("two pair"), that puts the scores into separate groups.
You should carefully consult your requirements. I suspect that at least for some hands, the participating cards are more important than non-participating cards. For example, does a pair of 4's with a 7-high beat a pair of 3's with a queen-high? (Is 4,4,7,3,2 > 3,3,Q,6,5?) The answer to this should determine an ordering for the cards in the hand.
Given you have 5 cards, and the values are < 16, convert each card to a hexadecimal digit: 2..10, J, Q, K, A => 2..A, B, C, D, E. Put the cards in order, as determined above. For example, 4,4,7,3,2 will probably become 4,4,7,3,2. Map those values to hex, and then to an integer value: "0x44732" -> 0x44732.
Let your combo scores be multiples of 0x100000, to ensure that no card configuration can promote a hand into a higher class, then add them up.

Generating number within range with equal probability with dice

I've been thinking about this but can't seem to figure it out. I need to pick a random integer between 1 to 50 (inclusive) in such a way that each of the integer in it would be equally likely. I will have to do this using a 8 sided dice and a 15 sided dice.
I've read somewhat similar questions related to random number generators with dices but I am still confused. I think it is somewhere along the line of partitioning the numbers into sets. Then, I would roll a die, and then, depending on the outcome, decide which die to roll again.
Can someone help me with this?
As a simple - not necessarily "optimal" - solution, roll the 8 sided die, then the 15 sided:
8 sided 15 sided 1..50 result
1 or 2 1..15 1..15
3 or 4 1..15 16..30 (add 15 to 15-sided roll)
5 or 6 1..15 31..45 (add 30 to 15-sided roll)
7 or 8 1..5 46..50 (add 45 to 15-sided roll)
7 or 8 6..15 start again / reroll both dice
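A minimal sketch of that table as code (C# for illustration; d8 and d15 are assumed to return 1..8 and 1..15):
// Returns a uniform value in 1..50 using one d8 roll and one d15 roll,
// rerolling both dice in the rejected 7-8 / 6-15 corner of the table.
static int RollD50(Func<int> d8, Func<int> d15)
{
    while (true)
    {
        int a = d8();
        int b = d15();
        if (a <= 2) return b;       // 1..15
        if (a <= 4) return b + 15;  // 16..30
        if (a <= 6) return b + 30;  // 31..45
        if (b <= 5) return b + 45;  // a is 7 or 8: 46..50
        // otherwise reroll both dice
    }
}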
Let's say you have two functions: d8(), which returns a number from 0 to 7, and d15(), which returns a number from 0 to 14. You want to write a d50() that returns a number from 0 to 49.
Of all the simple ways, this one is probably the most efficient in terms of how many dice you have to roll, and something like this will work for all combinations of dice you have and dice you want:
int d50()
{
int result;
do
{
result = d8()*8+d8(); //random from 0 to 63
} while(result >=50);
return result;
}
If you want really constant time, you can do this:
int d50()
{
    int result = d15();
    result = result*15 + d15(); //0 to 224
    result = result*8 + d8();   //0 to 1799
    return result/36;           //integer division rounds down
}
This way combines dice until the number of possibilities (1800) is evenly divisible by 50, so the same number of possibilities correspond to each result. This works OK in this case, but doesn't work if the prime factors of the dice you have (2, 3, and 5 in this case), don't cover the factors of the dice you want (2, 5)
I think that you can consider each die result as a subdivision of a bigger interval. So throwing one 8 sided die, you choose one of the 8 major intervals that divide your range of values. Throwing a 15 sided die means selecting one of the 15 sub-intervals, and so on.
Considering that 15 = 3*5, 8 = 2*2*2 and 50 = 2*5*5, you can choose 36 = 3*3*2*2 as a handy multiple of 50 so that:
15*15*8 = 50*36 = 1800
You can even think of expressing the numbers from 0 to 1799 in base 15 and randomly choosing the three digits:
choice = [0-7]*15^2 + [0-14]*15^1 + [0-14]*15^0
So my proposal, with a test of the distribution, is (in the c++ language):
#include <iostream>
#include <random>
#include <map>
int main() {
std::map<int, int> hist;
int result;
std::random_device rd;
std::mt19937 gen(rd()); // initialize the random generator
std::uniform_int_distribution<> d8(0, 7); // instantiate the dice
std::uniform_int_distribution<> d15(0, 14);
for (int i = 0; i < 20000; ++i) { // make a lot of throws...
result = d8(gen) * 225;
result += d15(gen) * 15; // add to result
result += d15(gen);
++hist[ result / 36 + 1]; // count each result
}
for (auto p : hist) { // show the occurences of each result
std::cout << p.first << " : " << p.second << '\n';
}
return 0;
}
The output should be something like this:
1 : 387
2 : 360
3 : 377
4 : 393
5 : 402
...
48 : 379
49 : 378
50 : 420

check overflow when multiply with 3 by bitwise

I have a problem figuring out how to solve this one. I am thinking about returning:
int product = 3 * n;
return (!n || product/n == 3);
However, I can't use those operators.
/*
* Overflow detection of 3*n
* Input is positive
* Example: overflow( 10 ) = 0
* Example: overflow( 1<<30 ) = 1
* Legal ops: & | >> << ~
* Max ops: 10
*
* Number of X86 instructions:
*/
int overflow_3( int n ) {
return 2;
}
The condition is equivalent to checking whether x is larger than INT_MAX / 3, that is, x > 0x2aaaaaaa. Since x is known to be nonnegative, we know that the top bit is zero and thus we can check the condition as follows:
unsigned overflow(unsigned x) {
return (x + 0x55555555) >> 31;
}
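A quick check of the boundary (my arithmetic, not part of the original answer): 0x2aaaaaaa + 0x55555555 = 0x7fffffff, so bit 31 stays clear and the function returns 0, and indeed 3 * 0x2aaaaaaa = 0x7ffffffe does not overflow. For 0x2aaaaaab the sum is 0x80000000, bit 31 is set and the function returns 1, matching 3 * 0x2aaaaaab = 0x80000001, which overflows a signed 32-bit int.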
There are two possible options for a number to overflow when multiplied by 3.
Let's look at X3 multiplication. There are two actions:
1. Shift left by 1 leaves the leftmost bit set. This could only happen if the near-leftmost (i.e. the 30th) bit is set.
2. Shift left by 1 leaves the leftmost bit unset. However, the following addition of the original number results in having the bit set. This could only happen if the 29th bit is set (since it is the only one that will become the 30th after the shift) and if either the 28th or the 27th bit is set (since they can overflow to the 30th bit). However, the 27th bit by itself being set is not enough (since we need the 26th bit to be set, or the 25th and 24th), and so on.
So basically you need a loop here. However since loops are not allowed I would use recursion. So:
int overflow_3(int n){
return n >> 30 || (n >> 29 && overflow_3( (n & ( (1 << 29) - 1)) << 2 ) );
}

algorithm to sum up a list of numbers for all combinations

I have a list of numbers and I want to add up all the different combinations.
For example:
the numbers are 1, 4, 7 and 13
the output would be:
1+4=5
1+7=8
1+13=14
4+7=11
4+13=17
7+13=20
1+4+7=12
1+4+13=18
1+7+13=21
4+7+13=24
1+4+7+13=25
Is there a formula to calculate this with different numbers?
A simple way to do this is to create a bit set with as many bits as there are numbers.
In your example, 4.
Then count from 0001 to 1111 and sum each number that has a 1 on the set:
Numbers 1,4,7,13:
0001 = 13=13
0010 = 7=7
0011 = 7+13 = 20
1111 = 1+4+7+13 = 25
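As a quick illustration of that counting loop (a C# sketch of my own, not tied to any particular answer below):
// Bit i of mask decides whether numbers[i] is included in the sum.
int[] numbers = { 1, 4, 7, 13 };
for (int mask = 1; mask < (1 << numbers.Length); mask++)
{
    int sum = 0;
    for (int i = 0; i < numbers.Length; i++)
        if ((mask & (1 << i)) != 0)
            sum += numbers[i];
    Console.WriteLine("{0} -> {1}",
        Convert.ToString(mask, 2).PadLeft(numbers.Length, '0'), sum);
}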
Here's how a simple recursive solution would look like, in Java:
public static void main(String[] args)
{
f(new int[] {1,4,7,13}, 0, 0, "{");
}
static void f(int[] numbers, int index, int sum, String output)
{
if (index == numbers.length)
{
System.out.println(output + " } = " + sum);
return;
}
// include numbers[index]
f(numbers, index + 1, sum + numbers[index], output + " " + numbers[index]);
// exclude numbers[index]
f(numbers, index + 1, sum, output);
}
Output:
{ 1 4 7 13 } = 25
{ 1 4 7 } = 12
{ 1 4 13 } = 18
{ 1 4 } = 5
{ 1 7 13 } = 21
{ 1 7 } = 8
{ 1 13 } = 14
{ 1 } = 1
{ 4 7 13 } = 24
{ 4 7 } = 11
{ 4 13 } = 17
{ 4 } = 4
{ 7 13 } = 20
{ 7 } = 7
{ 13 } = 13
{ } = 0
The best-known algorithm requires exponential time. If there were a polynomial-time algorithm, then you would solve the subset sum problem, and thus the P=NP problem.
The algorithm here is to create a bitvector of length equal to the cardinality of your set of numbers. Fix an enumeration (n_i) of your set of numbers. Then, enumerate over all possible values of the bitvector. For each enumeration (e_i) of the bitvector, compute the sum of e_i * n_i.
The intuition here is that you are representing the subsets of your set of numbers by a bitvector and generating all possible subsets of the set of numbers. When bit e_i is equal to one, n_i is in the subset, otherwise it is not.
The fourth volume of Knuth's TAOCP provides algorithms for generating all possible values of the bitvector.
C#:
I was trying to find something more elegant - but this should do the trick for now...
//Set up our array of integers
int[] items = { 1, 3, 5, 7 };
//Figure out how many bitmasks we need...
//4 bits have a maximum value of 15, so we need 15 masks.
//Calculated as:
// (2 ^ ItemCount) - 1
int len = items.Length;
int calcs = (int)Math.Pow(2, len) - 1;
//Create our array of bitmasks... each item in the array
//represents a unique combination from our items array
string[] masks = Enumerable.Range(1, calcs).Select(i => Convert.ToString(i, 2).PadLeft(len, '0')).ToArray();
//Spit out the corresponding calculation for each bitmask
foreach (string m in masks)
{
//Get the items from our array that correspond to
//the on bits in our mask
int[] incl = items.Where((c, i) => m[i] == '1').ToArray();
//Write out our mask, calculation and resulting sum
Console.WriteLine(
"[{0}] {1}={2}",
m,
String.Join("+", incl.Select(c => c.ToString()).ToArray()),
incl.Sum()
);
}
Outputs as:
[0001] 7=7
[0010] 5=5
[0011] 5+7=12
[0100] 3=3
[0101] 3+7=10
[0110] 3+5=8
[0111] 3+5+7=15
[1000] 1=1
[1001] 1+7=8
[1010] 1+5=6
[1011] 1+5+7=13
[1100] 1+3=4
[1101] 1+3+7=11
[1110] 1+3+5=9
[1111] 1+3+5+7=16
Here is a simple recursive Ruby implementation:
a = [1, 4, 7, 13]
def add(current, ary, idx, sum)
(idx...ary.length).each do |i|
add(current + [ary[i]], ary, i+1, sum + ary[i])
end
puts "#{current.join('+')} = #{sum}" if current.size > 1
end
add([], a, 0, 0)
Which prints
1+4+7+13 = 25
1+4+7 = 12
1+4+13 = 18
1+4 = 5
1+7+13 = 21
1+7 = 8
1+13 = 14
4+7+13 = 24
4+7 = 11
4+13 = 17
7+13 = 20
If you do not need to print the array at each step, the code can be made even simpler and much faster because no additional arrays are created:
def add(ary, idx, sum)
(idx...ary.length).each do |i|
add(ary, i+1, sum + ary[i])
end
puts sum
end
add(a, 0, 0)
I don't think you can have it much simpler than that.
Mathematica solution:
{#, Total@#}& /@ Subsets[{1, 4, 7, 13}] //MatrixForm
Output:
{} 0
{1} 1
{4} 4
{7} 7
{13} 13
{1,4} 5
{1,7} 8
{1,13} 14
{4,7} 11
{4,13} 17
{7,13} 20
{1,4,7} 12
{1,4,13} 18
{1,7,13} 21
{4,7,13} 24
{1,4,7,13} 25
This Perl program seems to do what you want. It goes through the different ways to choose n items from k items. It's easy to calculate how many combinations there are, but getting the sums of each combination means you have to add them eventually. I had a similar question on Perlmonks when I was asking How can I calculate the right combination of postage stamps?.
The Math::Combinatorics module can also handle many other cases. Even if you don't want to use it, the documentation has a lot of pointers to other information about the problem. Other people might be able to suggest the appropriate library for the language you'd like to use.
#!/usr/bin/perl
use List::Util qw(sum);
use Math::Combinatorics;
my @n = qw(1 4 7 13);
foreach my $count ( 2 .. @n ) {
    my $c = Math::Combinatorics->new(
        count => $count, # number to choose
        data  => [@n],
    );
    print "combinations of $count from: [" . join(" ", @n) . "]\n";
    while( my @combo = $c->next_combination ){
        print join( ' ', @combo ), " = ", sum( @combo ), "\n";
    }
}
You can enumerate all subsets using a bitvector.
In a for loop, go from 0 to 2 to the Nth power minus 1 (or start with 1 if you don't care about the empty set).
On each iteration, determine which bits are set. The Nth bit represents the Nth element of the set. For each set bit, dereference the appropriate element of the set and add to an accumulated value.
ETA: Because the nature of this problem involves exponential complexity, there's a practical limit to size of the set you can enumerate on. If it turns out you don't need all subsets, you can look up "n choose k" for ways of enumerating subsets of k elements.
PHP: Here's a non-recursive implementation. I'm not saying this is the most efficient way to do it (this is indeed exponential 2^N - see JasonTrue's response and comments), but it works for a small set of elements. I just wanted to write something quick to obtain results. I based the algorithm off Toon's answer.
$set = array(3, 5, 8, 13, 19);
$additions = array();
for($i = 0; $i < pow(2, count($set)); $i++){
$sum = 0;
$addends = array();
for($j = count($set)-1; $j >= 0; $j--) {
if(pow(2, $j) & $i) {
$sum += $set[$j];
$addends[] = $set[$j];
}
}
$additions[] = array($sum, $addends);
}
sort($additions);
foreach($additions as $addition){
printf("%d\t%s\n", $addition[0], implode('+', $addition[1]));
}
Which will output:
0
3 3
5 5
8 8
8 5+3
11 8+3
13 13
13 8+5
16 13+3
16 8+5+3
18 13+5
19 19
21 13+8
21 13+5+3
22 19+3
24 19+5
24 13+8+3
26 13+8+5
27 19+8
27 19+5+3
29 13+8+5+3
30 19+8+3
32 19+13
32 19+8+5
35 19+13+3
35 19+8+5+3
37 19+13+5
40 19+13+8
40 19+13+5+3
43 19+13+8+3
45 19+13+8+5
48 19+13+8+5+3
For example, a case for this could be a set of resistance bands for working out. Say you get 5 bands each having different resistances represented in pounds and you can combine bands to sum up the total resistance. The bands resistances are 3, 5, 8, 13 and 19 pounds. This set gives you 32 (2^5) possible configurations, minus the zero. In this example, the algorithm returns the data sorted by ascending total resistance favoring efficient band configurations first, and for each configuration the bands are sorted by descending resistance.
This is not the code to generate the sums, but it generates the permutations. In your case:
1; 1,4; 1,7; 4,7; 1,4,7; ...
If I have a moment over the weekend, and if it's interesting, I can modify this to come up with the sums.
It's just a fun chunk of LINQ code from Igor Ostrovsky's blog titled "7 tricks to simplify your programs with LINQ" (http://igoro.com/archive/7-tricks-to-simplify-your-programs-with-linq/).
T[] arr = …;
var subsets = from m in Enumerable.Range(0, 1 << arr.Length)
select
from i in Enumerable.Range(0, arr.Length)
where (m & (1 << i)) != 0
select arr[i];
You might be interested in checking out the GNU Scientific Library if you want to avoid maintenance costs. The actual process of summing longer sequences will become very expensive (more-so than generating a single permutation on a step basis), most architectures have SIMD/vector instructions that can provide rather impressive speed-up (I would provide examples of such implementations but I cannot post URLs yet).
Thanks Zach,
I am creating a Bank Reconciliation solution. I dropped your code into jsbin.com to do some quick testing and produced this in Javascript:
function f(numbers,ids, index, sum, output, outputid, find )
{
if (index == numbers.length){
var x ="";
if (find == sum) {
y= output + " } = " + sum + " " + outputid + " }<br/>" ;
}
return;
}
f(numbers,ids, index + 1, sum + numbers[index], output + " " + numbers[index], outputid + " " + ids[index], find);
f(numbers,ids, index + 1, sum, output, outputid,find);
}
var y;
f( [1.2,4,7,13,45,325,23,245,78,432,1,2,6],[1,2,3,4,5,6,7,8,9,10,11,12,13], 0, 0, '{','{', 24.2);
if (document.getElementById('hello')) {
document.getElementById('hello').innerHTML = y;
}
I need it to produce a list of ID's to exclude from the next matching number.
I will post back my final solution using vb.net
v=[1,2,3,4]  # variables to sum
i=0
clis=[]      # check list for solution excluding the variables itself

def iterate(lis,a,b):
    global i
    global clis
    while len(b)!=0 and i<len(lis):
        a=lis[i]
        b=lis[i+1:]
        if len(b)>1:
            t=a+sum(b)
            clis.append(t)
        for j in b:
            clis.append(a+j)
        i+=1
        iterate(lis,a,b)

iterate(v,0,v)
It's written in Python. The idea is to break the list into a single integer and a list, e.g. [1,2,3,4] into 1, [2,3,4]. We append the total sum by adding the integer and the sum of the remaining list; we also take each individual sum, i.e. 1+2, 1+3, 1+4. The check list will now be [1+2+3+4, 1+2, 1+3, 1+4]. Then we call the function recursively on the new list, i.e. now int=2, list=[3,4]; the check list will then append [2+3+4, 2+3, 2+4]. We keep appending to the check list until the list is empty.
set is the set of sums and list is the list of the original numbers.
It's Java.
public void subSums() {
Set<Long> resultSet = new HashSet<Long>();
for(long l: list) {
for(long s: set) {
resultSet.add(s);
resultSet.add(l + s);
}
resultSet.add(l);
set.addAll(resultSet);
resultSet.clear();
}
}
public static void main(String[] args) {
// this is an example number
long number = 245L;
int sum = 0;
if (number > 0) {
do {
int last = (int) (number % 10);
sum = (sum + last) % 9;
} while ((number /= 10) > 0);
System.err.println("s = " + (sum==0 ? 9:sum);
} else {
System.err.println("0");
}
}
