Optimal algorithm for encoding data within pre-existing data

Say we have some existing random data, uniformly distributed, that's been written to some medium. Let's also say that writing a 1 bit is a destructive action, while writing a 0 bit is non-destructive. This could be analogous to punching holes in a punch card or burning a fuse in a circuit to hard-code a 1. After this initial write has taken place, let's say we now wish to write additional data to this medium, with some special encoding that will enable us to retrieve this new data in the future, with no special knowledge of the pre-existing data that was already destructively written to the medium. The pre-existing data itself does not need to remain retrievable; only the new data does.
Also assume that the new data is itself random, or at least has already been compressed to be effectively random.
Obviously we cannot expect to exceed ~50% of the capacity of the original storage, since only roughly half will be writable. The trouble is trying to push the efficiency as close to this limit as possible.
I have two fairly simple encodings with seemingly reasonable efficiencies; however, they do not appear to be optimal. For ease of explanation, I will assume the medium is a paper tape with holes (denoting a 1) or gaps (denoting a 0) at regular intervals.
Encoding A
Encode 1 bits with a gap at an even offset along the tape, and 0 bits with a gap at an odd offset.
Offset: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Tape: G G H H H H H G H H H G H H H H G H H H
| | | | |
New Data: 1 0 0 0 1
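For concreteness, here is a rough Python sketch of Encoding A (my own illustration; encode_a and decode_a are made-up names, and punching out the leftover trailing gaps is a simplification that stands in for a proper length convention):

def encode_a(tape, bits):
    # Encoding A: a surviving gap at an even offset encodes a 1, at an odd
    # offset a 0. Any gap of the wrong parity is destructively punched to 'H'.
    tape = list(tape)
    pos, written = 0, 0
    for b in bits:
        want = 0 if b == 1 else 1                # offset parity we need to keep
        while pos < len(tape):
            if tape[pos] == 'G' and pos % 2 == want:
                pos += 1                         # keep this gap: it encodes the bit
                written += 1
                break
            if tape[pos] == 'G':
                tape[pos] = 'H'                  # wrong parity: burn it
            pos += 1
        else:
            break                                # ran out of tape
    for i in range(pos, len(tape)):              # simplification: erase leftover gaps
        if tape[i] == 'G':
            tape[i] = 'H'
    return ''.join(tape), written

def decode_a(tape):
    # Each surviving gap decodes to one bit from its offset parity.
    return [1 if i % 2 == 0 else 0 for i, c in enumerate(tape) if c == 'G']

On the tape from the example above ('GGHHHHHGHHHGHHHHGHHH'), decode_a returns [1, 0, 0, 0, 1].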
Encoding B
Encode 1 bits with a gap followed by another gap, and 0 bits with a gap followed by a hole.
Offset: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Tape: H G G H H H H G H H H G G H H H G H G H
\/ \/ \/ \/ \/
New Data: 1 0 1 0 0
Both of these, on average, can encode 25% of the original storage capacity's worth of data. This seems pretty reasonable, at least to me: only half of the tape was gaps to begin with, and any holes we write lose information (since we cannot know whether they pre-existed when we later attempt to decode), so a quarter: 25%. At first it seems like it could be optimal.
However, what's bothering me is that it seems like this is not in fact the case. I have a somewhat more complex encoding that consistently breaks this threshold.
Encoding C
At its core, we encode 1 bits as runs of holes of odd length, and 0 bits as runs of holes of even length. So HHG would represent 0 while HG or HHHG would represent 1.
This by itself is not enough to cross the 25% threshold however, so we add one optimization. When we see two successive gaps after a run of holes and its terminating gap, we treat those gaps as a 0. The logic behind this is that if we had wanted to encode a 1, we could have just punched out the first of the two gaps, producing HG, encoding a 1. Since we did not do so, we can assume that we needed to encode a 0 instead.
With this optimization, encoding C reaches a stable 26.666% ± 0.005% storage efficiency, consistently above the theoretical 25%. I also have a handful of really tricky peephole optimizations that I can tack onto this encoding, which I suspect would push its efficiency up to 28-30%, but it feels like I'm already overthinking this.
Encoding D?
This isn't really an encoding, but one last thought I had, which can improve any of the above encodings. Consider an encoding E(x), and some deterministic, reversible transformation T(x) which mutates the data in some arbitrary fashion. Say we prepend a 0 bit to our data-to-be-encoded and encode it: E('0' + DATA). Say we also mutate our data, prepend a 1 bit, and then encode it: E('1' + T(DATA)). We could then compare the two, see which happened to encode more data (greater efficiency), and choose that one. The improvement would be small overall, hinging on statistical variation, and we did sacrifice a bit as an indicator, but as long as the data is large enough, the average savings will outweigh the single bit lost. This could be generalized to partitioning the data into a few partitions and choosing between two or more encodings, whichever happened to randomly fit best, but that's beside the point. Overall, the improvement would be small, but it should still be a strict improvement, indicating that the base encoding E(x) is not optimal.
To recap- I'm looking for the most efficient (in the average case) lossless encoding for writing data to a medium that's already been semi-destructively (1's only) written to with pure random data. I really believe the optimal solution is somewhere between 30-50% efficiency, but I'm struggling. I hope someone can either share what the optimal encoding is, or shed some light on the relevant literature / theory around this topic.
Side note: In the process of trying to get a better understanding of this problem, I tried to create a less efficient algorithm that bounds the worst-case encoding efficiency anywhere above 0%, but failed. It seems that no matter which encoding I tried, even if half of the storage were guaranteed to be writable, in an astronomically unlikely case the ordering of the pre-existing data can ensure that we're unable to encode even a single bit of the new data. This isn't really a concern for the actual problem statement, since I'm concerned with the average-case efficiency, but it was unsettling.

I suspect that the expected capacity approaches 50% in the limit as the
number of bits n → ∞.
The encoding algorithm that I have in mind uses linear algebra over the
finite field F2. Ahead of time, choose ε > 0 and a random
matrix A of dimension (1 − ε) n/2 × n. To decode a vector x, if x is not
all ones, then return the matrix-vector product A (1 − x); otherwise,
fail. To encode a vector b, use Gaussian elimination to solve for
nonzero x′ in A′ (1 − x′) = b, where A′ omits the columns corresponding
to preexisting one bits and x′ omits the rows corresponding to
preexisting one bits. If there is no solution, punch a lace
card.
I don’t have time to write and verify a formal proof, but my intuition
is that, in taking sub-matrices of A, the probability that we encounter
linear dependence decreases very fast as ε edges away from zero.
I implemented a simulation of a more practical version of this algorithm
in C++ below (which could be extended to an encoder/decoder pair without
too much trouble). Instead of being all or nothing, it determines the
longest prefix that it can control and uses the block as a 6-bit length
followed by that amount of data, realizing ≈38% of the original storage.
The length prefix is eating about 9% here (we had control of ≈47% of the
bits), but it approaches 0% in the large-n limit. Even with 64-bit
blocks, you could do better by Huffman coding the lengths.
(You might wonder what happens if we can’t even control all of the
length bits. We can set things up so that the lace card decodes to
length 0, which implies that it is skipped, then punch a lace card
whenever that has to happen.)
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <limits>
#include <optional>
#include <random>
#include <vector>

namespace {

static std::random_device g_dev;

class Vector {
 public:
  explicit Vector() = default;

  static Vector Random() {
    return Vector{std::uniform_int_distribution<Bits>{
        0, std::numeric_limits<Bits>::max()}(g_dev)};
  }

  Vector &operator^=(const Vector v) {
    bits_ ^= v.bits_;
    return *this;
  }

  bool operator[](const std::size_t i) const { return bits_ & (Bits{1} << i); }

 private:
  using Bits = unsigned long long;

  explicit Vector(const Bits bits) : bits_{bits} {}

  Bits bits_ = 0;
};

class Matrix {
 public:
  static Matrix Random(const std::size_t num_rows) {
    Matrix a;
    a.rows_.reserve(num_rows);
    for (std::size_t i = 0; i < num_rows; ++i) {
      a.rows_.push_back(Vector::Random());
    }
    return a;
  }

  Matrix RandomSubsetOfRows(const double p) const {
    Matrix a;
    for (const Vector row : rows_) {
      if (std::bernoulli_distribution{p}(g_dev)) {
        a.rows_.push_back(row);
      }
    }
    return a;
  }

  std::size_t NumControllablePrefixBits() && {
    for (std::size_t j = 0; true; ++j) {
      const auto pivot = [&]() -> std::optional<std::size_t> {
        for (std::size_t i = j; i < rows_.size(); ++i) {
          if (rows_[i][j]) {
            return i;
          }
        }
        return std::nullopt;
      }();
      if (!pivot) {
        return j;
      }
      std::swap(rows_[j], rows_[*pivot]);
      for (std::size_t i = 0; i < rows_.size(); ++i) {
        if (i != j && rows_[i][j]) {
          rows_[i] ^= rows_[j];
        }
      }
    }
  }

 private:
  std::vector<Vector> rows_;
};

}  // namespace

int main() {
  static constexpr std::size_t kBlocks = 10000;
  static constexpr std::size_t kLogBlockSize = 6;
  static constexpr std::size_t kBlockSize = 1 << kLogBlockSize;
  std::size_t num_usable_bits = 0;
  const Matrix a = Matrix::Random(kBlockSize);
  for (int i = 0; i < kBlocks; ++i) {
    num_usable_bits +=
        a.RandomSubsetOfRows(0.5).NumControllablePrefixBits() - kLogBlockSize;
  }
  std::cout << static_cast<double>(num_usable_bits) / (kBlocks * kBlockSize)
            << "\n";
}

I implemented the full algorithm in Python. I augment the matrix with an identity matrix, as in a matrix inversion algorithm, but zero out the columns that correspond to holes.
import math
import random
log_n = 10
n = log_n + ((1 << log_n) - 1)
matrix = [random.randrange(1 << n) for j in range(n)]
def decode(data):
    text = 0
    for j in range(n):
        if (data & (1 << j)) == 0:
            text ^= matrix[j]
    return text

def invert(existing_data):
    sub = [matrix[j] for j in range(n) if (existing_data & (1 << j)) == 0]
    inverse = [1 << j for j in range(n) if (existing_data & (1 << j)) == 0]
    i = 0
    while True:
        for k in range(i, len(sub)):
            if sub[k] & (1 << i):
                break
        else:
            return inverse[:i]
        sub[i], sub[k] = sub[k], sub[i]
        inverse[i], inverse[k] = inverse[k], inverse[i]
        for k in range(len(sub)):
            if k != i and (sub[k] & (1 << i)):
                sub[k] ^= sub[i]
                inverse[k] ^= inverse[i]
        i += 1

def encode(inverse, text):
    data = ~((~0) << n)
    for i in range(len(inverse)):
        if text & (1 << i):
            data ^= inverse[i]
    return data

def test():
    existing_data = random.randrange(1 << n)
    inverse = invert(existing_data)
    payload_size = max(len(inverse) - log_n, 0)
    payload = random.randrange(1 << payload_size)
    text = payload_size ^ (payload << log_n)
    data = encode(inverse, text)
    assert (existing_data & ~data) == 0
    decoded_text = decode(data)
    decoded_payload_size = decoded_text & (~((~0) << log_n))
    decoded_payload = (decoded_text >> log_n) & (~((~0) << payload_size))
    assert payload_size == decoded_payload_size
    assert payload == decoded_payload
    return payload_size / n

print(sum(test() for i in range(100)))

Related

Iterate binary numbers with the same quantity of ones (or zeros) in random order

I need to generate binary numbers with the same quantity of ones (or zeros) in random order.
Does anyone know any efficient algorithm for fixed-length binary numbers?
Example for 2 ones and 4 digits (just to be more clear):
1100
1010
1001
0110
0101
0011
UPDATE
Random order w/o repetitions is significant. Sequence of binary numbers required, not single permutation.
If you have enough memory to store all the possible bit sequences, and you don't mind generating them all before you have the first result, then the solution would be to use some efficient generator to produce all possible sequences into a vector and then shuffle the vector using the Fisher-Yates shuffle. That's easy and unbiased (as long as you use a good random number generator to do the shuffle) but it can use a lot of memory if n is large, particularly if you are not sure you will need to complete the iteration.
But there are a couple of solutions which do not require keeping all the possible words in memory. (C implementations of the two solutions follow the text.)
1. Bit shuffle an enumeration
The fastest one (I think) is to first generate a random shuffle of bit values, and then iterate over the possible words one at a time applying the shuffle to the bits of each value. In order to avoid the complication of shuffling actual bits, the words can be generated in a Gray code order in which only two bit positions are changed from one word to the next. (This is also known as a "revolving-door" iteration because as each new 1 is added, some other 1 must be removed.) This allows the bit mask to be updated rapidly, but it means that successive entries are highly correlated, which may be unsuitable for some purposes. Also, for small values of n the number of possible bit shuffles is very limited, so there will not be a lot of different sequences produced. (For example, for the case where n is 4 and k is 2, there are 6 possible words which could be sequenced in 6! (720) different ways, but there are only 4! (24) bit-shuffles. This could be ameliorated slightly by starting the iteration at a random position in the sequence.)
It is always possible to find a Gray code. Here's an example for n=6, k=3; two bits are swapped at each step:
111000 010110 100011 010101
101100 001110 010011 001101
011100 101010 001011 101001
110100 011010 000111 011001
100110 110010 100101 110001
This sequence can be produced by a recursive algorithm similar to that suggested by @JasonBoubin -- the only difference is that the second half of each recursion needs to be produced in reverse order -- but it's convenient to use a non-recursive version of the algorithm. The one in the sample code below comes from Frank Ruskey's unpublished manuscript on Combinatorial Generation (Algorithm 5.7 on page 130). I modified it to use 0-based indexing, as well as adding the code to keep track of the binary representations.
2. Randomly generate an integer sequence and convert it to combinations
The "more" random but somewhat slower solution is to produce a shuffled list of enumeration indices (which are sequential integers in [0, n choose k)) and then find the word corresponding to each index.
The simplest pseudo-random way to produce a shuffled list of integers in a contiguous range is to use a randomly-chosen Linear Congruential Generator (LCG). An LCG is the recursive sequence x_i = (a * x_(i-1) + c) mod m. If m is a power of 2, a mod 4 is 1 and c mod 2 is 1, then that recursion will cycle through all m possible values. To cycle through the range [0, n choose k), we simply select m to be the next larger power of 2, and then skip any values which are not in the desired range. (That will be fewer than half the values produced, for obvious reasons.)
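As a rough Python sketch of that idea (lcg_indices is a name I made up; it picks constants satisfying the conditions above and skips out-of-range values):

import random

def lcg_indices(count):
    # Full-period LCG over the next power of two >= count; values >= count are
    # skipped, so each index in [0, count) is yielded exactly once.
    m = 4
    while m < count:
        m <<= 1
    a = 4 * random.randrange(m // 4) + 1   # a mod 4 == 1
    c = 2 * random.randrange(m // 2) + 1   # c odd, hence coprime to m
    x = random.randrange(m)
    produced = 0
    while produced < count:
        x = (a * x + c) % m
        if x < count:
            produced += 1
            yield x

One full period of the recurrence visits every residue exactly once, so fewer than half of the draws are wasted, as noted above.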
To convert the enumeration index into an actual word, we perform a binomial decomposition of the index based on the fact that the set of n choose k words consists of n-1 choose k words starting with a 0 and n-1 choose k-1 words starting with a 1. So to produce the ith word:
if i < n-1 choose k we output a 0 and then the ith word in the set of n-1 bit words with k bits set;
otherwise, we output a 1 and then subtract n-1 choose k from i as the index into the set of n-1 bit words with k-1 bits set.
It's convenient to precompute all the useful binomial coefficients.
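To make the decomposition concrete, here is a plain (unoptimized) Python sketch of the index-to-word conversion; index_to_word is a made-up name, and it uses math.comb instead of the precomputed table used in the C code further down:

from math import comb

def index_to_word(i, n, k):
    # Map an enumeration index i in [0, C(n, k)) to the i-th n-bit word with
    # exactly k ones: at each position, count how many words put a 0 there.
    word = 0
    for pos in range(n - 1, -1, -1):
        if k == 0:
            break                        # remaining positions are all zeros
        zeros_here = comb(pos, k)        # words of this length starting with 0
        if i >= zeros_here:
            i -= zeros_here              # this position gets a 1
            word |= 1 << pos
            k -= 1
    return word

With this ordering, index 0 comes out as k ones in the lowest positions (the value 2^k - 1), matching the short-cut described below.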
LCGs suffer from the disadvantage that they are quite easy to predict after the first few terms are seen. Also, some of the randomly-selected values of a and c will produce index sequences where successive indices are highly correlated. (Also, the low-order bits are always quite non-random.) Some of these problems could be slightly ameliorated by also applying a random bit-shuffle to the final result. This is not illustrated in the code below but it would slow things down very little and it should be obvious how to do it. (It basically consists of replacing 1UL<<n with a table lookup into the shuffled bits).
The C code below uses some optimizations which make it a bit challenging to read. The binomial coefficients are stored in a lower-diagonal array:
row index
[ 0] 1
[ 1] 1 1
[ 3] 1 2 1
[ 6] 1 3 3 1
[10] 1 4 6 4 1
As can be seen, the array index for binom(n, k) is n(n+1)/2 + k, and if we have that index, we can find binom(n-1, k) by simply subtracting n, and binom(n-1, k-1) by subtracting n+1. In order to avoid needing to store zeros in the array, we make sure that we never look up a binomial coefficient where k is negative or greater than n. In particular, if we have arrived at a point in the recursion where k == n or k == 0, we can definitely know that the index to look up is 0, because there is only one possible word. Furthermore, index 0 in the set of words with some n and k will consist precisely of n-k zeros followed by k ones, which is the n-bit binary representation of 2^k - 1. By short-cutting the algorithm when the index reaches 0, we can avoid having to worry about the cases where one of binom(n-1, k) or binom(n-1, k-1) is not a valid index.
C code for the two solutions
Gray code with shuffled bits
void gray_combs(int n, int k) {
/* bit[i] is the ith shuffled bit */
uint32_t bit[n+1];
{
uint32_t mask = 1;
for (int i = 0; i < n; ++i, mask <<= 1)
bit[i] = mask;
bit[n] = 0;
shuffle(bit, n);
}
/* comb[i] for 0 <= i < k is the index of the ith bit
* in the current combination. comb[k] is a sentinel. */
int comb[k + 1];
for (int i = 0; i < k; ++i) comb[i] = i;
comb[k] = n;
/* Initial word has the first k (shuffled) bits set */
uint32_t word = 0;
for (int i = 0; i < k; ++i) word |= bit[i];
/* Now iterate over all combinations */
int j = k - 1; /* See Ruskey for meaning of j */
do {
handle(word, n);
if (j < 0) {
word ^= bit[comb[0]] | bit[comb[0] - 1];
if (--comb[0] == 0) j += 2;
}
else if (comb[j + 1] == comb[j] + 1) {
word ^= bit[comb[j + 1]] | bit[j];
comb[j + 1] = comb[j]; comb[j] = j;
if (comb[j + 1] == comb[j] + 1) j += 2;
}
else if (j > 0) {
word ^= bit[comb[j - 1]] | bit[comb[j] + 1];
comb[j - 1] = comb[j]; ++comb[j];
j -= 2;
}
else {
word ^= bit[comb[j]] | bit[comb[j] + 1];
++comb[j];
}
} while (comb[k] == n);
}
LCG with enumeration index to word conversion
static const uint32_t* binom(unsigned n, unsigned k) {
static const uint32_t b[] = {
1,
1, 1,
1, 2, 1,
1, 3, 3, 1,
1, 4, 6, 4, 1,
1, 5, 10, 10, 5, 1,
1, 6, 15, 20, 15, 6, 1,
// ... elided for space
};
return &b[n * (n + 1) / 2 + k];
}
static uint32_t enumerate(const uint32_t* b, uint32_t r, unsigned n, unsigned k) {
uint32_t rv = 0;
while (r) {
do {
b -= n;
--n;
} while (r < *b);
r -= *b;
--b;
--k;
rv |= 1UL << n;
}
return rv + (1UL << k) - 1;
}
static bool lcg_combs(unsigned n, unsigned k) {
const uint32_t* b = binom(n, k);
uint32_t count = *b;
uint32_t m = 1; while (m < count) m <<= 1;
uint32_t a = 4 * randrange(1, m / 4) + 1;
uint32_t c = 2 * randrange(0, m / 2) + 1;
uint32_t x = randrange(0, m);
while (count--) {
do
x = (a * x + c) & (m - 1);
while (x >= *b);
handle(enumerate(b, x, n, k), n);
}
return true;
}
Note: I didn't include the implementation of randrange or shuffle; code is readily available. randrange(low, lim) produces a random integer in the range [low, lim); shuffle(vec, n) randomly shuffles the integer vector vec of length n.
Also, the loop calls handle(word, n) for each generated word. That must be replaced with whatever is to be done with each combination.
With handle defined as a function which does nothing, gray_combs took 150 milliseconds on my laptop to find all 40,116,600 28-bit words with 14 bits set. lcg_combs took 5.5 seconds.
Integers with exactly k bits set are easy to generate in order.
You can do that, and then change the order by applying a bit-permutation to the results (see below). For example, here's a randomly generated 16-bit bit-permutation (you should pick one with the right number of bits, based on the word size, not on the number of set bits); not tested:
uint permute(uint x) {
x = bit_permute_step(x, 0x00005110, 1); // Butterfly, stage 0
x = bit_permute_step(x, 0x00000709, 4); // Butterfly, stage 2
x = bit_permute_step(x, 0x000000a1, 8); // Butterfly, stage 3
x = bit_permute_step(x, 0x00005404, 1); // Butterfly, stage 0
x = bit_permute_step(x, 0x00000231, 2); // Butterfly, stage 1
return x;
}
uint bit_permute_step(uint x, uint m, int shift) {
uint t;
t = ((x >> shift) ^ x) & m;
x = (x ^ t) ^ (t << shift);
return x;
}
Generating the re-ordered sequence is easy:
uint i = (1u << k) - 1;
uint max = i << (wordsize - k);
do
{
yield permute(i);
i = nextPermutation(i);
} while (i != max);
yield permute(i); // for max
Where nextPermutation comes from the linked question,
uint nextPermutation(uint v) {
uint t = (v | (v - 1)) + 1;
uint w = t | ((((t & -t) / (v & -v)) >> 1) - 1);
return w;
}
The bit-permutation should be chosen as a random permutation (e.g. take 0..(wordsize-1) and shuffle it) and then converted to butterfly (bfly) masks (I used programming.sirrida.de/calcperm.php), not as randomly generated bfly masks.
I think you can use Heap's algorithm. This algorithm generates all possible permutations of n objects. Just create a simple array and use the algorithm to generate all possible permutations.
This algorithm is not effective if you want to iterate over the binary numbers with binary operations. For binary operations you can use an LFSR.
An LFSR is a simple method for iterating over all numbers. I think you can make some simple modifications to it to generate fixed-length numbers with a given number of zeros.
How about this solution in Python which does permutations?
from itertools import permutations
fixed_length = 4
perms = [''.join(p) for p in permutations('11' + '0' * (fixed_length - 2))]
unique_perms = set(perms)
This would return the numbers as strings, easily convertible with int(num, 2).
As for efficiency, running this took 0.021 milliseconds on my machine.
You can modify the general permutation algorithm to work with binary. Here's an implementation in C++:
#include<iostream>
#include<string>
#include<iostream>
void binaryPermutation(int ones, int digits, std::string current){
if(digits <= 0 && ones <= 0){
std::cout<<current<<std::endl;
}
else if(digits > 0){
if(ones > 0){
binaryPermutation(ones-1, digits-1, current+"1");
}
binaryPermutation(ones, digits-1, current+"0");
}
}
int main()
{
binaryPermutation(2, 4, "");
return 0;
}
This code outputs the following:
1100
1010
1001
0110
0101
0011
You can modify it to store these outputs in a collection or do something other than simply print them.

Most efficient way to evaluate a binary scalar product mod 2

I am currently performing Fourier transforms for some physics problem, and a huge bottleneck of my algorithm comes from the evaluation of a scalar product modulo 2.
For a given integer N, I have to represent all the numbers in binary up to 2^N-1.
For each of these numbers, represented as a binary vector (e.g. 15 = 2^3 + 2^2 + 2^1 + 2^0 = (1,1,1,1,0,...,0)), I have to evaluate its scalar products with all numbers from 0 to 2^N-1 in binary form, modulo 2.
(For example, the scalar product 1·15 = (1,0,0,...,0)·(1,1,1,1,0,...,0) = 1*1 + 1*0 + ... = 1 mod 2.)
Note that the components are kept in binary form during the reduction modulo 2: (1,1)·(1,1) = 1*1 + 1*1, and not 1*1 + 2*2.
This is basically 2^(2N) scalar products that I have to perform and reduce modulo 2.
I am having difficulty getting beyond N = 18.
I was wondering whether some clever mathematical trick can be used to greatly reduce the time spent doing them.
I was thinking of some kind of recursion (i.e. saving results for N in a file and deducing the results for N+1) but I am not sure this would help. Indeed, with this recursion, knowing the results for N, I could cut the vector for N+1 into the N part plus an additional digit, but then at each scalar product, instead of evaluating the scalar product, I would have to tell my computer to go and read a big file (because I probably wouldn't be able to keep it all in dynamic memory), which is probably time-consuming, perhaps more than the ~20 multiplications I have to perform for each of the products.
Is there any known optimized number-theoretical algorithm allowing the evaluation of such a scalar product modulo 2 very quickly ? Are there any rules or ideas I am not aware of that I could exploit ?
Sorry for the terrible formatting; I just can't get LaTeX to work in here.
The sum of the product of corresponding bits, modulo 2, will be equal to the number of 1 bits in the AND of the two numbers, modulo 2.
As you can get the binary representation of a number easily, it might not be necessary to actually create an array of bits for them, but just use the integer data type in your programming language, which allows for at least 32 bits. Many languages offer bit operators, such as a AND (&) and XOR (^).
Counting the 1 bits in a number can be done with the variable-precision SWAR algorithm.
Here is a program in Python that calculates this product modulo 2 for two numbers:
def numberOfSetBits(i):
    i = i - ((i >> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
    return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;

def product(a, b):
    return numberOfSetBits(a & b) % 2
Instead of counting the bits with numberOfSetBits, you could fold the bits together with XORs, first the 16 most significant bits with the 16 least significant bits, then of that result the 8 most significant with the 8 least significant bits, until you have one bit left. Again in Python:
def bitParity(i):
    i = (i >> 16) ^ i
    i = (i >> 8) ^ i
    i = (i >> 4) ^ i
    i = (i >> 2) ^ i
    i = (i >> 1) ^ i
    return i % 2

def product(a, b):
    return bitParity(a & b)
If you change the order that you are evaluating these pairs (a matrix of size 2^n x 2^n), then you can efficiently figure out which products-mod-2 change in each row of your evaluation.
Using Gray code, you can iterate over each value from 0 ... 2^n-1 in a special order where only 1 bit of the outer-loop value changes each time. You can store 1 bit for each value from 0 ... 2^n-1 representing the previous row's product-mod-2 values, and then change it based on whether the changing bit has any effect, which it only does when the corresponding bit in the other (inner loop) number is 1 (if it's 0 then the binary AND will be 0 no matter what the value of the other bit).
In C:
int N = 5;
int max = (1 << N) - 1;
unsigned char* prev = calloc((1 << N) / 8, 1);

// for the first row all the products will be zero, so start at row 1
for(int a = 1; a <= max; a++)
{
    int grey = a ^ (a >> 1); // compute the grey code
    int prev_grey = (a - 1) ^ ((a - 1) >> 1);
    int changed_bit = grey ^ prev_grey;

    for(int b = 0; b <= max; b++)
    {
        // the product will be changed only if b has a 1 at the same place
        // (otherwise it will be 0 regardless)
        if(b & changed_bit)
        {
            prev[b >> 3] ^= (1 << (b & 7));
        }
        int mod = (prev[b >> 3] & (1 << (b & 7))) != 0;
        printf("mod value of %d and %d is %d\n", grey, b, mod);
    }
}
The inner loop can be optimized even more because you can easily figure out which values of b have a non-zero value in the position of the changed bit: for example if it's in position 10 then there will be runs of 1024 in a row of 0 then 1 etc. So you know that you have 1024 values where the product-mod-2 is the same as in the previous row etc. It's not clear to me if this helps you though because I don't know what you are doing with these products.
The inner loop could also be unrolled (e.g. 32 or 64 times) so that you don't read and write to the prev array each time, but rather process blocks of 32 or 64 bits at a time.
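In a language with big integers, the same blocking idea can be pushed further by keeping the entire row in a single integer; this is my own Python sketch of that variant (product_rows is a made-up name):

def product_rows(N):
    # Bit b of `row` holds popcount(gray & b) mod 2 for the current gray value.
    size = 1 << N
    masks = []
    for p in range(N):
        # masks[p] has bit b set exactly when b has bit p set:
        # a repeating pattern of 2^p zeros followed by 2^p ones.
        block = ((1 << (1 << p)) - 1) << (1 << p)
        m = 0
        for j in range(size >> (p + 1)):
            m |= block << (j << (p + 1))
        masks.append(m)
    row = 0                                      # all products are 0 for value 0
    yield 0, row
    for a in range(1, size):
        gray = a ^ (a >> 1)
        changed = gray ^ ((a - 1) ^ ((a - 1) >> 1))
        row ^= masks[changed.bit_length() - 1]   # toggle every b with that bit set
        yield gray, row

Each outer step is then a single XOR on a 2^N-bit integer instead of an inner loop over b.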

Select one number at a time between 0 & 10 billion in random order

Problem
I have a need to pick one unique random number at a time between 0 and 10,000,000,000 and do it till all numbers are selected. Essentially the behavior I need is a pre-built stack/queue with 10 billion numbers in random order, with no ability to push new items into it.
Not so good ways to solve:
There's no shortage of inefficient ways in my brain. Such as,
Persist generated numbers and check whether each newly generated random number has already been used; at some point this gets us into an indefinite wait before a usable number is produced.
Persist all possible numbers in a table and pop a random row and maintain new row count for next pick etc. Not sure if this is good or bad.
Questions:
Are there other deterministic ways besides storing all possible combinations and using random?
Like maintaining windows of available numbers and randomly select a window first and randomly select a number within that window etc. eg: like this
If not, what is the best type to store numbers in reasonably small amount of space?
More than 50% of the numbers won't fit in 32 bits (int), and 64 bits (long) is wasteful, because the largest number fits in 34 bits, wasting 30 bits per number (>37 GB total).
If this problem hasn't been solved already.
What is a good data structure for storing & picking a random spot and quickly adjust the structure for next pick to be fast?
***Sorry for the ambiguity. The largest selectable number is 9,999,999,999 and smallest selectable is 1.
You ask: "Are there other deterministic ways besides storing all possible combinations and using random?"
Yes there is: Encryption. Encryption with a given key guarantees a unique result for unique inputs since it is reversible. Each key defines a one-to-one permutation of the possible inputs. You need an encryption of inputs in the range [1..10e9]. To deal with something that big you need 34 bit numbers, which go up to 17,179,869,183.
There is no standard 34 bit encryption. Depending on how much security you need, and how fast you need the numbers, you can either write your own simple, fast, insecure four-round Feistel Cipher or else for something slower and more secure use Hasty Pudding cipher in 34 bit mode.
With either solution, if the first encryption gives a result outside the range, just encrypt the result again until the new result is within the range you want. The one-to-one property ensures that the final result of the chain of encryptions will be unique.
To generate a sequence of unique random-seeming numbers just encrypt 0, 1, 2, 3, 4, ... in order with the same key. Encryption guarantees that the results will be unique for that key. If you record how far you have got, then you can generate more unique numbers later, up to your 10 billion limit.
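A minimal Python sketch of this approach (my own toy construction, not Hasty Pudding; feistel34 and permuted are made-up names, and the round function is not meant to be secure):

import hashlib

def feistel34(x, key, rounds=4):
    # Balanced Feistel network on 34 bits (two 17-bit halves). For any fixed
    # key this is a bijection on [0, 2**34), because each round is reversible.
    mask17 = (1 << 17) - 1
    left, right = x >> 17, x & mask17
    for r in range(rounds):
        digest = hashlib.sha256(f"{key}:{r}:{right}".encode()).digest()
        left, right = right, left ^ (int.from_bytes(digest[:3], "big") & mask17)
    return (left << 17) | right

def permuted(i, key, limit=9_999_999_999):
    # Cycle-walking: re-encrypt until the value falls back inside [1, limit].
    # Feeding i = 1, 2, 3, ... (each within [1, limit]) yields unique outputs.
    x = feistel34(i, key)
    while not (1 <= x <= limit):
        x = feistel34(x, key)
    return x

To record progress you only need to remember the last counter value i, exactly as described above.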
As mentioned by AChampion in the comments, you could use a Linear Congruential generator.
Your modulo (m) value will be 10 billion. In order to get a full period (all values in the range appear before the series repeats) you need to choose the a and c constants to satisfy certain criteria. m and c need to be relatively prime and a - 1 needs to be divisible by the prime factors of m (which are just 2 and 5) and also by 4 (since 10 billion is divisible by 4).
If you just come up with a single set of constants, you will only have one possible series and the numbers will always occur in the same order. However you can easily randomly generate constants that satisfy the criteria. To test for relative primality of c and m, just test that c is not divisible by 2 or by 5, since these are the only prime factors of m (see the first condition of the coprimality test here).
Simple sketch in Python:
import random
m = 10000000000
a = 0
c = 0
r = 0
def setupLCG():
    global a, c, r
    # choose value of c that is 0 < c < m and relatively prime to m
    c = 5
    while ((c % 5 == 0) or (c % 2 == 0)):
        c = random.randint(1, m - 1)
    # choose value of a that is 0 < a <= m and a - 1 is divisible by
    # prime factors of m, and 4
    a = 4
    while ((((a - 1) % 4) != 0) or (((a - 1) % 5) != 0)):
        a = random.randint(1, m)
    r = random.randint(0, m - 1)

def rand():
    global m, a, c, r
    r = (a*r + c) % m
    return r

random.seed()
setupLCG()
for i in range(1000):
    print(rand() + 1)
This approach won't give the full possibility of 10000000000! possible orderings, but it will still be on the order of 10^19, which is quite a lot. It does have a few other issues (e.g. it alternates even and odd values). You could mix it up a bit by having a small pool of numbers, adding a number from the sequence to it each time and randomly drawing one out, as sketched below.
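A possible shape for that mixing pool (pooled is a made-up name; it wraps any iterator of LCG outputs):

import random

def pooled(source, pool_size=1000):
    # Shuffle buffer: swap each incoming value with a random pool slot and emit
    # the value that was evicted; this breaks up short-range patterns such as
    # the even/odd alternation of the LCG.
    pool = [next(source) for _ in range(pool_size)]
    for value in source:
        j = random.randrange(pool_size)
        pool[j], value = value, pool[j]
        yield value
    random.shuffle(pool)
    yield from pool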
Similar to what rossum has suggested, you can use an invertible integer hash function, which uniquely maps an integer in [0,2^k) to another integer in the same range. For your particular problem, you choose k=34 (2^34 ≈ 17 billion) and reject any number above 10 billion. Here is a complete implementation:
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
uint64_t hash_64(uint64_t key, uint64_t mask)
{
key = (~key + (key << 21)) & mask; // key = (key << 21) - key - 1;
key = key ^ key >> 24;
key = ((key + (key << 3)) + (key << 8)) & mask; // key * 265
key = key ^ key >> 14;
key = ((key + (key << 2)) + (key << 4)) & mask; // key * 21
key = key ^ key >> 28;
key = (key + (key << 31)) & mask;
return key;
}
int main(int argc, char *argv[])
{
uint64_t i, shift, mask, max = 10000ULL;
char *dummy;
if (argc > 1) max = strtol(argv[1], &dummy, 10);
for (shift = 0; 1ULL<<shift <= max; ++shift) {}
mask = (1ULL<<shift) - 1;
for (i = 0; i <= mask; ++i) {
uint64_t x = hash_64(i, mask);
x = hash_64(x, mask);
x = hash_64(x, mask); // apply multiple times to increase randomness
if (x > max || x == 0) continue;
printf("%llu\n", x);
}
return 0;
}
This should give you the numbers in [1, 10000000000] in random order.
The range 1-999,999,999,999 is equivalent to 0-999,999,999,998 (just add 1 at the end). Given the definition of an LCG, you can implement this:
import functools as ft
import itertools as it
import operator as op
from sympy import primefactors, nextprime
def LCG(m, seed=0):
    factors = set(primefactors(m))
    a = ft.reduce(op.mul, factors) + 1
    # Hull-Dobell: if m is divisible by 4, then a - 1 must be divisible by 4
    assert(m % 4 != 0 or (a - 1) % 4 == 0)
    c = nextprime(max(factors) + 1)
    assert(c < m)
    x = seed
    while True:
        x = (a * x + c) % m
        yield x
# Check the first 10,000,000 for duplicates
>>> x = list(it.islice(LCG(999999999999), 10000000))
>>> len(x) == len(set(x))
True
# Last 10 numbers
>>> x[-10:]
[99069910838, 876847698522, 765736597318, 99069940559, 210181061577,
432403293706, 99069970280, 543514424631, 99069990094, 99070000001]
I've taken a couple of shortcuts for the context of this question: the asserts should be replaced with handling code; currently it would just fail if those asserts were False.
I'm not aware of any truly random methods of picking the numbers without storing a list of the numbers already picked. You could do some sort of linear hashing algorithm, and then pass the numbers 0 to n through it (repeating when your hash returns a value above 10000000000), but this wouldn't be truly random.
If you are to store the numbers, you might consider doing it via a bitmask. To pick quickly in the bitmask, you would likely keep a tree, where each leaf would represent the number of free bits in the corresponding 32 bytes, the branches above that would list the number of free bits in the corresponding 2K entries, and so forth. You then have O(log(n)) time to find your next entry, and O(log(n)) time to claim a bit (as you have to update the tree). It would require something on the order of 2n bits to store as well.
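A simplified Python sketch of that tree idea (FreeSet is a made-up name; it keeps a count per leaf rather than per 32-byte chunk, but the O(log n) pick-and-claim is the same):

import random

class FreeSet:
    def __init__(self, n):
        # free[v] = number of still-unclaimed leaves below node v
        self.size = 1
        while self.size < n:
            self.size *= 2
        self.free = [0] * (2 * self.size)
        for i in range(n):
            self.free[self.size + i] = 1
        for v in range(self.size - 1, 0, -1):
            self.free[v] = self.free[2 * v] + self.free[2 * v + 1]

    def pick(self):
        # Choose the k-th remaining number uniformly, then claim it: O(log n).
        k = random.randrange(self.free[1])
        v = 1
        while v < self.size:
            if k < self.free[2 * v]:
                v = 2 * v
            else:
                k -= self.free[2 * v]
                v = 2 * v + 1
        chosen = v - self.size
        while v:
            self.free[v] -= 1
            v //= 2
        return chosen

For example, fs = FreeSet(10) followed by [fs.pick() for _ in range(10)] yields 0..9 in a random order.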
You definitely don't need to store all the numbers.
If you want a perfect set of the numbers from 1 to 10B each exactly once, there are two options that I see: as hinted at by the others, use a 34-bit LCG or Galois LFSR or XOR-shift that generates a sequence of numbers from 1 to 17B or so, then throw out the ones over 10B. I am not aware of any specifically 34-bit functions for this, but I'm sure someone is.
Option 2, if you can spare 1.25 GB of memory, is to create a bitmap that stores only the information that a certain number has been chosen, then use Floyd's Algorithm to get the numbers, which would be fast and give you much better quality numbers (in fact, it would work just fine with hardware RNGs).
Option 3, if you can live with a rare but occasional mistake (duplicate or never-selected number), replace the bitmap with a Bloom filter and save memory.
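For reference, Floyd's algorithm mentioned in option 2 is short; this sketch uses a Python set where the answer suggests a bitmap:

import random

def floyd_sample(n, k):
    # Pick k distinct values from range(n); each k-subset is equally likely.
    chosen = set()
    for j in range(n - k, n):
        t = random.randint(0, j)
        chosen.add(t if t not in chosen else j)
    return chosen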
If predictability is not a concern, you can generate quickly using XOR operations. Suppose you want to generate a random sequence of unique numbers with n bits (34 in your case):
1- take a seed number on n bits. This number, K, can be considered as a seed that you can change each time you run a new experiment.
2- Use a counter from 0 upward
3- Each time XOR the counter with K : next = counter xor K; counter++;
To limit the range to 10 Billion, which is not a power of two, you will need to do rejection.
The obvious drawback is predictability. In step 3, you can first apply a transposition to the bytes of the counter, for example reversing their order (as when converting from little-endian to big-endian). This would make the next number somewhat less predictable.
Finally, I have to admit that this answer can be considered a particular implementation of the encryption approach mentioned in the answer by @rossum, but it's more specific and probably faster.
Incredibly slow but it should work. Completely random
using System;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
namespace ConsoleApplication1
{
class Program
{
static Random random = new Random();
static void Main()
{
const long start = 1;
const long NumData = 10000000000;
const long RandomNess = NumData;
var sz = Marshal.SizeOf(typeof(long));
var numBytes = NumData * sz;
var filePath = Path.GetTempFileName();
using (var stream = new FileStream(filePath, FileMode.Create))
{
// create file with numbers in order
stream.Seek(0, SeekOrigin.Begin);
for (var index = start; index < NumData; index++)
{
var bytes = BitConverter.GetBytes(index);
stream.Write(bytes, 0, sz);
}
for (var iteration = 0L; iteration < RandomNess; iteration++)
{
// get 2 random longs
var item1Index = LongRandom(0, NumData - 1, random);
var item2Index = LongRandom(0, NumData - 1, random);
// allocate room for data
var data1ByteArray = new byte[sz];
var data2ByteArray = new byte[sz];
// read the first value
stream.Seek(item1Index * sz, SeekOrigin.Begin);
stream.Read(data1ByteArray, 0, sz);
// read the second value
stream.Seek(item2Index * sz, SeekOrigin.Begin);
stream.Read(data2ByteArray, 0, sz);
var item1 = BitConverter.ToInt64(data1ByteArray, 0);
var item2 = BitConverter.ToInt64(data2ByteArray, 0);
Debug.Assert(item1 < NumData);
Debug.Assert(item2 < NumData);
// swap the values
stream.Seek(item1Index * sz, SeekOrigin.Begin);
stream.Write(data2ByteArray, 0, sz);
stream.Seek(item2Index * sz, SeekOrigin.Begin);
stream.Write(data1ByteArray, 0, sz);
}
}
File.Delete(filePath);
Console.WriteLine($"{numBytes}");
}
static long LongRandom(long min, long max, Random rand)
{
long result = rand.Next((int)(min >> 32), (int)(max >> 32));
result = (result << 32);
result = result | rand.Next((int)min, (int)max);
return result;
}
}
}

Semi-reversible integer hash (please keep an open mind)

I need to explore the topic of integer hashes for a specific application. I have a few requirements:
integer to integer hash
"semi" reversibility. I know a hash will not be 1-1 reversible, so please try to understand what I have in mind in terms of an n-1 hash. Let's say I have an original domain of numbers 0...n that I hash into a smaller domain 0...k. If I hash using function f(n) = k, then I want something reversible in the sense that I also have an "inverse" g(k) = {n1,n2,n3, ..., nj} are all the possible domain members that hash to k
reasonably even and "randomish" distribution
for my "inverse" function g, I have a tight bound on the size of the set returned, and this size is roughly the same for any given k
fast integer hash
To explain the application a bit here... I am operating in a very memory restricted environment. I intend to not allow collisions. That is, if there is a collision with an existing value in the table, the insert operation just fails. That's ok. I don't need every insert to succeed. I am ready to make that trade off in favor of space and speed. Now the key thing is this, when I store the value in the table I need to absolutely minimize the number of bits represented. What I am hoping for is basically:
If I hash to value k, I can immediately narrow down what I store to a small subset of the original domain. If the hash is "semi reversible" and if I can enumerate all the possible domain elements hashing to k, then I can order them and assign the ordinal to each possibility. Then I would like to store that much smaller ordinal rather than the original value which will require hopefully many fewer bits. Then I should be able to fully reverse this by enumerating to the ith possibility for stored ordinal i.
The tight bound on the size of the inverse set g(k) is important because I need to know how many bits to allocate for each ordinal, and I want to keep things relatively simple by allocating the same number of bits to each table entry. Yes, I will probably be working with smaller-than-a-byte values. The original domain will be of a relatively small size to start with.
I'm interested in any of your thoughts and any examples anyone might have reference to. I think this should be doable, but I would like to get an idea of the range of possible solutions.
Thanks in advance!
Mark
Shuffle for desired distribution
Apply some bijection in the 0..(n-1) domain to shuffle things a bit. This would be particularly easy if n were a prime number, since in that case you could treat modulo arithmetic as a field, and perform all kinds of nice mathematical functions. One thing which might distribute numbers evenly enough for your needs might be multiplication by a fixed number c, followed by modulo:
a ↦ (c*a) mod n
You'll have to choose c such that it is coprime to n, i.e. gcd(c,n)=1. If n is a prime number, then this is trivial as long as c≠0, and if n were a power of two, then any odd number would still suffice. This coprimality condition ensures the existence of another number d which is the inverse of c, i.e. it satisfies c*d ≡ 1 (mod n) so that multiplication by d will undo the effect of multiplication by c. You might e.g. use BigInteger.modInverse in Java or Wolfram Alpha to compute this number.
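As a quick Python illustration of the c/d relationship (using the same n = 1024, c = 25, d = 41 that appear in the explicit example further down):

n, c = 1024, 25
d = pow(c, -1, n)   # modular inverse of c mod n -> 41 (Python 3.8+)

def shuffle(a):
    return (c * a) % n

def unshuffle(a):
    return (d * a) % n

assert all(unshuffle(shuffle(a)) == a for a in range(n))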
If your n is a power of two, then you can avoid the modulo operation (and the time that would take), and instead do simple bit mask operations. But even for other values of n, you can sometimes come up with schemes that avoid a generic division operation. When you choose c (and d with it), you can do so in a way that both c and d have only few non-zero bits. Then multiplication can likely be expressed in terms of bit shifts and additions. Your optimizing compiler should take care of that for you, as long as you make sure these numbers are compile-time constants.
Here is an example which makes this optimization explicit. Note that writing code this way should not be necessary: usually it should be enough to write things like (25*a)&1023.
// n = 1024
// c = 25 = 16+8+1
// d = 41 = 32+8+1
static unsigned shuffle(unsigned a) {
    return (a + (a << 3) + (a << 4)) & 1023;
}
static unsigned unshuffle(unsigned a) {
    return (a + (a << 3) + (a << 5)) & 1023;
}
Another shuffling approach which would work for the case that n is a power of two is using some combinations of bit shifts, masks and xors to modify the value. This could be combined with the above multiplication approach, either doing bit twiddling before or after the multiplication, or even both. Making a choice depends very much on the actual distribution of values.
Split and store
The resulting value, still in the range 0..(n-1), can be split into two values: one part which is in the range 0..(k-1) and will be called lo, and another in the range 0..(ceil(n/k)-1) which I'll call hi.
lo = a mod k
hi = floor(a/k)
If k is a power of two, you can obtain lo using a bit mask, and hi using a bit shift. You could then use hi to denote a hash bucket, and lo to signify a value to store in that bucket. All values with the same hi value would collide, but their lo part would help retrieving the value actually stored.
If you want to recognize unoccupied slots of your hash map, then you should ensure that one specific lo value (e.g. zero) will be reserved for this purpose in every slot. If you cannot achieve this reservation in the original set of values, then you might want to choose k as a power of two minus one, so that you can store the value of k itself to denote empty cells. Or you could swap the meaning of hi and lo, such that you could tune the value of n to leave out some values. I'll use this in the example below.
Inversion
To invert this whole thing, you take the key hi and the stored value lo, combine them to a value a=k*hi+lo in the range 0..(n-1), then undo the initial shuffling to go back to your original value.
Example
This example is geared to avoid all multiplication and division. It distributes n=4032 values over k=64 slots, with n/k=63 different values plus one special empty value possible for each slot. It does shuffling using c=577 and d=1153.
unsigned char bitseq[50] = { 0 };
int store(unsigned a) {
unsigned b, lo, hi, bitpos, byteno, cur;
assert(a < 4032); // a has range 0 .. 0xfbf
// shuffle
b = (a << 9) + (a << 6) + a + 64; // range 0x40 ..0x237dbf
b = (b & 0xfff) + ((b & 0xfff000) >> 6); // range 0x40 .. 0x9d7f
b = (b & 0xfff) + ((b & 0xfff000) >> 6); // range 0x40 .. 0x11ff
b = (b & 0xfff) + ((b & 0xfff000) >> 6); // range 0x40 .. 0xfff
b -= 64; // range 0x00 .. 0xfbf
// split
lo = b & 63; // range 0x00 .. 0x3f
hi = b >> 6; // range 0x00 .. 0x3e
// access bit sequence
bitpos = (lo << 2) + (lo << 1); // range 0x00 .. 0x17a
byteno = (bitpos >> 3); // range 0x00 .. 0x30
bitpos &= 7; // range 0x00 .. 0x7
cur = (((bitseq[byteno + 1] << 8) | bitseq[byteno]) >> bitpos) & 0xff;
if (cur != 0) return 1; // slot already occupied.
cur = hi + 1; // range 0x01 .. 0x3f means occupied
bitseq[byteno] |= (cur << bitpos) & 0xff;
bitseq[byteno + 1] |= ((cur << bitpos) & 0xff00) >> 8;
return 0; // slot was free, value stored
}
void list_all() {
unsigned b, lo, hi, bitpos, byteno, cur;
for (lo = 0; lo != 64; ++lo) {
// access bit sequence
bitpos = (lo << 2) + (lo << 1);
byteno = (bitpos >> 3);
bitpos &= 7;
cur = (((bitseq[byteno + 1] << 8) | bitseq[byteno]) >> bitpos) & 0x3f;
if (cur == 0) continue;
// recombine
hi = cur - 1;
b = (hi << 6) | lo;
// unshuffle
b = (b << 10) + (b << 7) + b + 64;
b = (b & 0xfff) + ((b & 0xfff000) >> 6);
b = (b & 0xfff) + ((b & 0xfff000) >> 6);
b = (b & 0xfff) + ((b & 0xfff000) >> 6);
b -= 64;
// report
printf("%4d was stored in slot %2d using value %2d.\n", b, lo, cur);
}
}
As you can see, it is possible to avoid all multiplication and division operations, and all explicit modulo calls as well. Whether the resulting code has more performance than one using a single modulo call per invocation remains to be tested. The fact that you need up to three reduction steps to avoid a single modulo makes this rather costly.
You can watch a demo run of the above code.
There is no such thing as a free lunch.
If you have an even distribution then g(k1) will have n/k values for each k1. So you end up having to store k*n/k or n values, which happens to be the same number you started with.
You should probably be looking for compression algorithms rather than hash functions. It will improve your Google karma.
That said, it is hard to suggest a compression algorithm without knowing the distribution of numbers. If it is truly random, then it will be hard to compress.

Algorithm for sampling without replacement?

I am trying to test the likelihood that a particular clustering of data has occurred by chance. A robust way to do this is Monte Carlo simulation, in which the associations between data and groups are randomly reassigned a large number of times (e.g. 10,000), and a metric of clustering is used to compare the actual data with the simulations to determine a p value.
I've got most of this working, with pointers mapping the grouping to the data elements, so I plan to randomly reassign pointers to data. THE QUESTION: what is a fast way to sample without replacement, so that every pointer is randomly reassigned in the replicate data sets?
For example (these data are just a simplified example):
Data (n=12 values) - Group A: 0.1, 0.2, 0.4 / Group B: 0.5, 0.6, 0.8 / Group C: 0.4, 0.5 / Group D: 0.2, 0.2, 0.3, 0.5
For each replicate data set, I would have the same cluster sizes (A=3, B=3, C=2, D=4) and data values, but would reassign the values to the clusters.
To do this, I could generate random numbers in the range 1-12, assign the first element of group A, then generate random numbers in the range 1-11 and assign the second element in group A, and so on. The pointer reassignment is fast, and I will have pre-allocated all data structures, but the sampling without replacement seems like a problem that might have been solved many times before.
Logic or pseudocode preferred.
Here's some code for sampling without replacement, based on Algorithm 3.4.2S of Knuth's book Seminumerical Algorithms.
void SampleWithoutReplacement
(
    int populationSize,    // size of set sampling from
    int sampleSize,        // size of each sample
    vector<int> & samples  // output, zero-offset indices to selected items
)
{
    // Use Knuth's variable names
    int& n = sampleSize;
    int& N = populationSize;

    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far
    double u;

    while (m < n)
    {
        u = GetUniform(); // call a uniform(0,1) random number generator

        if ( (N - t)*u >= n - m )
        {
            t++;
        }
        else
        {
            samples[m] = t;
            t++; m++;
        }
    }
}
There is a more efficient but more complex method by Jeffrey Scott Vitter in "An Efficient Algorithm for Sequential Random Sampling," ACM Transactions on Mathematical Software, 13(1), March 1987, 58-67.
Working C++ code based on the answer by John D. Cook.
#include <random>
#include <vector>
// John D. Cook, https://stackoverflow.com/a/311716/15485
void SampleWithoutReplacement
(
    int populationSize,         // size of set sampling from
    int sampleSize,             // size of each sample
    std::vector<int> & samples  // output, zero-offset indices to selected items
)
{
    // Use Knuth's variable names
    int& n = sampleSize;
    int& N = populationSize;

    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far

    std::default_random_engine re;
    std::uniform_real_distribution<double> dist(0,1);

    while (m < n)
    {
        double u = dist(re); // call a uniform(0,1) random number generator

        if ( (N - t)*u >= n - m )
        {
            t++;
        }
        else
        {
            samples[m] = t;
            t++; m++;
        }
    }
}

#include <iostream>

int main(int,char**)
{
    const size_t sz = 10;
    std::vector< int > samples(sz);
    SampleWithoutReplacement(10*sz, sz, samples);
    for (size_t i = 0; i < sz; i++ ) {
        std::cout << samples[i] << "\t";
    }
    return 0;
}
See my answer to this question Unique (non-repeating) random numbers in O(1)?. The same logic should accomplish what you are looking to do.
Inspired by @John D. Cook's answer, I wrote an implementation in Nim. At first I had difficulties understanding how it works, so I commented it extensively, also including an example. Maybe it helps to understand the idea. Also, I have changed the variable names slightly.
iterator uniqueRandomValuesBelow*(N, M: int) =
  ## Returns a total of M unique random values i with 0 <= i < N
  ## These indices can be used to construct e.g. a random sample without replacement
  assert(M <= N)

  var t = 0 # total input records dealt with
  var m = 0 # number of items selected so far

  while (m < M):
    let u = random(1.0) # call a uniform(0,1) random number generator

    # meaning of the following terms:
    # (N - t) is the total number of remaining draws left (initially just N)
    # (M - m) is the number how many of these remaining draw must be positive (initially just M)
    # => Probability for next draw = (M-m) / (N-t)
    #    i.e.: (required positive draws left) / (total draw left)
    #
    # This is implemented by the inequality expression below:
    # - the larger (M-m), the larger the probability of a positive draw
    # - for (N-t) == (M-m), the term on the left is always smaller => we will draw 100%
    # - for (N-t) >> (M-m), we must get a very small u
    #
    # example: (N-t) = 7, (M-m) = 5
    # => we draw the next with prob 5/7
    #    lets assume the draw fails
    # => t += 1 => (N-t) = 6
    # => we draw the next with prob 5/6
    #    lets assume the draw succeeds
    # => t += 1, m += 1 => (N-t) = 5, (M-m) = 4
    # => we draw the next with prob 4/5
    #    lets assume the draw fails
    # => t += 1 => (N-t) = 4
    # => we draw the next with prob 4/4, i.e.,
    #    we will draw with certainty from now on
    #    (in the next steps we get prob 3/3, 2/2, ...)
    if (N - t)*u >= (M - m).toFloat: # this is essentially a draw with P = (M-m) / (N-t)
      # no draw -- happens mainly for (N-t) >> (M-m) and/or high u
      t += 1
    else:
      # draw t -- happens when (M-m) gets large and/or low u
      yield t # this is where we output an index, can be used to sample
      t += 1
      m += 1

# example use
for i in uniqueRandomValuesBelow(100, 5):
  echo i
When the population size is much greater than the sample size, the above algorithms become inefficient, since they have complexity O(n), n being the population size.
When I was a student I wrote some algorithms for uniform sampling without replacement, which have average complexity O(s log s), where s is the sample size. Here is the code for the binary tree algorithm, with average complexity O(s log s), in R:
# The Tree growing algorithm for uniform sampling without replacement
# by Pavel Ruzankin
quicksample = function (n,size)
# n - the number of items to choose from
# size - the sample size
{
s=as.integer(size)
if (s>n) {
stop("Sample size is greater than the number of items to choose from")
}
# upv=integer(s) #level up edge is pointing to
leftv=integer(s) #left edge is poiting to; must be filled with zeros
rightv=integer(s) #right edge is pointig to; must be filled with zeros
samp=integer(s) #the sample
ordn=integer(s) #relative ordinal number
ordn[1L]=1L #initial value for the root vertex
samp[1L]=sample(n,1L)
if (s > 1L) for (j in 2L:s) {
curn=sample(n-j+1L,1L) #current number sampled
curordn=0L #currend ordinal number
v=1L #current vertice
from=1L #how have come here: 0 - by left edge, 1 - by right edge
repeat {
curordn=curordn+ordn[v]
if (curn+curordn>samp[v]) { #going down by the right edge
if (from == 0L) {
ordn[v]=ordn[v]-1L
}
if (rightv[v]!=0L) {
v=rightv[v]
from=1L
} else { #creating a new vertex
samp[j]=curn+curordn
ordn[j]=1L
# upv[j]=v
rightv[v]=j
break
}
} else { #going down by the left edge
if (from==1L) {
ordn[v]=ordn[v]+1L
}
if (leftv[v]!=0L) {
v=leftv[v]
from=0L
} else { #creating a new vertex
samp[j]=curn+curordn-1L
ordn[j]=-1L
# upv[j]=v
leftv[v]=j
break
}
}
}
}
return(samp)
}
The complexity of this algorithm is discussed in:
Rouzankin, P. S.; Voytishek, A. V. On the cost of algorithms for random selection. Monte Carlo Methods Appl. 5 (1999), no. 1, 39-54.
http://dx.doi.org/10.1515/mcma.1999.5.1.39
If you find the algorithm useful, please make a reference.
See also:
P. Gupta, G. P. Bhattacharjee. (1984) An efficient algorithm for random sampling without replacement. International Journal of Computer Mathematics 16:4, pages 201-209.
DOI: 10.1080/00207168408803438
Teuhola, J. and Nevalainen, O. 1982. Two efficient algorithms for random sampling without replacement. /IJCM/, 11(2): 127–140.
DOI: 10.1080/00207168208803304
In the last paper the authors use hash tables and claim that their algorithms have O(s) complexity. There is one more fast hash table algorithm, which will soon be implemented in pqR (pretty quick R):
https://stat.ethz.ch/pipermail/r-devel/2017-October/075012.html
I wrote a survey of algorithms for sampling without replacement. I may be biased but I recommend my own algorithm, implemented in C++ below, as providing the best performance for many k, n values and acceptable performance for others. randbelow(i) is assumed to return a fairly chosen random non-negative integer less than i.
void cardchoose(uint32_t n, uint32_t k, uint32_t* result) {
    auto t = n - k + 1;
    for (uint32_t i = 0; i < k; i++) {
        uint32_t r = randbelow(t + i);
        if (r < t) {
            result[i] = r;
        } else {
            result[i] = result[r - t];
        }
    }
    std::sort(result, result + k);
    for (uint32_t i = 0; i < k; i++) {
        result[i] += i;
    }
}
Another algorithm for sampling without replacement is described here.
It is similar to the one described by John D. Cook in his answer, also from Knuth, but it has a different hypothesis: the population size is unknown, but the sample can fit in memory. This one is called "Knuth's algorithm S".
Quoting the rosettacode article:
1. Select the first n items as the sample as they become available;
2. For the i-th item where i > n, have a random chance of n/i of keeping it. If failing this chance, the sample remains the same. If not, have it randomly (1/n) replace one of the previously selected n items of the sample.
3. Repeat #2 for any subsequent items.
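A small Python sketch of those steps (algorithm_s is a made-up name; items can be any iterable of unknown length):

import random

def algorithm_s(items, n):
    # Reservoir sampling: maintain a uniform sample of size n from a stream.
    sample = []
    for i, item in enumerate(items, start=1):
        if i <= n:
            sample.append(item)                  # take the first n items
        elif random.random() < n / i:            # keep the i-th with chance n/i
            sample[random.randrange(n)] = item   # ...replacing a random slot
    return sample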
