Understanding branch prediction efficiency - performance

To measure the cost of branch prediction, I created a little program.
It creates a small buffer on the stack and fills it with random 0/1 values. I can set the size of the buffer with N: it holds 1<<N bytes. The code then repeatedly branches on the same 1<<N random numbers.
Now, I expected that if 1<<N is sufficiently large (say, >100), the branch predictor would not be effective (as it would have to predict >100 random outcomes). However, these are the results (on a 5820K machine): as N grows, the program becomes slower:
N time
=========
8 2.2
9 2.2
10 2.2
11 2.2
12 2.3
13 4.6
14 9.5
15 11.6
16 12.7
20 12.9
For reference, if the buffer is initialized with zeros (using the commented-out init), the time is more or less constant; it varies between 1.5 and 1.7 for N = 8..16.
My question is: can the branch predictor really be effective at predicting such a large amount of random numbers? If not, then what's going on here?
(Some more explanation: the code executes 2^32 branches, regardless of N. So I expected the code to run at the same speed regardless of N, because the branches cannot be predicted at all. But it seems that if the buffer size is less than 4096 (N<=12), something makes the code fast. Can branch prediction be effective for 4096 random numbers?)
Here's the code:
#include <cstdint>
#include <iostream>
volatile uint64_t init[2] = { 314159165, 27182818 };
// volatile uint64_t init[2] = { 0, 0 };
volatile uint64_t one = 1;
uint64_t next(uint64_t s[2]) {
    uint64_t s1 = s[0];
    uint64_t s0 = s[1];
    uint64_t result = s0 + s1;
    s[0] = s0;
    s1 ^= s1 << 23;
    s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
    return result;
}
int main() {
    uint64_t s[2];
    s[0] = init[0];
    s[1] = init[1];
    uint64_t sum = 0;
#if 1
    const int N = 16;
    unsigned char buffer[1<<N];
    for (int i=0; i<1<<N; i++) buffer[i] = next(s)&1;
    for (uint64_t i=0; i<uint64_t(1)<<(32-N); i++) {
        for (int j=0; j<1<<N; j++) {
            if (buffer[j]) {
                sum += one;
            }
        }
    }
#else
    for (uint64_t i=0; i<uint64_t(1)<<32; i++) {
        if (next(s)&1) {
            sum += one;
        }
    }
#endif
    std::cout<<sum<<"\n";
}
(The code contains a non-buffered version as well; use #if 0. It runs at around the same speed as the buffered version with N=16.)
Here's the inner-loop disassembly (compiled with clang; it generates the same code for all N between 8 and 16, only the loop count differs; clang unrolled the loop twice):
401270: 80 3c 0c 00 cmp BYTE PTR [rsp+rcx*1],0x0
401274: 74 07 je 40127d <main+0xad>
401276: 48 03 35 e3 2d 00 00 add rsi,QWORD PTR [rip+0x2de3] # 404060 <one>
40127d: 80 7c 0c 01 00 cmp BYTE PTR [rsp+rcx*1+0x1],0x0
401282: 74 07 je 40128b <main+0xbb>
401284: 48 03 35 d5 2d 00 00 add rsi,QWORD PTR [rip+0x2dd5] # 404060 <one>
40128b: 48 83 c1 02 add rcx,0x2
40128f: 48 81 f9 00 00 01 00 cmp rcx,0x10000
401296: 75 d8 jne 401270 <main+0xa0>

Branch prediction can indeed be that effective. As Peter Cordes suggested, I checked branch-misses with perf stat. Here are the results:
N time cycles branch-misses (%) approx-time
===============================================================
8 2.2 9,084,889,375 34,806 ( 0.00) 2.2
9 2.2 9,212,112,830 39,725 ( 0.00) 2.2
10 2.2 9,264,903,090 2,394,253 ( 0.06) 2.2
11 2.2 9,415,103,000 8,102,360 ( 0.19) 2.2
12 2.3 9,876,827,586 27,169,271 ( 0.63) 2.3
13 4.6 19,572,398,825 486,814,972 (11.33) 4.6
14 9.5 39,813,380,461 1,473,662,853 (34.31) 9.5
15 11.6 49,079,798,916 1,915,930,302 (44.61) 11.7
16 12.7 53,216,900,532 2,113,177,105 (49.20) 12.7
20 12.9 54,317,444,104 2,149,928,923 (50.06) 12.9
Note: branch-misses (%) is calculated for 2^32 branches
As you can see, when N<=12, the branch predictor can predict most of the branches (which is surprising: the branch predictor can memorize the outcome of 4096 consecutive random branches!). When N>12, branch-misses start to grow. At N>=16, it only predicts ~50% correctly, which means it is as effective as random coin flips.
The time taken can be approximated by looking at the time and branch-misses (%) columns: I've added the last column, approx-time, calculated as 2.2 + (12.9-2.2) * branch-misses% / 100. As you can see, approx-time equals time (up to rounding error). So this effect is explained perfectly by branch prediction.
The original intent was to calculate how many cycles a branch miss costs (in this particular case; for other cases this number can differ):
(54,317,444,104-9,084,889,375)/(2,149,928,923-34,806) = 21.039 = ~21 cycles.
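For anyone who wants to redo the arithmetic, a trivial sketch with the numbers from the table hard-coded (my own illustration, nothing more):
#include <cstdio>

int main() {
    // cycles and branch-misses for N=20 and N=8, taken from the perf table above
    double cycles_n20 = 54317444104.0, cycles_n8 = 9084889375.0;
    double misses_n20 = 2149928923.0,  misses_n8 = 34806.0;
    double cost = (cycles_n20 - cycles_n8) / (misses_n20 - misses_n8);
    std::printf("estimated cost per branch miss: %.3f cycles\n", cost); // ~21.039
}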


Compressing a vector of positive integers (int32) that have a specific order

I'm trying to compress long vectors (their size ranges from 1 to 100 million elements). The vectors contain positive integers with values ranging from 0 up to 1 or 100 million (depending on the vector size). Hence, I'm using 32-bit integers to accommodate the large numbers, but that consumes too much storage.
The vectors have the following characteristic features:
All values are positive integers. Their range grows as the vector size grows.
Values are increasing but smaller numbers do intervene frequently (see the figure below).
None of the values before a specific index are larger than that index (Index starts at zero). For instance, none of the values that occur before the index of 6 are larger than 6. However, smaller values may repeat after that index. This holds true for the entire array.
I'm usually dealing with very long arrays. Hence, as the array length passes 1 million elements, the upcoming numbers are mostly large numbers mixed with previously occurring numbers. Smaller numbers usually re-occur more often than larger numbers. New, larger numbers are added to the array as you pass through it.
Here is a sample of the values in the array: {initial padding..., 0, 1, 2, 3, 4, 5, 6, 4, 7, 4, 8, 9, 1, 10, ... later..., 1110, 11, 1597, 1545, 1392, 326, 1371, 1788, 541,...}
Here is a plot of a part of the vector:
What do I want?
Because I'm using 32-bit integers, a lot of memory is wasted, since smaller numbers that could be represented with fewer than 32 bits repeat as well. I want to compress this vector as much as possible to save memory (ideally by a factor of 3, because only a reduction by that amount or more will meet our needs!). What is the best compression algorithm to achieve that? Or is there a way to take advantage of the array's characteristic features described above to reversibly convert its numbers to 8-bit integers?
Things that I have tried or considered:
Delta encoding: This doesn't work here because the vector is not always increasing.
Huffman coding: Does not seem to help here, since the range of unique numbers in the array is quite large, so the encoding table would be a large overhead.
Using variable int encoding, i.e. using 8-bit integers for smaller numbers and 16-bit for larger ones, etc. This reduced the vector size to about 0.7 of its original size (not satisfactory, since it doesn't take advantage of the specific characteristics described above).
I'm not quite sure if this method described in the following link is applicable to my data: http://ygdes.com/ddj-3r/ddj-3r_compact.html
I don't quite understand the method but it gives me the encouragement to try similar things because I think there is some order in the data that can be taken to its advantage.
For example, I tried to reassign any number (n) larger than 255 to n-255, so that I could keep the integers in the 8-bit realm, because I know that no number is larger than 255 before that index. However, I'm not able to distinguish the reassigned numbers from the repeated numbers... so this idea doesn't work unless I do some more tricks to reverse the reassignments...
Here is the link to the first 24000 elements of the data for those interested:
data
Any advice or suggestions are deeply appreciated. Thanks a lot in advance.
Edit1:
Here is a plot of the data after delta encoding. As you can see, it doesn't reduce the range!
Edit2:
I was hoping that I could find a pattern in the data that allows me to reversibly change the 32-bit vector to a single 8-bit vector but this seems very unlikely.
I have tried to decompose the 32-bit vector to 4 x 8-bit vectors, hoping that the decomposed vectors lend themselves to compression better.
Below are plots for the 4 vectors. Now their ranges are from 0-255.
What I did was to recursively divide each element in the vector by 255 and store the remainder into another vector. To reconstruct the original array all I need to do is: ( ( (vec4*255) + vec3 )*255 + vec2 ) *255 + vec1...
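For illustration, a minimal sketch of that decomposition and reconstruction (my own code, written from the description above; it assumes every value fits in four base-255 digits, which holds for the value range described here):
#include <cstdint>
#include <vector>

// Split each value into four base-255 "digits"; vec[0] gets the remainders
// (i.e. vec1 above), vec[3] the most significant digits (vec4 above).
void decompose(const std::vector<uint32_t>& in, std::vector<uint8_t> vec[4]) {
    for (uint32_t x : in)
        for (int d = 0; d < 4; d++) {
            vec[d].push_back((uint8_t)(x % 255));
            x /= 255;
        }
}

// Reconstruct one element: (((vec4*255) + vec3)*255 + vec2)*255 + vec1
uint32_t reconstruct(uint8_t v1, uint8_t v2, uint8_t v3, uint8_t v4) {
    return (((uint32_t)v4 * 255 + v3) * 255 + v2) * 255 + v1;
}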
As you can see, the last vector is all zeros for the currently shown length of the data; in fact, it should be zeros all the way to the 2^24th element. This would be a 25% reduction if my total vector length were less than 16 million elements, but since I'm dealing with much longer vectors, it has a much smaller impact.
More importantly, the third vector seems also to have some compressible features as its values do increase by 1 after each 65,535 steps.
It does seem that now I can benefit from Huffman coding or variable-bit encoding as suggested. Any suggestions that allow me to maximally compress this data are deeply appreciated.
Here I attached a bigger sample of the data if anyone is interested:
https://drive.google.com/file/d/10wO3-1j3NkQbaKTcr0nl55bOH9P-G1Uu/view?usp=sharing
Edit3:
I'm really thankful for all the given answers. I've learnt a lot from them. For those of you who are interested to tinker with a larger set of the data the following link has 11 million elements of a similar dataset (zipped 33MB)
https://drive.google.com/file/d/1Aohfu6II6OdN-CqnDll7DeHPgEDLMPjP/view
Once you unzip the data, you can use the following C++ snippet to read the data into a vector<int32_t>
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <vector>
const char* path = "path_to\\compression_int32.txt";
std::vector<int32_t> newVector{};
std::ifstream ifs(path, std::ios::in | std::ifstream::binary);
std::istream_iterator<int32_t> iter{ ifs };
std::istream_iterator<int32_t> end{};
std::copy(iter, end, std::back_inserter(newVector));
It's easy to get better than a factor of two compression on your example data by using property 3, where I have taken property 3 to mean that every value must be less than its index, with the indices starting at 1. Simply use ceiling(log2(i)) bits to store the number at index i (where i starts at 1). For your first example with 24,977 values, that compresses it to 43% of the size of the vector of 32-bit integers.
The number of bits required depends only on the length of the vector, n. The number of bits is:
1 - 2^ceiling(log2(n)) + n * ceiling(log2(n))
As noted by Falk Hüffner, a simpler approach would be a fixed number of bits for all values, namely ceiling(log2(n)). A variable number of bits will always be less than that, but not much less for large n.
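A small sketch of both counts for the first sample (my own illustration of the formulas above, not the answer's actual code):
#include <cstdint>
#include <cstdio>

// Smallest c such that 2^c >= v (v >= 1), i.e. ceiling(log2(v)).
static unsigned ceil_log2(uint64_t v) {
    unsigned c = 0;
    while (((uint64_t)1 << c) < v) ++c;
    return c;
}

int main() {
    uint64_t n = 24977;                    // length of the first sample
    uint64_t varbits = 0;
    for (uint64_t i = 1; i <= n; i++)      // ceiling(log2(i)) bits for index i
        varbits += ceil_log2(i);
    unsigned c = ceil_log2(n);
    // closed form: 1 - 2^ceiling(log2(n)) + n * ceiling(log2(n))
    uint64_t closed = n * c - (((uint64_t)1 << c) - 1);
    uint64_t fixedbits = n * c;            // the fixed-width alternative
    std::printf("variable: %llu bits (closed form %llu), fixed: %llu bits\n",
                (unsigned long long)varbits, (unsigned long long)closed,
                (unsigned long long)fixedbits);
    std::printf("ratio vs 32 bits/value: %.1f%%\n", 100.0 * varbits / (32.0 * n));
}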
If it is common to have a run of zeros at the start, then compress those with a count. There are only a handful of runs of two or three numbers in the remainder, so run-length encoding won't help except for that initial run of zeros.
Another 2% or so (for large sets) could be shaved off using an arithmetic coding approach, considering each value at index k (indices starting at zero) to be a base k+1 digit of a very large integer. That would take ceiling(log2(n!)) bits.
Here is a plot of the compression ratios of the arithmetic coding, variable bits per sample coding, and fixed bits per sample coding, all ratioed to a representation with 32 bits for every sample (the sequence length is on a log scale):
The arithmetic approach requires multiplication and division on integers the length of the compressed data, which is monumentally slow for large vectors. The code below limits the size of those integers to 64 bits, at some cost to the compression ratio, in exchange for being very fast. The resulting compression ratios are about 0.2% to 0.7% higher than the arithmetic coding in the plot above, but still well below variable bits. The data vector must have the property that each value is non-negative and that each value is less than its position (positions starting at one). The compression effectiveness depends only on that property, plus a small reduction if there is an initial run of zeros. There appears to be a bit more redundancy in the provided examples that this compression approach does not exploit.
#include <cmath>
#include <cstdint>
#include <vector>
// Append val, as a variable-length integer, to comp. val must be non-negative.
template <typename T>
void write_varint(T val, std::vector<uint8_t>& comp) {
while (val > 0x7f) {
comp.push_back(val & 0x7f);
val >>= 7;
}
comp.push_back(val | 0x80);
}
// Return the variable-length integer at offset off in comp, updating off to
// point after the integer.
template <typename T>
T read_varint(std::vector<uint8_t> const& comp, size_t& off) {
T val = 0, next;
int shift = 0;
for (;;) {
next = comp.at(off++);
if (next > 0x7f)
break;
val |= next << shift;
shift += 7;
}
val |= (next & 0x7f) << shift;
return val;
}
// Given the starting index i >= 1, find the optimal number of values to code
// into 64 bits or less, or up through index n-1, whichever comes first.
// Optimal is defined as the least amount of entropy lost by representing the
// group in an integral number of bits, divided by the number of bits. Return
// the optimal number of values in num, and the number of bits needed to hold
// an integer representing that group in len.
static void group_ar64(size_t i, size_t n, size_t& num, int& len) {
// Analyze all of the permitted groups, starting at index i.
double min = 1.;
uint64_t k = 1; // integer range is 0..k-1
auto j = i + 1;
do {
k *= j;
auto e = log2(k); // entropy of k possible integers
int b = ceil(e); // number of bits to hold 0..k-1
auto loss = (b - e) / b; // unused entropy per bit
if (loss < min) {
num = j - i; // best number of values so far
len = b; // bit length for that number
if (loss == 0.)
break; // not going to get any better
min = loss;
}
} while (j < n && k <= (uint64_t)-1 / ++j);
}
// Compress the data arithmetically coded as an incrementing base integer, but
// with a 64-bit limit on each integer. This puts values into groups that each
// fit in 64 bits, with the least amount of wasted entropy. Also compress the
// initial run of zeros into a count.
template <typename T>
std::vector<uint8_t> compress_ar64(std::vector<T> const& data) {
// Resulting compressed data vector.
std::vector<uint8_t> comp;
// Start with number of values to make the stream self-terminating.
write_varint(data.size(), comp);
if (data.size() == 0)
return comp;
// Run-length code the initial run of zeros. Write the number of contiguous
// zeros after the first one.
size_t i = 1;
while (i < data.size() && data[i] == 0)
i++;
write_varint(i - 1, comp);
// Compress the data into variable-base integers starting at index i, where
// each integer fits into 64 bits.
unsigned buf = 0; // output bit buffer
int bits = 0; // number of bits in buf (0..7)
while (i < data.size()) {
// Find the optimal number of values to code, starting at index i.
size_t num; int len;
group_ar64(i, data.size(), num, len);
// Code num values.
uint64_t code = 0;
size_t k = 1;
do {
code += k * data[i++];
k *= i;
} while (--num);
// Write code using len bits.
if (bits) {
comp.push_back(buf | (code << bits));
code >>= 8 - bits;
len -= 8 - bits;
}
while (len > 7) {
comp.push_back(code);
code >>= 8;
len -= 8;
}
buf = code;
bits = len;
}
if (bits)
comp.push_back(buf);
return comp;
}
// Decompress the result of compress_ar64(), returning the original values.
// Start decompression at offset off in comp. When done, off is updated to
// point just after the compressed data.
template <typename T>
std::vector<T> expand_ar64(std::vector<uint8_t> const& comp, size_t& off) {
// Will contain the uncompressed data to return.
std::vector<T> data;
// Get the number of values.
auto vals = read_varint<size_t>(comp, off);
if (vals == 0)
return data;
// Get the number of zeros after the first one, and write all of them.
auto run = read_varint<size_t>(comp, off) + 1;
auto i = run;
do {
data.push_back(0);
} while (--run);
// Extract the values from the compressed data starting at index i.
unsigned buf = 0; // input bit buffer
int bits = 0; // number of bits in buf (0..7)
while (i < vals) {
// Find the optimal number of values to code, starting at index i. This
// simply repeats the same calculation that was done when compressing.
size_t num; int len;
group_ar64(i, vals, num, len);
// Read len bits into code.
uint64_t code = buf;
while (bits + 8 < len) {
code |= (uint64_t)comp.at(off++) << bits;
bits += 8;
}
len -= bits; // bits to pull from last byte (1..8)
uint64_t last = comp.at(off++); // last byte
code |= (last & ((1 << len) - 1)) << bits;
buf = last >> len; // save remaining bits in buffer
bits = 8 - len;
// Extract num values from code.
do {
i++;
data.push_back(code % i);
code /= i;
} while (--num);
}
// Return the uncompressed data.
return data;
}
Solving every compression problem should begin with an analysis.
I looked at the raw data file containing the first 24976 values. The smallest value is 0 and the largest is 24950. The "slope" of the data is then around 1. However, it should decrease over time if the maximum is, as told, only 33M at 100M values. The assumption of slope=1 is then a bit pessimistic.
As for the distribution,
tr '[,]' '[\n]' <compression.txt | sort -n | uniq -c | sort -nr | head -n256
produces
164 0
131 8
111 1648
108 1342
104 725
103 11
91 1475
90 1446
82 21
82 1355
78 69
76 2
75 12
72 328
71 24
70 614
70 416
70 1608
70 1266
69 22
67 356
67 3
66 1444
65 19
65 1498
65 10
64 2056
64 16
64 1322
64 1182
63 249
63 1335
61 43
60 17
60 1469
59 33
59 3116
58 20
58 1201
57 303
55 5
55 4
55 2559
55 1324
54 1110
53 1984
53 1357
52 807
52 56
52 4321
52 2892
52 1
50 39
50 2475
49 1580
48 664
48 266
47 317
47 1255
46 981
46 37
46 3531
46 23
43 1923
43 1248
41 396
41 2349
40 7
39 6
39 54
39 4699
39 32
38 815
38 2006
38 194
38 1298
38 1284
37 44
37 1550
37 1369
37 1273
36 1343
35 61
35 3991
35 3606
35 1818
35 1701
34 836
34 27
34 264
34 241
34 1306
33 821
33 28
33 248
33 18
33 15
33 1017
32 9
32 68
32 53
32 240
32 1516
32 1474
32 1390
32 1312
32 1269
31 667
31 326
31 263
31 25
31 160
31 1253
30 3365
30 2082
30 18550
30 1185
30 1049
30 1018
29 73
29 487
29 48
29 4283
29 34
29 243
29 1605
29 1515
29 1470
29 1297
29 1183
28 980
28 60
28 302
28 242
28 1959
28 1779
28 161
27 811
27 51
27 36
27 201
27 1270
27 1267
26 979
26 50
26 40
26 3111
26 26
26 2425
26 1807
25 825
25 823
25 812
25 77
25 46
25 217
25 1842
25 1831
25 1534
25 1464
25 1321
24 730
24 66
24 59
24 427
24 355
24 1465
24 1299
24 1164
24 1111
23 941
23 892
23 7896
23 663
23 607
23 556
23 47
23 2887
23 251
23 1776
23 1583
23 1488
23 1349
23 1244
22 82
22 818
22 661
22 42
22 411
22 3337
22 3190
22 3028
22 30
22 2226
22 1861
22 1363
22 1301
22 1262
22 1158
21 74
21 49
21 41
21 376
21 354
21 2156
21 1688
21 162
21 1453
21 1067
21 1053
20 711
20 413
20 412
20 38
20 337
20 2020
20 1897
20 1814
20 17342
20 173
20 1256
20 1160
19 9169
19 83
19 679
19 4120
19 399
19 2306
19 2042
19 1885
19 163
19 1623
19 1380
18 805
18 79
18 70
18 6320
18 616
18 455
18 4381
18 4165
18 3761
18 35
18 2560
18 2004
18 1900
18 1670
18 1546
18 1291
18 1264
18 1181
17 824
17 8129
17 63
17 52
17 5138
as the most frequent 256 values.
It seems some values are inherently more common. When examined, those common values also seem to be distributed all over the data.
I propose the following:
Divide the data into blocks. For each block, send the actual value of the slope, so when coding each symbol we know its maximum value.
Code the common values in a block with statistical coding (Huffman etc.). In this case, the cutoff with an alphabet of 256 would be around 17 occurrences.
For less common values, we reserve a small part of the alphabet for sending the amount of bits in the value.
When we encounter a rare value, its bits are coded without statistical modeling. The topmost bit can be omitted, since we know it's always 1 (unless value is '0').
Usually the range of values to be coded is not a power of 2. For example, if we have 10 choices, this requires 4 bits to code, but there are 6 unused bit patterns - sometimes we only need 3 bits. The first 6 choices we code directly with 3 bits. If the 3-bit code is 7 or 8, we send an extra bit to indicate whether we actually meant 9 or 10 (see the sketch after this list).
Additionally, we could exclude any value that is directly coded from the list of possible values. Otherwise we have two ways to code the same value, which is redundant.
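A sketch of that idea in its usual form (truncated binary coding); this is my own illustration, not the answerer's code, and it returns the code value and its bit count rather than writing to a bit stream:
#include <cstdint>
#include <utility>

// Encode v in [0, m) with either k-1 or k bits, where k = ceil(log2(m)) and
// u = 2^k - m is the number of unused k-bit patterns. With m = 10: values
// 0..5 get 3-bit codes 0..5, values 6..9 get 4-bit codes 12..15.
std::pair<uint32_t, unsigned> truncated_binary(uint32_t v, uint32_t m) {
    unsigned k = 0;
    while ((1u << k) < m) ++k;
    uint32_t u = (1u << k) - m;
    if (v < u)
        return { v, k - 1 };   // short code
    return { v + u, k };       // long code: a decoder reads k-1 bits, sees a
                               // value >= u, and reads one extra bit
}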
As I suggested in my comment, you can represent your data as 8-bit. There are simple ways to do it efficiently; no need for modular arithmetic.
You can use a union or pointers for this, so for example in C++ if you have:
unsigned int data32[]={0,0,0,...};
unsigned char *data08=(unsigned char*)data32;
Or you can copy it to 4 BYTE array but that will be slower.
If you have to use modular arithmetics for any reasons then you might want to do it like this:
x &255
(x>> 8)&255
(x>>16)&255
(x>>24)&255
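For illustration, splitting the vector into four byte planes with those shift/mask expressions could look like this (my own sketch, not the answerer's code):
#include <cstdint>
#include <vector>

// plane[0] holds the lowest byte of every value, plane[3] the highest,
// matching x&255, (x>>8)&255, (x>>16)&255, (x>>24)&255 above.
std::vector<std::vector<uint8_t>> split_planes(const std::vector<uint32_t>& data) {
    std::vector<std::vector<uint8_t>> plane(4);
    for (uint32_t x : data) {
        plane[0].push_back( x        & 255);
        plane[1].push_back((x >>  8) & 255);
        plane[2].push_back((x >> 16) & 255);
        plane[3].push_back((x >> 24) & 255);
    }
    return plane;
}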
Now I have tried LZW on your new data, and the compression ratio without any data reordering (single LZW) was 81-82% (depending on dictionary size; I suggest using a 10-bit LZW dictionary), which is not as good as expected. So I reordered the data into 4 arrays (just like you did), so the first array has the lowest 8 bits and the last the highest. The results with a 12-bit dictionary were:
ratio08: 144%
ratio08: 132%
ratio08: 26%
ratio08: 0%
total: 75%
The results with a 10-bit dictionary were:
ratio08: 123%
ratio08: 117%
ratio08: 28%
ratio08: 0%
total: 67%
This shows that LZW is bad for the lowest bytes (and with increasing size it will get worse for the higher bytes too), so use it only for the higher bytes, which would improve the compression ratio more.
However, I expect Huffman should lead to much better results, so I computed the entropy of your data:
H32 = 5.371071 , H/8 = 0.671384
H08 = 7.983666 , H/8 = 0.997958
H08 = 7.602564 , H/8 = 0.950321
H08 = 1.902525 , H/8 = 0.237816
H08 = 0.000000 , H/8 = 0.000000
total: 54%
meaning naive single Huffman encoding would have a compression ratio of 67% and the separate 4 arrays would lead to 54%, which is much better, so in your case I would go for Huffman encoding. After I implemented it, here is the result:
[Huffman]
ratio08 = 99.992%
ratio08 = 95.400%
ratio08 = 24.706%
ratio08 = 0.000%
total08 = 55.025%
ratio32 = 67.592%
This closely matches the estimate from the Shannon entropy, as expected (not accounting for the decoding table)...
However, with very big datasets I expect naive Huffman will start to get slightly better than the separate 4x Huffman...
Also note that the results were truncated, so those 0% are not zero but something less than 1%...
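For reference, the per-plane entropy figures above can be reproduced with something along these lines (my own sketch; H is bits per byte, and H/8 is the expected compressed/original size ratio; the same idea over 32-bit symbols gives H32):
#include <cmath>
#include <cstdint>
#include <vector>

// Shannon entropy of a byte stream, in bits per byte.
double entropy8(const std::vector<uint8_t>& bytes) {
    uint64_t hist[256] = {0};
    for (uint8_t b : bytes) hist[b]++;
    double h = 0.0, n = (double)bytes.size();
    for (uint64_t count : hist)
        if (count) {
            double p = count / n;
            h -= p * std::log2(p);
        }
    return h;   // divide by 8 for the expected size ratio of an ideal coder
}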
[Edit1] 300 000 000 entries estimation
So to simulate the conditions for your 300M 32-bit numbers, I used a 16-bit sub-part of your data with similar "empty space" properties.
log2(300 000 000) = ~28
28/32 * 16 = 14
so I used only 2^14 16-bit numbers, which should have similar properties to your 300M 32-bit numbers. The 8-bit Huffman encoding leads to:
ratio08 = 97.980%
ratio08 = 59.534%
total08 = 78.757%
So I estimate an ~80% ratio between encoded and decoded sizes, i.e. roughly a 1.25x size reduction.
(Hope I did not screw something up with my assumptions).
The data you are dealing with is "nearly" sorted, so you can use that to great effect with delta encoding.
A simple approach is as follows:
Look for runs of data, denoted by R_i = (v,l,N) where l is the length of the run, N is the bit-depth needed to do delta encoding on the sorted run, and v is the value of the first element of the (sorted) run (needed for delta encoding.) The run itself then just needs to store 2 pieces of information for each entry in the run: the idx of each sorted element in the run and the delta. Note, to store the idx of each sorted element, only log_2(l) bits are needed per idx, where l is the length of the run.
The encoding works by attempting to find the least number of bits to fully encode the run when compared to the number of bytes used in its uncompressed form. In practice, this can be implemented by finding the longest run that is encoded for a fixed number of bytes per element.
To decode, simply decode run-by-run (in order) first decoding the delta coding/compression, then undoing the sort.
Here is some C++ code that computes the compression ratio that can be obtained using this scheme on the data sample you posted. The implementation takes a greedy approach in selecting the runs; it is possible that slightly better results are available if a smarter approach is used.
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <map>
#include <queue>
#include "data.h"
template <int _N, int _M> struct T {
constexpr static int N = _N;
constexpr static int M = _M;
uint16_t idx : N;
uint16_t delta : M;
};
template <int _N, int _M>
std::pair<int32_t, int32_t> best_packed_run_stats(size_t idx) {
const int N = 1 << _N;
const int M = 1 << _M;
static std::vector<int32_t> buffer(N);
if (idx + N >= data.size())
return {-1, 0};
std::copy(&data[idx], &data[idx + N], buffer.data());
std::sort(buffer.begin(), buffer.end());
int32_t run_len = 0;
for (size_t i = 1; i < N; ++i, ++run_len) {
auto delta = buffer[i] - buffer[i - 1];
assert(delta >= 0);
if (delta >= M) {
break;
}
}
int32_t savings = run_len * (sizeof(int32_t) - sizeof(T<_N, _M>)) -
1 // 1 byte to store bit-depth
- 2; // 2 bytes to store run length
return {savings, run_len};
}
template <class... Args>
std::vector<std::pair<int32_t, int32_t>> all_runs_stats(size_t idx) {
return {best_packed_run_stats<Args::N, Args::M>(idx)...};
}
int main() {
size_t total_savings = 0;
for (size_t i = 0; i < data.size(); ++i) {
auto runs =
all_runs_stats<T<2, 14>, T<4, 12>, T<8, 8>, T<12, 4>, T<14, 2>>(i);
auto best_value = *std::max_element(runs.begin(), runs.end());
total_savings += best_value.first;
i += best_value.second;
}
size_t uncomp_size = data.size() * sizeof(int32_t);
double comp_ratio =
(uncomp_size - (double)total_savings) / (double)uncomp_size;
printf("uncomp_size: %lu\n", uncomp_size);
printf("compression: %lf\n", comp_ratio);
printf("size: %lu\n", data.size());
}
Note, only certain fixed configurations of 16-bit representations of elements in a run are attempted. Because of this we should expect the best possible compression we can achieve is 50% (i.e. 4 bytes -> 2 bytes.) In reality, there is overhead.
This code, when run on the data sample you supplied, reports this compression ratio:
uncomp_size: 99908
compression: 0.505785
size: 24977
which is very close to the theoretical limit of .5 for this compression algorithm.
Also, note, that this slightly beats out the Shannon entropy estimate reported in another answer.
Edit to address Mark Adler's comment below.
Re-running this compression on the larger data-set provided (compression2.txt) along with comparing to Mark Adler's approach here are the results:
uncomp_size: 2602628
compression: 0.507544
size: 650657
bit compression: 0.574639
Where bit compression is the compression ratio of Mark Adler's approach. As noted by others, compressing the bits of each entry will not scale well for large data, we should expect the ratio to get worse with n.
Meanwhile the delta + sorting compression described above maintains close to its theoretical best of .5.

What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?

Two different threads within a single process can share a common memory location by reading and/or writing to it.
Usually, such (intentional) sharing is implemented with atomic operations using the lock prefix on x86, which has fairly well-known costs both for the lock prefix itself (i.e., the uncontended cost) and also additional coherence costs when the cache line is actually shared (true or false sharing).
Here I'm interested in producer-consumer costs where a single thread P writes to a memory location, and another thread C reads from the memory location, both using plain reads and writes.
What are the latency and throughput of such an operation when performed on separate cores on the same socket, and in comparison when performed on sibling hyperthreads on the same physical core, on recent x86 CPUs?
In the title I'm using the term "hyper-siblings" to refer to two threads running on the two logical threads of the same core, and inter-core siblings to refer to the more usual case of two threads running on different physical cores.
Okay, I couldn't find any authoritative source, so I figured I'd give it a go myself.
#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <cstdint>
#include <iostream>
alignas(128) static uint64_t data[SIZE];
alignas(128) static std::atomic<unsigned> shared;
#ifdef EMPTY_PRODUCER
alignas(128) std::atomic<unsigned> unshared;
#endif
alignas(128) static std::atomic<bool> stop_producer;
alignas(128) static std::atomic<uint64_t> elapsed;
static inline uint64_t rdtsc()
{
    unsigned int l, h;
    __asm__ __volatile__ (
        "rdtsc"
        : "=a" (l), "=d" (h)
    );
    return ((uint64_t)h << 32) | l;
}
static void * consume(void *)
{
    uint64_t value = 0;
    uint64_t start = rdtsc();
    for (unsigned n = 0; n < LOOPS; ++n) {
        for (unsigned idx = 0; idx < SIZE; ++idx) {
            value += data[idx] + shared.load(std::memory_order_relaxed);
        }
    }
    elapsed = rdtsc() - start;
    return reinterpret_cast<void*>(value);
}
static void * produce(void *)
{
    do {
#ifdef EMPTY_PRODUCER
        unshared.store(0, std::memory_order_relaxed);
#else
        shared.store(0, std::memory_order_relaxed);
#endif
    } while (!stop_producer);
    return nullptr;
}
int main()
{
    pthread_t consumerId, producerId;
    pthread_attr_t consumerAttrs, producerAttrs;
    cpu_set_t cpuset;

    for (unsigned idx = 0; idx < SIZE; ++idx) { data[idx] = 1; }
    shared = 0;
    stop_producer = false;

    pthread_attr_init(&consumerAttrs);
    CPU_ZERO(&cpuset);
    CPU_SET(CONSUMER_CPU, &cpuset);
    pthread_attr_setaffinity_np(&consumerAttrs, sizeof(cpuset), &cpuset);

    pthread_attr_init(&producerAttrs);
    CPU_ZERO(&cpuset);
    CPU_SET(PRODUCER_CPU, &cpuset);
    pthread_attr_setaffinity_np(&producerAttrs, sizeof(cpuset), &cpuset);

    pthread_create(&consumerId, &consumerAttrs, consume, NULL);
    pthread_create(&producerId, &producerAttrs, produce, NULL);

    pthread_attr_destroy(&consumerAttrs);
    pthread_attr_destroy(&producerAttrs);

    pthread_join(consumerId, NULL);
    stop_producer = true;
    pthread_join(producerId, NULL);

    std::cout << "Elapsed cycles: " << elapsed << std::endl;
    return 0;
}
Compile with the following command, replacing defines:
gcc -std=c++11 -DCONSUMER_CPU=3 -DPRODUCER_CPU=0 -DSIZE=131072 -DLOOPS=8000 timing.cxx -lstdc++ -lpthread -O2 -o timing
Where:
CONSUMER_CPU is the number of the cpu to run consumer thread on.
PRODUCER_CPU is the number of the cpu to run producer thread on.
SIZE is the size of the inner loop (matters for cache)
LOOPS is, well...
Here are the generated loops:
Consumer thread
400cc8: ba 80 24 60 00 mov $0x602480,%edx
400ccd: 0f 1f 00 nopl (%rax)
400cd0: 8b 05 2a 17 20 00 mov 0x20172a(%rip),%eax # 602400 <shared>
400cd6: 48 83 c2 08 add $0x8,%rdx
400cda: 48 03 42 f8 add -0x8(%rdx),%rax
400cde: 48 01 c1 add %rax,%rcx
400ce1: 48 81 fa 80 24 70 00 cmp $0x702480,%rdx
400ce8: 75 e6 jne 400cd0 <_ZL7consumePv+0x20>
400cea: 83 ee 01 sub $0x1,%esi
400ced: 75 d9 jne 400cc8 <_ZL7consumePv+0x18>
Producer thread, with empty loop (no writing to shared):
400c90: c7 05 e6 16 20 00 00 movl $0x0,0x2016e6(%rip) # 602380 <unshared>
400c97: 00 00 00
400c9a: 0f b6 05 5f 16 20 00 movzbl 0x20165f(%rip),%eax # 602300 <stop_producer>
400ca1: 84 c0 test %al,%al
400ca3: 74 eb je 400c90 <_ZL7producePv>
Producer thread, writing to shared:
400c90: c7 05 66 17 20 00 00 movl $0x0,0x201766(%rip) # 602400 <shared>
400c97: 00 00 00
400c9a: 0f b6 05 5f 16 20 00 movzbl 0x20165f(%rip),%eax # 602300 <stop_producer>
400ca1: 84 c0 test %al,%al
400ca3: 74 eb je 400c90 <_ZL7producePv>
The program counts the number of CPU cycles consumed, on the consumer's core, to complete the whole loop. We compare the first producer, which does nothing but burn CPU cycles, to the second producer, which disrupts the consumer by repeatedly writing to shared.
My system has an i5-4210U; that is, 2 cores, 2 threads per core. They are exposed by the kernel as Core#1 → cpu0, cpu2 and Core#2 → cpu1, cpu3.
Result without starting the producer at all:
CONSUMER PRODUCER cycles for 1M cycles for 128k
3 n/a 2.11G 1.80G
Results with empty producer. For 1G operations (either 1000*1M or 8000*128k).
CONSUMER PRODUCER cycles for 1M cycles for 128k
3 3 3.20G 3.26G # mono
3 2 2.10G 1.80G # other core
3 1 4.18G 3.24G # same core, HT
As expected, since both threads are CPU hogs and both get a fair share, the producer burning cycles slows the consumer down by about half. That's just CPU contention.
With producer on cpu#2, as there is no interaction, consumer runs with no impact from the producer running on another cpu.
With producer on cpu#1, we see hyperthreading at work.
Results with disruptive producer:
CONSUMER PRODUCER cycles for 1M cycles for 128k
3 3 4.26G 3.24G # mono
3 2 22.1 G 19.2 G # other core
3 1 36.9 G 37.1 G # same core, HT
When we schedule both threads on the same logical thread of the same core, there is no impact. Expected again, as the producer's writes remain local, incurring no synchronization cost.
I cannot really explain why I get much worse performance for hyperthreading than for two cores. Advice welcome.
The killer problem is that the core makes speculative reads, which means that each time a write to the speculatively read address (or more correctly, to the same cache line) arrives before the read is "fulfilled", the CPU must undo the read (at least if you're on x86), which effectively means it cancels all speculative instructions from that instruction onward.
At some point, before the read is retired, it gets "fulfilled", i.e. no instruction before it can fail and there is no longer any reason to reissue, and the CPU can act as if it had executed all previous instructions.
Other core example
These are playing cache ping pong in addition to cancelling instructions so this should be worse than the HT version.
Let's start at some point in the process where the cache line with the shared data has just been marked shared because the Consumer has asked to read it.
The Producer now wants to write to the shared data and sends out a request for exclusive ownership of the cache line.
The Consumer receives its cache line, still in shared state, and happily reads the value.
The Consumer continues to read the shared value until the exclusive request arrives.
At which point the Consumer sends a shared request for the cache line.
At this point the Consumer clears its instructions from the first unfulfilled load instruction of the shared value.
While the Consumer waits for the data it runs ahead speculatively.
So the Consumer can advance in the period between getting its shared cache line and it being invalidated again. It is unclear how many reads can be fulfilled at the same time, most likely 2, as the CPU has 2 read ports. And it probably doesn't need to rerun them once the internal state of the CPU is satisfied that they can't fail.
Same core HT
Here the two HT shares the core and must share its resources.
The cache line should stay in the exclusive state all the time as they share the cache and therefore don't need the cache protocol.
Now why does it take so many cycles on the HT core? Let's start with the Consumer just having read the shared value.
The next cycle, a write from the Producer occurs.
The Consumer thread detects the write and cancels all its instructions from the first unfulfilled read onward.
The Consumer re-issues its instructions, taking ~5-14 cycles to run again.
Finally the first instruction, which is a read, is issued and executed; it did not read a speculative value but a correct one, as it is at the front of the queue.
So for every read of the shared value the Consumer is reset.
Conclusion
The different cores apparently advance so much between each cache ping-pong that they perform better than the HT version.
What would have happened if the CPU waited to see if the value had actually changed?
For the test code, the HT version would have run much faster, maybe even as fast as the private-write version. The different-core version would not have run faster, as the cache miss was covering the reissue latency.
But if the data had been different, the same problem would arise, except it would be worse for the different-core version, as it would then also have to wait for the cache line and then reissue.
So if the OP can change some of the roles, letting the timestamp producer read from the shared location and take the performance hit, it would be better.
Read more here

How to divide by 9 using just shifts/add/sub?

Last week I was in an interview and there was a test like this:
Calculate N/9 (given that N is a positive integer), using only
SHIFT LEFT, SHIFT RIGHT, ADD, SUBTRACT instructions.
First, find the representation of 1/9 in binary:
0.0001110001110001...
This means it's (1/16) + (1/32) + (1/64) + (1/1024) + (1/2048) + (1/4096) + (1/65536) + ...
so (x/9) is approximately (x>>4) + (x>>5) + (x>>6) + (x>>10) + (x>>11) + (x>>12) + (x>>16)
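Written out in code, that approximation looks roughly like this (my own sketch of the idea above; because the reciprocal is truncated to 16 fractional bits and each shift rounds down, the raw sum comes out low for many inputs, which is why the loop version below adds a correction):
#include <cstdint>

// Approximate x/9 by summing shifted copies of x, following the bit pattern
// of 1/9 above. Truncation makes this an underestimate for some inputs, so
// an exact division needs a correction on top (see the assembly below).
uint32_t div9_approx(uint32_t x) {
    return (x >> 4) + (x >> 5) + (x >> 6)
         + (x >> 10) + (x >> 11) + (x >> 12)
         + (x >> 16);
}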
Possible optimization (if loops are allowed):
if you loop over 0001110001110001b right shifting it each loop,
add "x" to your result register whenever the carry was set on this shift
and shift your result right each time afterwards,
your result is x/9
mov cx, 16 ; assuming 16 bit registers
mov bx, 7281 ; bit mask of 2^16 * (1/9)
mov ax, 8166 ; sample value, (1/9 of it is 907)
mov dx, 0 ; dx holds the result
div9:
inc ax ; or "add ax,1" if inc's not allowed :)
; workaround for the fact that 7/64
; are a bit less than 1/9
shr bx,1
jnc no_add
add dx,ax
no_add:
shr dx,1
dec cx
jnz div9
( currently cannot test this, may be wrong)
You can use a fixed-point math trick.
You just scale up so that the significant fractional part moves into the integer range, do the fractional math operation you need, and scale back.
a/9 = ((a*10000)/9)/10000
as you can see I scaled by 10000. Now the integer part of 10000/9=1111 is big enough so I can write:
a/9 = ~a*1111/10000
power of 2 scale
If you use a power-of-2 scale then you only need a bit shift instead of a division. You need to compromise between precision and input value range. I empirically found that with 32-bit arithmetic the best scale for this is 1<<18, so:
(((a+1)<<18)/9)>>18 = ~a/9;
The (a+1) corrects the rounding errors back to the right range.
Hardcoded multiplication
Rewrite the multiplication constant to binary
q = (1<<18)/9 = 29127 = 0111 0001 1100 0111 bin
Now if you need to compute c=(a*q), use hard-coded binary multiplication: for each 1 bit in q you can add a<<(position_of_1) to c. If you see something like 111 you can rewrite it as 1000-1, minimizing the number of operations.
If you put all of this together you should get something like this C++ code of mine:
DWORD div9(DWORD a)
{
    // ((a+1)*q)>>18 = (((a+1)<<18)/9)>>18 = ~a/9;
    // q = (1<<18)/9 = 29127 = 0111 0001 1100 0111 bin
    // valid for a = < 0 , 147455 >
    DWORD c;
    c  = (a<< 3)-(a    );   // c  = a*7
    c += (a<< 9)-(a<< 6);   // c += a*448   -> c = a*455
    c += (a<<15)-(a<<12);   // c += a*28672 -> c = a*29127
    c += 29127;             // c  = (a+1)*29127
    c >>= 18;               // c  = ((a+1)*29127)>>18
    return c;
}
Now if you look at the binary form, the pattern 111000 is repeating, so you can further improve the code a bit:
DWORD div9(DWORD a)
{
    DWORD c;
    c  = (a<<3)-a;          // first pattern
    c += (c<<6)+(c<<12);    // and the other 2...
    c += 29127;
    c >>= 18;
    return c;
}
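A quick way to sanity-check the stated valid range (a usage sketch; it assumes one of the div9() definitions above is linked in):
#include <cstdint>
#include <cstdio>

typedef uint32_t DWORD;   // same meaning as above
DWORD div9(DWORD a);      // one of the versions above

int main() {
    for (DWORD a = 0; a <= 147455; ++a)
        if (div9(a) != a / 9) {
            std::printf("mismatch at a=%u\n", (unsigned)a);
            return 1;
        }
    std::printf("div9(a) == a/9 for all a in <0, 147455>\n");
    return 0;
}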

Find nth SET bit in an int

Instead of just the lowest set bit, I want to find the position of the nth lowest set bit. (I'm NOT talking about the value at the nth bit position.)
For example, say I have:
0000 1101 1000 0100 1100 1000 1010 0000
And I want to find the 4th bit that is set. Then I want it to return:
0000 0000 0000 0000 0100 0000 0000 0000
If popcnt(v) < n, it would make sense if this function returned 0, but any behavior for this case is acceptable for me.
I'm looking for something faster than a loop if possible.
Nowadays this is very easy with PDEP from the BMI2 instruction set. Here is a 64-bit version with some examples:
#include <cassert>
#include <cstdint>
#include <x86intrin.h>
inline uint64_t nthset(uint64_t x, unsigned n) {
    return _pdep_u64(1ULL << n, x);
}

int main() {
    assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000,  0) ==
                  0b0000'0000'0000'0000'0000'0000'0010'0000);
    assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000,  1) ==
                  0b0000'0000'0000'0000'0000'0000'1000'0000);
    assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000,  3) ==
                  0b0000'0000'0000'0000'0100'0000'0000'0000);
    assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000,  9) ==
                  0b0000'1000'0000'0000'0000'0000'0000'0000);
    assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 10) ==
                  0b0000'0000'0000'0000'0000'0000'0000'0000);
}
If you just want the (zero-based) index of the nth set bit, add a trailing zero count.
inline unsigned nthset(uint64_t x, unsigned n) {
    return _tzcnt_u64(_pdep_u64(1ULL << n, x));
}
It turns out that it is indeed possible to do this with no loops. It is fastest to precompute the (at least) 8 bit version of this problem. Of course, these tables use up cache space, but there should still be a net speedup in virtually all modern pc scenarios. In this code, n=0 returns the least set bit, n=1 is second-to-least, etc.
Solution with __popcnt
There is a solution using the __popcnt intrinsic (you need __popcnt to be extremely fast or any perf gains over a simple loop solution will be moot. Fortunately most SSE4+ era processors support it).
// lookup table for sub-problem: 8-bit v
byte PRECOMP[256][8] = { .... } // PRECOMP[v][n] for v < 256 and n < 8
ulong nthSetBit(ulong v, ulong n) {
    ulong p = __popcnt(v & 0xFFFF);
    ulong shift = 0;
    if (p <= n) {
        v >>= 16;
        shift += 16;
        n -= p;
    }
    p = __popcnt(v & 0xFF);
    if (p <= n) {
        shift += 8;
        v >>= 8;
        n -= p;
    }
    if (n >= 8) return 0; // optional safety, in case n > # of set bits
    return PRECOMP[v & 0xFF][n] << shift;
}
This illustrates how the divide and conquer approach works.
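The PRECOMP table itself is elided above. One way to fill it at startup, assuming (as the return statement suggests) that PRECOMP[v][n] holds the n-th lowest set bit of the byte v as an isolated bit mask, and using uint8_t for the byte type (my own sketch):
#include <cstdint>

static uint8_t PRECOMP[256][8];   // PRECOMP[v][n] = n-th lowest set bit of v, as a mask

static void init_precomp() {
    for (unsigned v = 0; v < 256; ++v) {
        unsigned n = 0;
        for (unsigned bit = 0; bit < 8; ++bit)
            if (v & (1u << bit))
                PRECOMP[v][n++] = (uint8_t)(1u << bit);
        // entries with n >= popcount(v) stay 0, so out-of-range queries yield 0
    }
}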
General Solution
There is also a solution for "general" architectures- without __popcnt. It can be done by processing in 8-bit chunks. You need one more lookup table that tells you the popcnt of a byte:
byte PRECOMP[256][8] = { .... } // PRECOMP[v][n] for v<256 and n < 8
byte POPCNT[256] = { ... } // POPCNT[v] is the number of set bits in v. (v < 256)
ulong nthSetBit(ulong v, ulong n) {
    ulong p = POPCNT[v & 0xFF];
    ulong shift = 0;
    if (p <= n) {
        n -= p;
        v >>= 8;
        shift += 8;
        p = POPCNT[v & 0xFF];
        if (p <= n) {
            n -= p;
            shift += 8;
            v >>= 8;
            p = POPCNT[v & 0xFF];
            if (p <= n) {
                n -= p;
                shift += 8;
                v >>= 8;
            }
        }
    }
    if (n >= 8) return 0; // optional safety, in case n > # of set bits
    return PRECOMP[v & 0xFF][n] << shift;
}
This could, of course, be done with a loop, but the unrolled form is faster and the unusual form of the loop would make it unlikely that the compiler could automatically unroll it for you.
v-1 has a zero where v has its least significant "one" bit, while all more significant bits are the same. This leads to the following function:
int ffsn(unsigned int v, int n) {
    for (int i = 0; i < n-1; i++) {
        v &= v-1;          // remove the least significant bit
    }
    return v & ~(v-1);     // extract the least significant bit
}
The version from bit-twiddling hacks adapted to this case is, for example,
unsigned int nth_bit_set(uint32_t value, unsigned int n)
{
const uint32_t pop2 = (value & 0x55555555u) + ((value >> 1) & 0x55555555u);
const uint32_t pop4 = (pop2 & 0x33333333u) + ((pop2 >> 2) & 0x33333333u);
const uint32_t pop8 = (pop4 & 0x0f0f0f0fu) + ((pop4 >> 4) & 0x0f0f0f0fu);
const uint32_t pop16 = (pop8 & 0x00ff00ffu) + ((pop8 >> 8) & 0x00ff00ffu);
const uint32_t pop32 = (pop16 & 0x000000ffu) + ((pop16 >>16) & 0x000000ffu);
unsigned int rank = 0;
unsigned int temp;
if (n++ >= pop32)
return 32;
temp = pop16 & 0xffu;
/* if (n > temp) { n -= temp; rank += 16; } */
rank += ((temp - n) & 256) >> 4;
n -= temp & ((temp - n) >> 8);
temp = (pop8 >> rank) & 0xffu;
/* if (n > temp) { n -= temp; rank += 8; } */
rank += ((temp - n) & 256) >> 5;
n -= temp & ((temp - n) >> 8);
temp = (pop4 >> rank) & 0x0fu;
/* if (n > temp) { n -= temp; rank += 4; } */
rank += ((temp - n) & 256) >> 6;
n -= temp & ((temp - n) >> 8);
temp = (pop2 >> rank) & 0x03u;
/* if (n > temp) { n -= temp; rank += 2; } */
rank += ((temp - n) & 256) >> 7;
n -= temp & ((temp - n) >> 8);
temp = (value >> rank) & 0x01u;
/* if (n > temp) rank += 1; */
rank += ((temp - n) & 256) >> 8;
return rank;
}
which, when compiled in a separate compilation unit, on gcc-5.4.0 using -Wall -O3 -march=native -mtune=native on Intel Core i5-4200u, yields
00400a40 <nth_bit_set>:
400a40: 89 f9 mov %edi,%ecx
400a42: 89 f8 mov %edi,%eax
400a44: 55 push %rbp
400a45: 40 0f b6 f6 movzbl %sil,%esi
400a49: d1 e9 shr %ecx
400a4b: 25 55 55 55 55 and $0x55555555,%eax
400a50: 53 push %rbx
400a51: 81 e1 55 55 55 55 and $0x55555555,%ecx
400a57: 01 c1 add %eax,%ecx
400a59: 41 89 c8 mov %ecx,%r8d
400a5c: 89 c8 mov %ecx,%eax
400a5e: 41 c1 e8 02 shr $0x2,%r8d
400a62: 25 33 33 33 33 and $0x33333333,%eax
400a67: 41 81 e0 33 33 33 33 and $0x33333333,%r8d
400a6e: 41 01 c0 add %eax,%r8d
400a71: 45 89 c1 mov %r8d,%r9d
400a74: 44 89 c0 mov %r8d,%eax
400a77: 41 c1 e9 04 shr $0x4,%r9d
400a7b: 25 0f 0f 0f 0f and $0xf0f0f0f,%eax
400a80: 41 81 e1 0f 0f 0f 0f and $0xf0f0f0f,%r9d
400a87: 41 01 c1 add %eax,%r9d
400a8a: 44 89 c8 mov %r9d,%eax
400a8d: 44 89 ca mov %r9d,%edx
400a90: c1 e8 08 shr $0x8,%eax
400a93: 81 e2 ff 00 ff 00 and $0xff00ff,%edx
400a99: 25 ff 00 ff 00 and $0xff00ff,%eax
400a9e: 01 d0 add %edx,%eax
400aa0: 0f b6 d8 movzbl %al,%ebx
400aa3: c1 e8 10 shr $0x10,%eax
400aa6: 0f b6 d0 movzbl %al,%edx
400aa9: b8 20 00 00 00 mov $0x20,%eax
400aae: 01 da add %ebx,%edx
400ab0: 39 f2 cmp %esi,%edx
400ab2: 77 0c ja 400ac0 <nth_bit_set+0x80>
400ab4: 5b pop %rbx
400ab5: 5d pop %rbp
400ab6: c3 retq
400ac0: 83 c6 01 add $0x1,%esi
400ac3: 89 dd mov %ebx,%ebp
400ac5: 29 f5 sub %esi,%ebp
400ac7: 41 89 ea mov %ebp,%r10d
400aca: c1 ed 08 shr $0x8,%ebp
400acd: 41 81 e2 00 01 00 00 and $0x100,%r10d
400ad4: 21 eb and %ebp,%ebx
400ad6: 41 c1 ea 04 shr $0x4,%r10d
400ada: 29 de sub %ebx,%esi
400adc: c4 42 2b f7 c9 shrx %r10d,%r9d,%r9d
400ae1: 41 0f b6 d9 movzbl %r9b,%ebx
400ae5: 89 dd mov %ebx,%ebp
400ae7: 29 f5 sub %esi,%ebp
400ae9: 41 89 e9 mov %ebp,%r9d
400aec: 41 81 e1 00 01 00 00 and $0x100,%r9d
400af3: 41 c1 e9 05 shr $0x5,%r9d
400af7: 47 8d 14 11 lea (%r9,%r10,1),%r10d
400afb: 41 89 e9 mov %ebp,%r9d
400afe: 41 c1 e9 08 shr $0x8,%r9d
400b02: c4 42 2b f7 c0 shrx %r10d,%r8d,%r8d
400b07: 41 83 e0 0f and $0xf,%r8d
400b0b: 44 21 cb and %r9d,%ebx
400b0e: 45 89 c3 mov %r8d,%r11d
400b11: 29 de sub %ebx,%esi
400b13: 5b pop %rbx
400b14: 41 29 f3 sub %esi,%r11d
400b17: 5d pop %rbp
400b18: 44 89 da mov %r11d,%edx
400b1b: 41 c1 eb 08 shr $0x8,%r11d
400b1f: 81 e2 00 01 00 00 and $0x100,%edx
400b25: 45 21 d8 and %r11d,%r8d
400b28: c1 ea 06 shr $0x6,%edx
400b2b: 44 29 c6 sub %r8d,%esi
400b2e: 46 8d 0c 12 lea (%rdx,%r10,1),%r9d
400b32: c4 e2 33 f7 c9 shrx %r9d,%ecx,%ecx
400b37: 83 e1 03 and $0x3,%ecx
400b3a: 41 89 c8 mov %ecx,%r8d
400b3d: 41 29 f0 sub %esi,%r8d
400b40: 44 89 c0 mov %r8d,%eax
400b43: 41 c1 e8 08 shr $0x8,%r8d
400b47: 25 00 01 00 00 and $0x100,%eax
400b4c: 44 21 c1 and %r8d,%ecx
400b4f: c1 e8 07 shr $0x7,%eax
400b52: 29 ce sub %ecx,%esi
400b54: 42 8d 14 08 lea (%rax,%r9,1),%edx
400b58: c4 e2 6b f7 c7 shrx %edx,%edi,%eax
400b5d: 83 e0 01 and $0x1,%eax
400b60: 29 f0 sub %esi,%eax
400b62: 25 00 01 00 00 and $0x100,%eax
400b67: c1 e8 08 shr $0x8,%eax
400b6a: 01 d0 add %edx,%eax
400b6c: c3 retq
When compiled as a separate compilation unit, timing on this machine is difficult, because the actual operation is as fast as calling a do-nothing function (also compiled in a separate compilation unit); essentially, the calculation is done during the latencies associated with the function call.
It seems to be slightly faster than my suggestion of a binary search,
unsigned int nth_bit_set(uint32_t value, unsigned int n)
{
    uint32_t mask = 0x0000FFFFu;
    unsigned int size = 16u;
    unsigned int base = 0u;

    if (n++ >= __builtin_popcount(value))
        return 32;

    while (size > 0) {
        const unsigned int count = __builtin_popcount(value & mask);
        if (n > count) {
            base += size;
            size >>= 1;
            mask |= mask << size;
        } else {
            size >>= 1;
            mask >>= size;
        }
    }

    return base;
}
where the loop is executed exactly five times, compiling to
00400ba0 <nth_bit_set>:
400ba0: 83 c6 01 add $0x1,%esi
400ba3: 31 c0 xor %eax,%eax
400ba5: b9 10 00 00 00 mov $0x10,%ecx
400baa: ba ff ff 00 00 mov $0xffff,%edx
400baf: 45 31 db xor %r11d,%r11d
400bb2: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
400bb8: 41 89 c9 mov %ecx,%r9d
400bbb: 41 89 f8 mov %edi,%r8d
400bbe: 41 d0 e9 shr %r9b
400bc1: 41 21 d0 and %edx,%r8d
400bc4: c4 62 31 f7 d2 shlx %r9d,%edx,%r10d
400bc9: f3 45 0f b8 c0 popcnt %r8d,%r8d
400bce: 41 09 d2 or %edx,%r10d
400bd1: 44 38 c6 cmp %r8b,%sil
400bd4: 41 0f 46 cb cmovbe %r11d,%ecx
400bd8: c4 e2 33 f7 d2 shrx %r9d,%edx,%edx
400bdd: 41 0f 47 d2 cmova %r10d,%edx
400be1: 01 c8 add %ecx,%eax
400be3: 44 89 c9 mov %r9d,%ecx
400be6: 45 84 c9 test %r9b,%r9b
400be9: 75 cd jne 400bb8 <nth_bit_set+0x18>
400beb: c3 retq
as in, not more than 31 cycles in 95% of calls to the binary search version, compared to not more than 28 cycles in 95% of calls to the bit-hack version; both run within 28 cycles in 50% of the cases. (The loop version takes up to 56 cycles in 95% of calls, up to 37 cycles median.)
To determine which one is better in actual real-world code, one would have to do a proper benchmark within the real-world task; at least with current x86-64 architecture processors, the work done is easily hidden in latencies incurred elsewhere (like function calls).
My answer is mostly based on this implementation of a 64bit word select method (Hint: Look only at the MARISA_USE_POPCNT, MARISA_X64, MARISA_USE_SSE3 codepaths):
It works in two steps, first selecting the byte containing the n-th set bit and then using a lookup table inside the byte:
Extract the lower and higher nibbles for every byte (bitmasks 0xF, 0xF0, shift the higher nibbles down)
Replace the nibble values by their popcount (_mm_shuffle_epi8 with A000120)
Sum the popcounts of the lower and upper nibbles (Normal SSE addition) to get byte popcounts
Compute the prefix sum over all byte popcounts (multiplication with 0x01010101...)
Propagate the position n to all bytes (SSE broadcast or again multiplication with 0x01010101...)
Do a bytewise comparison (_mm_cmpgt_epi8 leaves 0xFF in every byte smaller than n)
Compute the byte offset by doing a popcount on the result
Now we know which byte contains the bit and a simple byte lookup table like in grek40's answer suffices to get the result.
Note however that I have not really benchmarked this against other implementations; I have only seen that it is quite efficient (and branchless).
I can't see a method without a loop; what springs to mind would be:
int set = 0;
int pos = -1;   // start at -1 so pos ends on the found bit itself
while (set < n) {
    pos++;
    if ((bits & 0x01) == 1) set++;
    bits = bits >> 1;
}
after which, pos would hold the position of the nth lowest-value set bit.
The only other thing that I can think of would be a divide and conquer approach, which might yield O(log(n)) rather than O(n)...but probably not.
Edit: you said any behaviour, so non-termination is ok, right? :P
def bitN (l: Long, i: Int) : Long = {
  def bitI (l: Long, i: Int) : Long =
    if (i == 0) 1L else
      2 * {
        if (l % 2 == 0) bitI (l / 2, i) else bitI (l / 2, i - 1)
      }
  bitI (l, i) / 2
}
A recursive method (in Scala). Decrement i, the position, if l modulo 2 is 1. While returning, multiply by 2. Since the multiplication is invoked as the last operation, it is not tail recursive, but since Longs are of a known size in advance, the maximum stack depth is not too big.
scala> n.toBinaryString.replaceAll ("(.{8})", "$1 ")
res117: java.lang.String = 10110011 11101110 01011110 01111110 00111101 11100101 11101011 011000
scala> bitN (n, 40) .toBinaryString.replaceAll ("(.{8})", "$1 ")
res118: java.lang.String = 10000000 00000000 00000000 00000000 00000000 00000000 00000000 000000
Edit
After giving it some thought and using the __builtin_popcount function, I figured it might be better to decide on the relevant byte and then compute the whole result instead of incrementally adding/subtracting numbers. Here is an updated version:
int GetBitAtPosition(unsigned i, unsigned n)
{
    unsigned bitCount;

    bitCount = __builtin_popcount(i & 0x00ffffff);
    if (bitCount <= n)
    {
        return (24 + LUT_BitPosition[i >> 24][n - bitCount]);
    }
    bitCount = __builtin_popcount(i & 0x0000ffff);
    if (bitCount <= n)
    {
        return (16 + LUT_BitPosition[(i >> 16) & 0xff][n - bitCount]);
    }
    bitCount = __builtin_popcount(i & 0x000000ff);
    if (bitCount <= n)
    {
        return (8 + LUT_BitPosition[(i >> 8) & 0xff][n - bitCount]);
    }
    return LUT_BitPosition[i & 0xff][n];
}
I felt like creating a LUT based solution where the number is inspected in byte-chunks, however, the LUT for the n-th bit position grew quite large (256*8) and the LUT-free version that was discussed in the comments might be better.
Generally the algorithm would look like this:
unsigned i = 0x000006B5;
unsigned n = 4;
unsigned result = 0;
unsigned bitCount;
while (i)
{
    bitCount = LUT_BitCount[i & 0xff];
    if (n < bitCount)
    {
        result += LUT_BitPosition[i & 0xff][n];
        break; // found
    }
    else
    {
        n -= bitCount;
        result += 8;
        i >>= 8;
    }
}
It might be worth unrolling the loop into its up to 4 iterations to get the best performance on 32-bit numbers.
The LUT for bitcount (could be replaced by __builtin_popcount):
unsigned LUT_BitCount[] = {
0, 1, 1, 2, 1, 2, 2, 3, // 0-7
1, 2, 2, 3, 2, 3, 3, 4, // 8-15
1, 2, 2, 3, 2, 3, 3, 4, // 16-23
2, 3, 3, 4, 3, 4, 4, 5, // 24-31
1, 2, 2, 3, 2, 3, 3, 4, // 32-39
2, 3, 3, 4, 3, 4, 4, 5, // 40-47
2, 3, 3, 4, 3, 4, 4, 5, // 48-55
3, 4, 4, 5, 4, 5, 5, 6, // 56-63
1, 2, 2, 3, 2, 3, 3, 4, // 64-71
2, 3, 3, 4, 3, 4, 4, 5, // 72-79
2, 3, 3, 4, 3, 4, 4, 5, // 80-87
3, 4, 4, 5, 4, 5, 5, 6, // 88-95
2, 3, 3, 4, 3, 4, 4, 5, // 96-103
3, 4, 4, 5, 4, 5, 5, 6, // 104-111
3, 4, 4, 5, 4, 5, 5, 6, // 112-119
4, 5, 5, 6, 5, 6, 6, 7, // 120-127
1, 2, 2, 3, 2, 3, 3, 4, // 128
2, 3, 3, 4, 3, 4, 4, 5, // 136
2, 3, 3, 4, 3, 4, 4, 5, // 144
3, 4, 4, 5, 4, 5, 5, 6, // 152
2, 3, 3, 4, 3, 4, 4, 5, // 160
3, 4, 4, 5, 4, 5, 5, 6, // 168
3, 4, 4, 5, 4, 5, 5, 6, // 176
4, 5, 5, 6, 5, 6, 6, 7, // 184
2, 3, 3, 4, 3, 4, 4, 5, // 192
3, 4, 4, 5, 4, 5, 5, 6, // 200
3, 4, 4, 5, 4, 5, 5, 6, // 208
4, 5, 5, 6, 5, 6, 6, 7, // 216
3, 4, 4, 5, 4, 5, 5, 6, // 224
4, 5, 5, 6, 5, 6, 6, 7, // 232
4, 5, 5, 6, 5, 6, 6, 7, // 240
5, 6, 6, 7, 6, 7, 7, 8, // 248-255
};
The LUT for bit position within a byte:
unsigned LUT_BitPosition[][8] = {
// 0-7
{UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
// 8-15
{3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
// 16-31
{4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,4,UINT_MAX,UINT_MAX,UINT_MAX},
// 32-63
{5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,5,UINT_MAX,UINT_MAX,UINT_MAX},
{4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,4,5,UINT_MAX,UINT_MAX,UINT_MAX},
{3,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,4,5,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,4,5,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,4,5,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,4,5,UINT_MAX,UINT_MAX},
// 64-127
{6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,6,UINT_MAX,UINT_MAX,UINT_MAX},
{4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,4,6,UINT_MAX,UINT_MAX,UINT_MAX},
{3,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,4,6,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,4,6,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,4,6,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,4,6,UINT_MAX,UINT_MAX},
{5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{3,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,5,6,UINT_MAX,UINT_MAX},
{4,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{2,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,4,5,6,UINT_MAX,UINT_MAX},
{3,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,4,5,6,UINT_MAX,UINT_MAX},
{2,3,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,4,5,6,UINT_MAX,UINT_MAX},
{1,2,3,4,5,6,UINT_MAX,UINT_MAX},
{0,1,2,3,4,5,6,UINT_MAX},
// 128-255
{7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,7,UINT_MAX,UINT_MAX,UINT_MAX},
{4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,4,7,UINT_MAX,UINT_MAX,UINT_MAX},
{3,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,4,7,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,4,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,4,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,4,7,UINT_MAX,UINT_MAX},
{5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{3,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,5,7,UINT_MAX,UINT_MAX},
{4,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{2,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,4,5,7,UINT_MAX,UINT_MAX},
{3,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,4,5,7,UINT_MAX,UINT_MAX},
{2,3,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,4,5,7,UINT_MAX,UINT_MAX},
{1,2,3,4,5,7,UINT_MAX,UINT_MAX},
{0,1,2,3,4,5,7,UINT_MAX},
{6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{3,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,6,7,UINT_MAX,UINT_MAX},
{4,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{2,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,4,6,7,UINT_MAX,UINT_MAX},
{3,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,4,6,7,UINT_MAX,UINT_MAX},
{2,3,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,4,6,7,UINT_MAX,UINT_MAX},
{1,2,3,4,6,7,UINT_MAX,UINT_MAX},
{0,1,2,3,4,6,7,UINT_MAX},
{5,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{2,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,5,6,7,UINT_MAX,UINT_MAX},
{3,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,5,6,7,UINT_MAX,UINT_MAX},
{2,3,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,5,6,7,UINT_MAX,UINT_MAX},
{1,2,3,5,6,7,UINT_MAX,UINT_MAX},
{0,1,2,3,5,6,7,UINT_MAX},
{4,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,4,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,4,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,4,5,6,7,UINT_MAX,UINT_MAX},
{2,4,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,4,5,6,7,UINT_MAX,UINT_MAX},
{1,2,4,5,6,7,UINT_MAX,UINT_MAX},
{0,1,2,4,5,6,7,UINT_MAX},
{3,4,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,4,5,6,7,UINT_MAX,UINT_MAX},
{1,3,4,5,6,7,UINT_MAX,UINT_MAX},
{0,1,3,4,5,6,7,UINT_MAX},
{2,3,4,5,6,7,UINT_MAX,UINT_MAX},
{0,2,3,4,5,6,7,UINT_MAX},
{1,2,3,4,5,6,7,UINT_MAX},
{0,1,2,3,4,5,6,7},
};
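For illustration, here is one way the two tables might be combined to select the n-th set bit of a 32-bit value. This is a sketch of mine, not part of the original answer; in particular, LUT_PopCount is an assumed name for the per-byte popcount table above, whose declaration is not repeated here.
#include <limits.h>
#include <stdint.h>
/* LUT_PopCount: assumed name for the per-byte popcount table above */
/* LUT_BitPosition: the bit-position table above */
unsigned nth_set_bit_lut(uint32_t v, unsigned n)   /* n is 0-based */
{
    unsigned byte;
    for (byte = 0; byte < 4; byte++) {
        unsigned char part = (v >> (8 * byte)) & 0xff;
        unsigned cnt = LUT_PopCount[part];          /* set bits in this byte */
        if (n < cnt)
            return 8 * byte + LUT_BitPosition[part][n];
        n -= cnt;                                   /* skip this byte's set bits */
    }
    return UINT_MAX;                                /* fewer than n+1 bits set */
}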
My approach is to calculate the population count for each 8-bit quarter of the 32-bit integer in parallel, then find which quarter contains the nth bit. The population counts of the quarters below the found one are summed to give the starting value of the later calculation.
After that, count set bits one by one until n is reached. Branch-free, and reusing the intermediate steps of the classic population-count bit trick, my example is the following:
#include <stdio.h>
#include <stdint.h>

int main() {
    uint32_t n = 10, test = 3124375902u; /* 10111010001110100011000101011110 */
    uint32_t index, popcnt, prefix, quarter = 0, q_popcnt;

    /* count set bits of each 8-bit quarter of the 32-bit integer in parallel */
    q_popcnt = test - ((test >> 1) & 0x55555555);
    q_popcnt = (q_popcnt & 0x33333333) + ((q_popcnt >> 2) & 0x33333333);
    q_popcnt = (q_popcnt + (q_popcnt >> 4)) & 0x0F0F0F0F;

    /* byte k of prefix now holds the popcount of quarters 0..k */
    prefix = q_popcnt * 0x01010101;

    /* find which quarter contains the nth set bit (n is 0-based) */
    quarter += (n >= (prefix & 0xff));
    quarter += (n >= ((prefix >> 8) & 0xff));
    quarter += (n >= ((prefix >> 16) & 0xff));

    /* popcount of the quarters below the found one: starting value of the scan */
    popcnt = ((prefix << 8) >> (8 * quarter)) & 0xff;

    /* find the index of the nth bit in the quarter where it must be */
    index = 8 * quarter;
    index += ((popcnt += (test >> index) & 1) <= n);
    index += ((popcnt += (test >> index) & 1) <= n);
    index += ((popcnt += (test >> index) & 1) <= n);
    index += ((popcnt += (test >> index) & 1) <= n);
    index += ((popcnt += (test >> index) & 1) <= n);
    index += ((popcnt += (test >> index) & 1) <= n);
    index += ((popcnt += (test >> index) & 1) <= n);
    index += ((popcnt += (test >> index) & 1) <= n);

    printf("index = %u\n", index);   /* prints 20 for this test value */
    return 0;
}
A simple approach using a loop and conditionals could be the following as well:
#include <stdio.h>
#include <stdint.h>
int main() {
uint32_t n = 11, test = 3124375902u; /* 10111010001110100011000101011110 */
uint32_t popcnt = 0, index = 0;
while(popcnt += ((test >> index) & 1), popcnt <= n && ++index < 32);
printf("index = %u\n", index);
return 0;
}
I know the question asks for something faster than a loop, but a complicated loop-less answer is likely to take longer than a quick loop.
If the computer has 32-bit ints and v is a random value, then it will typically have around 16 ones, and if we are looking for a random place among those 16 ones, we will typically be looking for roughly the 8th one. Seven or eight times round a loop with just a couple of statements isn't too bad.
int findNthBit(unsigned int n, int v)
{
int next;
if (n > __builtin_popcount(v)) return 0;
while (next = v&v-1, --n)
{
v = next;
}
return v ^ next;
}
The loop works by removing the lowest set bit (n-1) times.
The n'th set bit, the next one that would be removed, is the bit we were looking for; v ^ next isolates it.
If anybody wants to test this ....
#include "stdio.h"
#include "assert.h"
// function here
int main() {
assert(findNthBit(1, 0)==0);
assert(findNthBit(1, 0xf0f)==1<<0);
assert(findNthBit(2, 0xf0f)==1<<1);
assert(findNthBit(3, 0xf0f)==1<<2);
assert(findNthBit(4, 0xf0f)==1<<3);
assert(findNthBit(5, 0xf0f)==1<<8);
assert(findNthBit(6, 0xf0f)==1<<9);
assert(findNthBit(7, 0xf0f)==1<<10);
assert(findNthBit(8, 0xf0f)==1<<11);
assert(findNthBit(9, 0xf0f)==0);
printf("looks good\n");
}
If there are concerns about the number of times the loop is executed, for example if the function is regularly called with large values of n, it's simple to add an extra line or two of the following form
if (n > 8) return findNthBit(n-__builtin_popcount(v&0xff), v>>8) << 8;
or
if (n > 12) return findNthBit(n - __builtin_popcount(v&0xfff), v>>12) << 12;
The idea here is that the n'th one will never be located in the bottom n-1 bits. A better version clears not only the bottom 8 or 12 bits, but all the bottom (n-1) bits when n is large-ish and we don't want to loop that many times.
if (n > 7) return findNthBit(n - __builtin_popcount(v & ((1<<(n-1))-1)), v>>(n-1)) << (n-1);
I tested this with findNthBit(20, 0xaf5faf5f) and after clearing out the bottom 19 bits because the answer wasn't to be found there, it looked for the 5th bit in the remaining bits by looping 4 times to remove 4 ones.
So an improved version is
int findNthBit(unsigned int n, int v)
{
int next;
if (n > __builtin_popcount(v)) return 0;
if (n > 7) return findNthBit(n - __builtin_popcount(v & ((1<<(n-1))-1)), v>>(n-1)) << (n-1);
while (next = v&v-1, --n)
{
v = next;
}
return v ^ next;
}
The value 7, which limits the looping, is chosen fairly arbitrarily as a compromise between limiting looping and limiting recursion. The function could be further improved by removing recursion and keeping track of a shift amount instead. I may try this if I get some peace from home schooling my daughter!
Here is a final version with the recursion removed by keeping track of the number of low order bits shifted out from the bottom of the bits being searched.
Final version
int findNthBit(unsigned int n, int v)
{
int shifted = 0; // running total
int nBits; // value for this iteration
// handle no solution
if (n > __builtin_popcount(v)) return 0;
while (n > 7)
{
// for large n shift out lower n-1 bits from v.
nBits = n-1;
n -= __builtin_popcount(v & ((1<<nBits)-1));
v >>= nBits;
shifted += nBits;
}
int next;
// n is now small, clear out n-1 bits and return the next bit
// v&(v-1): a well known software trick to remove the lowest set bit.
while (next = v&(v-1), --n)
{
v = next;
}
return (v ^ next) << shifted;
}
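As a quick sanity check of this version against the example discussed above (my own assertion, not the author's; counting the set bits of 0xaf5faf5f by hand, the 20th one is bit 25, and findNthBit returns the bit itself rather than its index):
#include <assert.h>
/* findNthBit() as defined above */
int main(void) {
    assert(findNthBit(20, 0xaf5faf5f) == 1 << 25);
    return 0;
}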
Building on the answer given by Jukka Suomela, which uses a machine-specific instruction that may not necessarily be available, it is also possible to write a function that does exactly the same thing as _pdep_u64 without any machine dependencies. It must loop over the set bits in one of the arguments, but can still be described as a constexpr function for C++11.
constexpr inline uint64_t deposit_bits(uint64_t x, uint64_t mask, uint64_t b, uint64_t res) {
return mask != 0 ? deposit_bits(x, mask & (mask - 1), b << 1, ((x & b) ? (res | (mask & (-mask))) : res)) : res;
}
constexpr inline uint64_t nthset(uint64_t x, unsigned n) {
return deposit_bits(1ULL << n, x, 1, 0);
}
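A few compile-time checks illustrating the semantics (my own examples, not part of the original answer): n is 0-based and nthset returns the selected bit as a mask rather than as an index, so a trailing-zero count can be applied afterwards if the index is wanted.
// deposit_bits() and nthset() as defined above
// 0x6a is binary 01101010: set bits at positions 1, 3, 5 and 6
static_assert(nthset(0x6a, 0) == 0x02, "0th set bit");
static_assert(nthset(0x6a, 1) == 0x08, "1st set bit");
static_assert(nthset(0x6a, 2) == 0x20, "2nd set bit");
static_assert(nthset(0x6a, 3) == 0x40, "3rd set bit");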
Based on a method by Juha Järvi published in the famous Bit Twiddling Hacks, I tested this implementation, shown here wrapped in a small function for completeness, where n and i are used as in the question (n is 0-based):
#include <stdint.h>

/* wrapper and the name nth_bit_set added for completeness; a, b, c hold the
   2-bit, 4-bit and 8-bit partial popcounts of i, r is the remaining rank,
   and s accumulates the bit offset */
unsigned nth_bit_set(uint32_t i, unsigned n)
{
    uint32_t a, b, c, r, s, t;

    a = i - (i >> 1 & 0x55555555);
    b = (a & 0x33333333) + (a >> 2 & 0x33333333);
    c = b + (b >> 4) & 0x0f0f0f0f;
    r = n + 1;
    s = 0;

    t = c + (c >> 8) & 0xff;     /* set bits in the low 16 bits */
    if (r > t) {
        s += 16;
        r -= t;
    }
    t = c >> s & 0xf;            /* set bits in the byte starting at bit s */
    if (r > t) {
        s += 8;
        r -= t;
    }
    t = b >> s & 0x7;            /* set bits in the nibble starting at bit s */
    if (r > t) {
        s += 4;
        r -= t;
    }
    t = a >> s & 0x3;            /* set bits in the bit pair starting at bit s */
    if (r > t) {
        s += 2;
        r -= t;
    }
    t = i >> s & 0x1;            /* the single bit at position s */
    if (r > t)
        s++;
    return s;
}
Based on my own tests, this is about as fast as the loop on x86, whereas it is 20% faster on arm64 and probably a lot faster on arm due to the fast conditional instructions, but I can't test this right now.
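For a quick sanity check, here is a minimal driver (my own addition, using the nth_bit_set wrapper above; the expected index 20 is the 10th set bit, counting from zero, of the example value used in an earlier answer):
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
/* nth_bit_set() as defined above */
int main(void) {
    uint32_t test = 3124375902u;           /* 10111010001110100011000101011110 */
    assert(nth_bit_set(test, 10) == 20);   /* the 10th set bit (0-based) is bit 20 */
    printf("index = %u\n", nth_bit_set(test, 10));
    return 0;
}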
The PDEP solution is great, but some languages such as Java do not expose this intrinsic yet, even though they are efficient at other low-level operations. So I came up with the following fallback for such cases: a branchless binary search.
// n must be using 0-based indexing.
// This method produces correct results only if n is smaller
// than the number of set bits.
public static int getNthSetBit(long mask64, int n) {
// Binary search without branching
int base = 0;
final int low32 = (int) mask64;
final int high32n = n - Integer.bitCount(low32);
final int inLow32 = high32n >>> 31;
final int inHigh32 = inLow32 ^ 1;
final int shift32 = inHigh32 << 5;
final int mask32 = (int) (mask64 >>> shift32);
n = ((-inLow32) & n) | ((-inHigh32) & high32n);
base += shift32;
final int low16 = mask32 & 0xffff;
final int high16n = n - Integer.bitCount(low16);
final int inLow16 = high16n >>> 31;
final int inHigh16 = inLow16 ^ 1;
final int shift16 = inHigh16 << 4;
final int mask16 = (mask32 >>> shift16) & 0xffff;
n = ((-inLow16) & n) | ((-inHigh16) & high16n);
base += shift16;
final int low8 = mask16 & 0xff;
final int high8n = n - Integer.bitCount(low8);
final int inLow8 = high8n >>> 31;
final int inHigh8 = inLow8 ^ 1;
final int shift8 = inHigh8 << 3;
final int mask8 = (mask16 >>> shift8) & 0xff;
n = ((-inLow8) & n) | ((-inHigh8) & high8n);
base += shift8;
final int low4 = mask8 & 0xf;
final int high4n = n - Integer.bitCount(low4);
final int inLow4 = high4n >>> 31;
final int inHigh4 = inLow4 ^ 1;
final int shift4 = inHigh4 << 2;
final int mask4 = (mask8 >>> shift4) & 0xf;
n = ((-inLow4) & n) | ((-inHigh4) & high4n);
base += shift4;
final int low2 = mask4 & 3;
final int high2n = n - (low2 >> 1) - (low2 & 1);
final int inLow2 = high2n >>> 31;
final int inHigh2 = inLow2 ^ 1;
final int shift2 = inHigh2 << 1;
final int mask2 = (mask4 >>> shift2) & 3;
n = ((-inLow2) & n) | ((-inHigh2) & high2n);
base += shift2;
// For the 2 bits remaining we can take a shortcut: n is now 0 or 1.
// If n == 1, both bits are set and the answer is bit 1; if n == 0,
// the answer is bit 0 unless bit 0 of mask2 is clear, in which case it is bit 1.
return base + (n | ((mask2 ^ 1) & 1));
}

Most elegant way to expand card hand suits

I'm storing 4-card hands in a way to treat hands with different suits the same, e.g.:
9h 8h 7c 6c
is the same as
9d 8d 7h 6h
since you can replace one suit with another and have the same thing. It's easy to turn these into a unique representation using wildcards for suits. The previous would become:
9A 8A 7B 6B
My question is - what's the most elegant way to turn the latter back into a list of the former? For example, when the input is 9A 8A 7B 6B, the output should be:
9c 8c 7d 6d
9c 8c 7h 6h
9c 8c 7s 6s
9h 8h 7d 6d
9h 8h 7c 6c
9h 8h 7s 6s
9d 8d 7c 6c
9d 8d 7h 6h
9d 8d 7s 6s
9s 8s 7d 6d
9s 8s 7h 6h
9s 8s 7c 6c
I have some ugly code that does this on a case-by-case basis depending on how many unique suits there are. It won't scale to hands with more cards. Also in a situation like:
7A 7B 8A 8B
it will have duplicates, since in this case A=c and B=d is the same as A=d and B=c.
What's an elegant way to solve this problem efficiently? I'm coding in C, but I can convert higher-level code down to C.
There are only 4 suits so the space of possible substitutions is really small - 4! = 24 cases.
In this case, I don't think it is worth it to try to come up with something especially clever.
Just parse the string like "7A 7B 8A 8B", count the number of different letters in it, and based on that number, generate substitutions based on a precomputed set of substitutions.
1 letter -> 4 possible substitutions c, d, h, or s
2 letters -> 12 substitutions, as in your example.
3 or 4 letters -> 24 substitutions.
Then sort the set of substitutions and remove duplicates. You have to sort the tokens in every string like "7c 8d 9d 9s" and then sort an array of the strings to detect duplicates, but that shouldn't be a problem. It's good to have the patterns like "7A 7B 8A 8B" sorted too (the tokens like "7A", "8B" in ascending order).
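A compact sketch of this generate-then-deduplicate idea, in C++ for brevity (my own illustration, iterating all 24 suit permutations instead of a precomputed table; the names are mine):
#include <algorithm>
#include <iostream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

int main() {
    const std::string pattern = "9A 8A 7B 6B";   // wildcard hand; suit letters A-D

    // split into tokens like "9A"; the wildcard letter is the 2nd character
    std::vector<std::string> tokens;
    std::istringstream in(pattern);
    for (std::string t; in >> t; ) tokens.push_back(t);

    // collect the distinct wildcard letters, in order of first appearance
    std::string letters;
    for (const auto &t : tokens)
        if (letters.find(t[1]) == std::string::npos) letters += t[1];

    std::set<std::string> hands;    // a sorted set removes duplicate hands
    std::string perm = "cdhs";      // already in ascending order
    do {
        // substitute: k-th wildcard letter -> k-th suit of this permutation
        std::vector<std::string> cards = tokens;
        for (auto &c : cards) c[1] = perm[letters.find(c[1])];
        std::sort(cards.begin(), cards.end());   // canonical token order for dedup
        std::string hand;
        for (const auto &c : cards) hand += (hand.empty() ? "" : " ") + c;
        hands.insert(hand);
    } while (std::next_permutation(perm.begin(), perm.end()));

    for (const auto &h : hands) std::cout << h << "\n";   // 12 hands for this pattern
}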
EDIT:
An alternative to sorting might be to detect identical sets of ranks associated with two or more letters and take that into account when generating substitutions, but I think it's more complicated. You would have to create a set of ranks for each letter appearing in the pattern string.
For example, for the string "7A 7B 8A 8B", the set {7, 8} is associated with the letter A, and the same set is associated with the letter B. Then you have to look for identical sets associated with different letters. In most cases those sets will have just one element, but they might have two as in the example above. Letters associated with the same set are interchangeable. You can have the following situations:
1 letter no duplicates -> 4 possible substitutions c, d, h, or s
2 letters no duplicates -> 12 substitutions.
2 letters, 2 letters interchangeable (identical sets for both letters) -> 6 substitutions.
3 letters no duplicates -> 24 substitutions.
3 letters, 2 letters interchangeable -> 12 substitutions.
4 letters no duplicates -> 24 substitutions.
4 letters, 2 letters interchangeable -> 12 substitutions.
4 letters, 3 letters interchangeable -> 4 substitutions.
4 letters, 2 pairs of interchangeable letters -> 6 substitutions.
4 letters, 4 letters interchangeable -> 1 substitution.
I think a generic permutation function that takes an array arr and an integer n and returns all possible permutations of n elements in that array would be useful here.
Find how many unique suits exist in the hand. Then generate all possible permutations with that many elements from the actual suits [c, d, h, s]. Finally go through each permutation of suits, and assign each unknown letter [A, B, C, D] in the hand to the permuted values.
The following code in Ruby takes a given hand and generates all suit permutations. The heaviest work is being done by the Array.permutation(n) method here which should simplify things a lot for a corresponding C program as well.
# all 4 suits needed for generating permutations
suits = ["c", "d", "h", "s"]
# current hand
hand = "9A 8A 7B 6B"
# find number of unique suits in the hand. In this case it's 2 => [A, B]
unique_suits_in_hand = hand.scan(/.(.)\s?/).uniq.length
# generate all possible permutations of 2 suits, and for each permutation
# do letter assignments in the original hand
# tr is a translation function which maps corresponding letters in both strings.
# it doesn't matter which unknowns are used (A, B, C, D) since they
# will be replaced consistently.
# After suit assignments are done, we split the cards in hand, and sort them.
possible_hands = suits.permutation(unique_suits_in_hand).map do |perm|
hand.tr("ABCD", perm.join ).split(' ').sort
end
# Remove all duplicates
p possible_hands.uniq
The above code outputs
9c 8c 7d 6d
9c 8c 7h 6h
9c 8c 7s 6s
9d 8d 7c 6c
9d 8d 7h 6h
9d 8d 7s 6s
9h 8h 7c 6c
9h 8h 7d 6d
9h 8h 7s 6s
9s 8s 7c 6c
9s 8s 7d 6d
9s 8s 7h 6h
Represent suits as sparse arrays or lists, numbers as indexes, hands as associative arrays
In your example
H [A[07080000] B[07080000] C[00000000] D[00000000] ] (place for four cards)
To get the "real" hands, always apply all 24 permutations (fixed time), so you don't have to care about how many suit letters your hand has (A,B,C,D -> c,d,h,s), with the following "trick": always store the suits in alphabetical order:
H1 [c[xxxxxx] d[xxxxxx] h[xxxxxx] s[xxxxxx]]
Since hands are associative arrays, duplicated permutations do not generate two different output hands.
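A small sketch of this idea in C++ (my own illustration, not the answer's code): represent each realized hand as an associative map from suit to ranks and let a set of such maps collapse the duplicates produced by the 24 permutations.
#include <algorithm>
#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>

int main() {
    const std::string pattern = "7A 7B 8A 8B";   // wildcard hand; suit letters A-D
    std::string perm = "cdhs";                   // sorted, so all 24 permutations are visited

    // a hand as an associative array: real suit -> sorted ranks
    using Hand = std::map<char, std::multiset<char>>;
    std::set<Hand> hands;                        // identical associative hands merge here

    do {
        Hand h;
        std::istringstream in(pattern);
        for (std::string card; in >> card; )
            h[perm[card[1] - 'A']].insert(card[0]);   // map wildcard A..D to a real suit
        hands.insert(h);
    } while (std::next_permutation(perm.begin(), perm.end()));

    std::cout << hands.size() << " distinct hands\n";    // prints 6 for "7A 7B 8A 8B"
    for (const auto &h : hands) {
        for (const auto &s : h)
            for (char r : s.second) std::cout << r << s.first << ' ';
        std::cout << '\n';
    }
}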
#include <stdio.h>
#include <stdlib.h>
const int RANK = 0;
const int SUIT = 1;
const int NUM_SUITS = 4;
const char STANDARD_SUITS[] = "dchs";
int usedSuits[] = {0, 0, 0, 0};
const char MOCK_SUITS[] = "ABCD";
const char BAD_SUIT = '*';
char pullSuit (int i) {
if (usedSuits [i] > 0) {
return BAD_SUIT;
}
++usedSuits [i];
return STANDARD_SUITS [i];
}
void unpullSuit (int i) {
--usedSuits [i];
}
int indexOfSuit (char suit, const char suits[]) {
int i;
for (i = 0; i < NUM_SUITS; ++i) {
if (suit == suits [i]) {
return i;
}
}
return -1;
}
int legitimateSuits (const char suits[]) {
return indexOfSuit (BAD_SUIT, suits) == -1;
}
int distinctSuits (const char suits[]) {
int i, j;
for (i = 0; i < NUM_SUITS; ++i) {
for (j = 0; j < NUM_SUITS; ++j) {
if (i != j && suits [i] == suits [j]) {
return 0;
}
}
}
return 1;
}
void printCards (char* mockCards[], int numMockCards, const char realizedSuits[]) {
int i;
for (i = 0; i < numMockCards; ++i) {
char* mockCard = mockCards [i];
char rank = mockCard [RANK];
char mockSuit = mockCard [SUIT];
int idx = indexOfSuit (mockSuit, MOCK_SUITS);
char realizedSuit = realizedSuits [idx];
printf ("%c%c ", rank, realizedSuit);
}
printf ("\n");
}
/*
* Example usage:
* char* mockCards[] = {"9A", "8A", "7B", "6B"};
* expand (mockCards, 4);
*/
void expand (char* mockCards[], int numMockCards) {
int i, j, k, l;
for (i = 0; i < NUM_SUITS; ++i) {
char a = pullSuit (i);
for (j = 0; j < NUM_SUITS; ++j) {
char b = pullSuit (j);
for (k = 0; k < NUM_SUITS; ++k) {
char c = pullSuit (k);
for (l = 0; l < NUM_SUITS; ++l) {
char d = pullSuit (l);
char realizedSuits[] = {a, b, c, d};
int legitimate = legitimateSuits (realizedSuits);
if (legitimate) {
int distinct = distinctSuits (realizedSuits);
if (distinct) {
printCards (mockCards, numMockCards, realizedSuits);
}
}
unpullSuit (l);
}
unpullSuit (k);
}
unpullSuit (j);
}
unpullSuit (i);
}
}
int main () {
char* mockCards[] = {"9A", "8A", "7B", "6B"};
expand (mockCards, 4);
return 0;
}

Resources