Most elegant way to expand card hand suits

I'm storing 4-card hands in a way to treat hands with different suits the same, e.g.:
9h 8h 7c 6c
is the same as
9d 8d 7h 6h
since you can replace one suit with another and have the same thing. It's easy to turn these into a unique representation using wildcards for suits. The previous would become:
9A 8A 7B 6B
My question is - what's the most elegant way to turn the latter back into a list of the former? For example, when the input is 9A 8A 7B 6B, the output should be:
9c 8c 7d 6d
9c 8c 7h 6h
9c 8c 7s 6s
9h 8h 7d 6d
9h 8h 7c 6c
9h 8h 7s 6s
9d 8d 7c 6c
9d 8d 7h 6h
9d 8d 7s 6s
9s 8s 7d 6d
9s 8s 7h 6h
9s 8s 7c 6c
I have some ugly code that does this on a case-by-case basis depending on how many unique suits there are. It won't scale to hands with more cards. Also in a situation like:
7A 7B 8A 8B
it will have duplicates, since in this case A=c and B=d is the same as A=d and B=c.
What's an elegant way to solve this problem efficiently? I'm coding in C, but I can convert higher-level code down to C.

There are only 4 suits so the space of possible substitutions is really small - 4! = 24 cases.
In this case, I don't think it is worth trying to come up with something especially clever.
Just parse a string like "7A 7B 8A 8B", count the number of different letters in it, and generate substitutions from a precomputed set chosen by that number.
1 letter -> 4 possible substitutions: c, d, h, or s
2 letters -> 12 substitutions, as in your example.
3 or 4 letters -> 24 substitutions.
Then sort the set of substitutions and remove duplicates. You have to sort the tokens in every string like "7c 8d 9d 9s" and then sort the array of strings to detect duplicates, but that shouldn't be a problem. It's good to keep patterns like "7A 7B 8A 8B" sorted too (with tokens such as "7A", "8B" in ascending order).
EDIT:
An alternative to sorting might be to detect identical sets of ranks associated with two or more suits and take that into account when generating substitutions, but I think it's more complicated. You would have to create a set of ranks for each letter appearing in the pattern string.
For example, in the string "7A 7B 8A 8B" the set {7, 8} is associated with the letter A, and the same set is associated with the letter B. Then you have to look for identical sets associated with different letters. In most cases those sets will have just one element, but they might have two, as in the example above. Letters associated with the same set are interchangeable. You can have the following situations:
1 letter no duplicates -> 4 possible substitutions c, d, h, or s
2 letters no duplicates -> 12 substitutions.
2 letters, 2 letters interchangeable (identical sets for both letters) -> 6 substitutions.
3 letters no duplicates -> 24 substitutions.
3 letters, 2 letters interchangeable -> 12 substitutions.
4 letters no duplicates -> 24 substitutions.
4 letters, 2 letters interchangeable -> 12 substitutions.
4 letters, 3 letters interchangeable -> 4 substitutions.
4 letters, 2 pairs of interchangeable letters -> 6 substitutions.
4 letters, 4 letters interchangeable -> 1 substitution.
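The counts in the list above all follow one rule: with k distinct letters there are 4*3*... (k factors) ordered suit assignments, divided by g! for every group of g interchangeable letters. A minimal C sketch of that rule follows; the function name count_expansions, the rank alphabet "23456789TJQKA" and the single-space card separator are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: number of distinct real hands for a wildcard pattern
 * such as "7A 7B 8A 8B". Assumes two-character cards (rank, then a wildcard
 * letter A-D) separated by single spaces. */
static int count_expansions(const char *pattern) {
    static const char RANKS[] = "23456789TJQKA";
    unsigned rank_set[4] = {0, 0, 0, 0};   /* 13-bit rank set per letter */
    size_t len = strlen(pattern);

    for (size_t i = 0; i + 1 < len; i += 3) {
        const char *r = strchr(RANKS, pattern[i]);
        int letter = pattern[i + 1] - 'A';
        assert(r != NULL && letter >= 0 && letter < 4);
        rank_set[letter] |= 1u << (int)(r - RANKS);
    }

    /* Ordered assignments of distinct suits to the k letters in use. */
    int used = 0;
    int result = 1;
    for (int a = 0; a < 4; ++a) {
        if (rank_set[a] == 0) continue;
        result *= 4 - used;
        ++used;
    }

    /* Divide by g! for each group of g letters sharing the same rank set:
     * the j-th member of a group contributes a division by j. */
    for (int a = 0; a < 4; ++a) {
        if (rank_set[a] == 0) continue;
        int earlier = 0;
        for (int b = 0; b < a; ++b) {
            if (rank_set[b] == rank_set[a]) ++earlier;
        }
        result /= earlier + 1;
    }
    return result;
}
```

For instance, count_expansions("7A 7B 8A 8B") gives 6 and count_expansions("9A 8A 7B 6B") gives 12, matching the "2 letters" rows.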

I think a generic permutation function that takes an array arr and an integer n and returns all possible permutations of n elements in that array would be useful here.
Find out how many unique suits exist in the hand. Then generate all possible permutations of that many elements from the actual suits [c, d, h, s]. Finally, go through each permutation of suits and assign each unknown letter [A, B, C, D] in the hand to the permuted values.
The following Ruby code takes a given hand and generates all suit permutations. The heaviest work is done by the Array#permutation(n) method, which should simplify things a lot for a corresponding C program as well.
# all 4 suits needed for generating permutations
suits = ["c", "d", "h", "s"]
# current hand
hand = "9A 8A 7B 6B"
# find number of unique suits in the hand. In this case it's 2 => [A, B]
unique_suits_in_hand = hand.scan(/.(.)\s?/).uniq.length
# generate all possible permutations of 2 suits, and for each permutation
# do letter assignments in the original hand
# tr is a translation function which maps corresponding letters in both strings.
# it doesn't matter which unknowns are used (A, B, C, D) since they
# will be replaced consistently.
# After suit assignments are done, we split the cards in hand, and sort them.
possible_hands = suits.permutation(unique_suits_in_hand).map do |perm|
  hand.tr("ABCD", perm.join).split(' ').sort
end
# Remove all duplicates
p possible_hands.uniq
The above code outputs
9c 8c 7d 6d
9c 8c 7h 6h
9c 8c 7s 6s
9d 8d 7c 6c
9d 8d 7h 6h
9d 8d 7s 6s
9h 8h 7c 6c
9h 8h 7d 6d
9h 8h 7s 6s
9s 8s 7c 6c
9s 8s 7d 6d
9s 8s 7h 6h

Represent suits as sparse arrays or lists, numbers as indexes, and hands as associative arrays.
In your example:
H [A[07080000] B[07080000] C[00000000] D[00000000] ] (place for four cards)
To get the "real" hands, always apply all 24 permutations A,B,C,D -> c,d,h,s (fixed time), so you don't have to care how many cards your hand has. Use the following trick: always store the suits in alphabetical order:
H1 [c[xxxxxx] d[xxxxxx] h[xxxxxx] s[xxxxxx]]
Since hands are associative arrays, duplicate permutations do not generate two different output hands.
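One possible reading of this idea in C is sketched below: each realized hand is stored as a 52-bit set, so card order cannot matter, and all 24 permutations are applied with duplicates dropped. The names realize and expand_hands, the bit layout (rank * 4 + suit), the suit order c, d, h, s and the rank alphabet are assumptions for illustration, not part of the answer above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static const char RANKS[] = "23456789TJQKA";

/* Build the hand as a bit set: bit (rank * 4 + suit), suits 0..3 = c,d,h,s.
 * Assumes two-character cards separated by single spaces. */
static uint64_t realize(const char *pattern, const int suit_of[4]) {
    uint64_t hand = 0;
    size_t len = strlen(pattern);
    for (size_t i = 0; i + 1 < len; i += 3) {
        const char *r = strchr(RANKS, pattern[i]);
        int letter = pattern[i + 1] - 'A';
        assert(r != NULL && letter >= 0 && letter < 4);
        hand |= (uint64_t)1 << ((int)(r - RANKS) * 4 + suit_of[letter]);
    }
    return hand;
}

/* Writes the distinct hands to out (room for 24) and returns how many. */
static int expand_hands(const char *pattern, uint64_t out[24]) {
    int count = 0;
    int s[4];
    for (s[0] = 0; s[0] < 4; ++s[0])
    for (s[1] = 0; s[1] < 4; ++s[1])
    for (s[2] = 0; s[2] < 4; ++s[2])
    for (s[3] = 0; s[3] < 4; ++s[3]) {
        /* Keep only true permutations: every suit used exactly once. */
        if (((1 << s[0]) | (1 << s[1]) | (1 << s[2]) | (1 << s[3])) != 15) {
            continue;
        }
        uint64_t hand = realize(pattern, s);
        int seen = 0;
        for (int i = 0; i < count; ++i) {
            if (out[i] == hand) seen = 1;
        }
        if (!seen) out[count++] = hand;
    }
    return count;
}
```

Called on "7A 7B 8A 8B" this returns 6 hands, and on "9A 8A 7B 6B" it returns 12, without any special casing for the number of wildcard letters.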

#include <stdio.h>
#include <stdlib.h>
const int RANK = 0;
const int SUIT = 1;
const int NUM_SUITS = 4;
const char STANDARD_SUITS[] = "dchs";
int usedSuits[] = {0, 0, 0, 0};
const char MOCK_SUITS[] = "ABCD";
const char BAD_SUIT = '*';
char pullSuit (int i) {
    if (usedSuits [i] > 0) {
        return BAD_SUIT;
    }
    ++usedSuits [i];
    return STANDARD_SUITS [i];
}
void unpullSuit (int i) {
    --usedSuits [i];
}
int indexOfSuit (char suit, const char suits[]) {
    int i;
    for (i = 0; i < NUM_SUITS; ++i) {
        if (suit == suits [i]) {
            return i;
        }
    }
    return -1;
}
int legitimateSuits (const char suits[]) {
    return indexOfSuit (BAD_SUIT, suits) == -1;
}
int distinctSuits (const char suits[]) {
    int i, j;
    for (i = 0; i < NUM_SUITS; ++i) {
        for (j = 0; j < NUM_SUITS; ++j) {
            if (i != j && suits [i] == suits [j]) {
                return 0;
            }
        }
    }
    return 1;
}
void printCards (char* mockCards[], int numMockCards, const char realizedSuits[]) {
    int i;
    for (i = 0; i < numMockCards; ++i) {
        char* mockCard = mockCards [i];
        char rank = mockCard [RANK];
        char mockSuit = mockCard [SUIT];
        int idx = indexOfSuit (mockSuit, MOCK_SUITS);
        char realizedSuit = realizedSuits [idx];
        printf ("%c%c ", rank, realizedSuit);
    }
    printf ("\n");
}
/*
 * Example usage:
 * char* mockCards[] = {"9A", "8A", "7B", "6B"};
 * expand (mockCards, 4);
 *
 * Note: one line is printed per permutation of all four suits, so a hand
 * that uses fewer than four wildcard letters is printed more than once.
 */
void expand (char* mockCards[], int numMockCards) {
    int i, j, k, l;
    for (i = 0; i < NUM_SUITS; ++i) {
        char a = pullSuit (i);
        for (j = 0; j < NUM_SUITS; ++j) {
            char b = pullSuit (j);
            for (k = 0; k < NUM_SUITS; ++k) {
                char c = pullSuit (k);
                for (l = 0; l < NUM_SUITS; ++l) {
                    char d = pullSuit (l);
                    char realizedSuits[] = {a, b, c, d};
                    int legitimate = legitimateSuits (realizedSuits);
                    if (legitimate) {
                        int distinct = distinctSuits (realizedSuits);
                        if (distinct) {
                            printCards (mockCards, numMockCards, realizedSuits);
                        }
                    }
                    unpullSuit (l);
                }
                unpullSuit (k);
            }
            unpullSuit (j);
        }
        unpullSuit (i);
    }
}
int main () {
    char* mockCards[] = {"9A", "8A", "7B", "6B"};
    expand (mockCards, 4);
    return 0;
}

Related

Understanding branch prediction efficiency

I tried to measure the cost of branch prediction, so I created a little program.
It creates a little buffer on the stack and fills it with random 0/1 values. I can set the size of the buffer with N. The code repeatedly causes branches for the same 1<<N random numbers.
Now, I expected that if 1<<N is sufficiently large (like >100), the branch predictor would not be effective (as it has to predict >100 random numbers). However, these are the results (on a 5820K machine); as N grows, the program becomes slower:
N time
=========
8 2.2
9 2.2
10 2.2
11 2.2
12 2.3
13 4.6
14 9.5
15 11.6
16 12.7
20 12.9
For reference, if the buffer is initialized with zeros (use the commented-out init), the time is more or less constant; it varies between 1.5 and 1.7 for N 8..16.
My question is: can the branch predictor be effective at predicting such a large number of random outcomes? If not, then what's going on here?
(Some more explanation: the code executes 2^32 branches regardless of N. So I expected the code to run at the same speed regardless of N, because the branch cannot be predicted at all. But it seems that if the buffer size is at most 4096 (N<=12), something makes the code fast. Can branch prediction be effective for 4096 random numbers?)
Here's the code:
#include <cstdint>
#include <iostream>
volatile uint64_t init[2] = { 314159165, 27182818 };
// volatile uint64_t init[2] = { 0, 0 };
volatile uint64_t one = 1;
uint64_t next(uint64_t s[2]) {
uint64_t s1 = s[0];
uint64_t s0 = s[1];
uint64_t result = s0 + s1;
s[0] = s0;
s1 ^= s1 << 23;
s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
return result;
}
int main() {
uint64_t s[2];
s[0] = init[0];
s[1] = init[1];
uint64_t sum = 0;
#if 1
const int N = 16;
unsigned char buffer[1<<N];
for (int i=0; i<1<<N; i++) buffer[i] = next(s)&1;
for (uint64_t i=0; i<uint64_t(1)<<(32-N); i++) {
for (int j=0; j<1<<N; j++) {
if (buffer[j]) {
sum += one;
}
}
}
#else
for (uint64_t i=0; i<uint64_t(1)<<32; i++) {
if (next(s)&1) {
sum += one;
}
}
#endif
std::cout<<sum<<"\n";
}
(The code contains a non-buffered version as well, use #if 0. It runs around the same speed as the buffered version with N=16)
Here's the inner loop disassembly (compiled with clang. It generates the same code for all N between 8..16, only the loop count differs. Clang unrolled the loop twice):
401270: 80 3c 0c 00 cmp BYTE PTR [rsp+rcx*1],0x0
401274: 74 07 je 40127d <main+0xad>
401276: 48 03 35 e3 2d 00 00 add rsi,QWORD PTR [rip+0x2de3] # 404060 <one>
40127d: 80 7c 0c 01 00 cmp BYTE PTR [rsp+rcx*1+0x1],0x0
401282: 74 07 je 40128b <main+0xbb>
401284: 48 03 35 d5 2d 00 00 add rsi,QWORD PTR [rip+0x2dd5] # 404060 <one>
40128b: 48 83 c1 02 add rcx,0x2
40128f: 48 81 f9 00 00 01 00 cmp rcx,0x10000
401296: 75 d8 jne 401270 <main+0xa0>

Branch prediction can be that effective. As Peter Cordes suggested, I checked branch-misses with perf stat. Here are the results:
N time cycles branch-misses (%) approx-time
===============================================================
8 2.2 9,084,889,375 34,806 ( 0.00) 2.2
9 2.2 9,212,112,830 39,725 ( 0.00) 2.2
10 2.2 9,264,903,090 2,394,253 ( 0.06) 2.2
11 2.2 9,415,103,000 8,102,360 ( 0.19) 2.2
12 2.3 9,876,827,586 27,169,271 ( 0.63) 2.3
13 4.6 19,572,398,825 486,814,972 (11.33) 4.6
14 9.5 39,813,380,461 1,473,662,853 (34.31) 9.5
15 11.6 49,079,798,916 1,915,930,302 (44.61) 11.7
16 12.7 53,216,900,532 2,113,177,105 (49.20) 12.7
20 12.9 54,317,444,104 2,149,928,923 (50.06) 12.9
Note: branch-misses (%) is calculated for 2^32 branches
As you can see, when N<=12, the branch predictor can predict most of the branches (which is surprising: the branch predictor can memorize the outcome of 4096 consecutive random branches!). When N>12, branch-misses start to grow. At N>=16, it can only predict ~50% correctly, which means it is no more effective than random coin flips.
The time taken can be approximated from the time and branch-misses (%) columns; I've added the last column, approx-time, for this. The values in that column follow 2.2+(12.9-2.2)*branch-misses %/50.06, that is, a linear interpolation between the fastest time (2.2) and the slowest (12.9, measured at a 50.06% miss rate). As you can see, approx-time equals time (up to rounding error). So this effect can be explained perfectly by branch prediction.
The original intent was to calculate how many cycles a branch miss costs (in this particular case; for other cases this number can differ):
(54,317,444,104-9,084,889,375)/(2,149,928,923-34,806) = 21.039 = ~21 cycles.
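The two calculations can be replayed with a small C sketch. It assumes the miss rate is scaled against the 50.06% measured at N=20, which is the scaling that reproduces the tabulated approx-time values; the names approx_time and cycles_per_miss are made up for this example, and the constants are copied from the perf table.

```c
#include <assert.h>

/* Values copied from the perf table. */
static const double T_BEST = 2.2;        /* time at N=8 */
static const double T_WORST = 12.9;      /* time at N=20 */
static const double MISS_WORST = 50.06;  /* branch-miss percentage at N=20 */

/* Linear interpolation between the best and worst measured times. */
static double approx_time(double miss_pct) {
    return T_BEST + (T_WORST - T_BEST) * miss_pct / MISS_WORST;
}

/* Extra cycles divided by extra mispredictions between two measurements. */
static double cycles_per_miss(double cycles_hi, double misses_hi,
                              double cycles_lo, double misses_lo) {
    assert(misses_hi > misses_lo);
    return (cycles_hi - cycles_lo) / (misses_hi - misses_lo);
}
```

With the N=13 miss rate, approx_time(11.33) comes out at roughly 4.6, and cycles_per_miss(54317444104.0, 2149928923.0, 9084889375.0, 34806.0) at roughly 21.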

how to find xor key/algorithm, for a given hex?

I have this hex: B0 32 B6 B4 37
I know this hex is obfuscated with some key/algorithm.
I also know this hex is equal to: 61 64 6d 69 6e (admin)
How can I calculate the XOR key for this?

If you write out the binary representation, you can see the pattern:
encoded decoded
10110000 -> 01100001
00110010 -> 01100100
Notice that the bit patterns have the same number of set bits before and after. To decode, you just bitwise rotate each byte one bit left: the value shifts left one place and the most significant bit wraps around to the least significant place. To encode, just do the opposite.
int value, encoded_value;
encoded_value = 0xB0;
value = ((encoded_value << 1) | (encoded_value >> 7)) & 255;
// value will be 0x61;
encoded_value = ((value >> 1) | (value << 7)) & 255;
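Applied to the whole sequence from the question, the rotation might be sketched in C as follows. The helper names rotl8, rotr8 and decode are made up for this example. Note that no single XOR key fits the data: B0 xor 61 is D1, while 32 xor 64 is 56.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate one byte left / right by one bit. */
static uint8_t rotl8(uint8_t b) { return (uint8_t)((b << 1) | (b >> 7)); }
static uint8_t rotr8(uint8_t b) { return (uint8_t)((b >> 1) | (b << 7)); }

/* Decode len bytes into out, which must have room for len + 1 chars. */
static void decode(const uint8_t *in, size_t len, char *out) {
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; ++i) {
        out[i] = (char)rotl8(in[i]);
    }
    out[len] = '\0';
}
```

Decoding the five bytes B0 32 B6 B4 37 this way yields "admin", and rotr8 maps each plain byte back to its encoded form.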

Find nth SET bit in an int

Instead of just the lowest set bit, I want to find the position of the nth lowest set bit. (I'm NOT talking about value on the nth bit position)
For example, say I have:
0000 1101 1000 0100 1100 1000 1010 0000
And I want to find the 4th bit that is set. Then I want it to return:
0000 0000 0000 0000 0100 0000 0000 0000
If popcnt(v) < n, it would make sense if this function returned 0, but any behavior for this case is acceptable for me.
I'm looking for something faster than a loop if possible.

Nowadays this is very easy with PDEP from the BMI2 instruction set. Here is a 64-bit version with some examples:
#include <cassert>
#include <cstdint>
#include <x86intrin.h>
inline uint64_t nthset(uint64_t x, unsigned n) {
return _pdep_u64(1ULL << n, x);
}
int main() {
assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 0) ==
0b0000'0000'0000'0000'0000'0000'0010'0000);
assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 1) ==
0b0000'0000'0000'0000'0000'0000'1000'0000);
assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 3) ==
0b0000'0000'0000'0000'0100'0000'0000'0000);
assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 9) ==
0b0000'1000'0000'0000'0000'0000'0000'0000);
assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 10) ==
0b0000'0000'0000'0000'0000'0000'0000'0000);
}
If you just want the (zero-based) index of the nth set bit, add a trailing zero count.
inline unsigned nthset(uint64_t x, unsigned n) {
return _tzcnt_u64(_pdep_u64(1ULL << n, x));
}

It turns out that it is indeed possible to do this with no loops. It is fastest to precompute the (at least) 8-bit version of this problem. Of course, these tables use up cache space, but there should still be a net speedup in virtually all modern PC scenarios. In this code, n=0 returns the lowest set bit, n=1 the second-lowest, etc.
Solution with __popcnt
There is a solution using the __popcnt intrinsic (you need __popcnt to be extremely fast, or any perf gains over a simple loop solution will be moot. Fortunately, most SSE4+ era processors support it).
// lookup table for sub-problem: 8-bit v
byte PRECOMP[256][8] = { .... } // PRECOMP[v][n] for v < 256 and n < 8
ulong nthSetBit(ulong v, ulong n) {
ulong p = __popcnt(v & 0xFFFF);
ulong shift = 0;
if (p <= n) {
v >>= 16;
shift += 16;
n -= p;
}
p = __popcnt(v & 0xFF);
if (p <= n) {
shift += 8;
v >>= 8;
n -= p;
}
if (n >= 8) return 0; // optional safety, in case n > # of set bits
return PRECOMP[v & 0xFF][n] << shift;
}
This illustrates how the divide and conquer approach works.
General Solution
There is also a solution for "general" architectures- without __popcnt. It can be done by processing in 8-bit chunks. You need one more lookup table that tells you the popcnt of a byte:
byte PRECOMP[256][8] = { .... } // PRECOMP[v][n] for v<256 and n < 8
byte POPCNT[256] = { ... } // POPCNT[v] is the number of set bits in v. (v < 256)
ulong nthSetBit(ulong v, ulong n) {
ulong p = POPCNT[v & 0xFF];
ulong shift = 0;
if (p <= n) {
n -= p;
v >>= 8;
shift += 8;
p = POPCNT[v & 0xFF];
if (p <= n) {
n -= p;
shift += 8;
v >>= 8;
p = POPCNT[v & 0xFF];
if (p <= n) {
n -= p;
shift += 8;
v >>= 8;
}
}
}
if (n >= 8) return 0; // optional safety, in case n > # of set bits
return PRECOMP[v & 0xFF][n] << shift;
}
This could, of course, be done with a loop, but the unrolled form is faster and the unusual form of the loop would make it unlikely that the compiler could automatically unroll it for you.
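The table contents are elided above. One way they could be filled at startup is sketched here in plain C, together with a compact loop form of the byte-wise lookup. This assumes PRECOMP[v][n] holds the isolated mask of the n-th (0-based) set bit of byte v, or 0 if there is none; init_tables and nth_set_bit are hypothetical names, and the loop trades the unrolled form for brevity.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t PRECOMP[256][8]; /* mask of the n-th set bit of a byte, or 0 */
static uint8_t POPCNT[256];     /* number of set bits in a byte */

/* Fill both tables once; unused PRECOMP entries stay 0 (static storage). */
static void init_tables(void) {
    for (int v = 0; v < 256; ++v) {
        int count = 0;
        for (int bit = 0; bit < 8; ++bit) {
            if (v & (1 << bit)) {
                PRECOMP[v][count++] = (uint8_t)(1 << bit);
            }
        }
        assert(count <= 8);
        POPCNT[v] = (uint8_t)count;
    }
}

/* Same byte-at-a-time walk as the unrolled version, written as a loop. */
static uint32_t nth_set_bit(uint32_t v, unsigned n) {
    unsigned shift = 0;
    for (int step = 0; step < 3; ++step) {
        unsigned p = POPCNT[v & 0xFF];
        if (p > n) break;      /* the wanted bit is in the current byte */
        n -= p;
        v >>= 8;
        shift += 8;
    }
    if (n >= 8) return 0;      /* n exceeds the number of set bits */
    return (uint32_t)PRECOMP[v & 0xFF][n] << shift;
}
```

After init_tables(), the question's example value 0x0D84C8A0 with n = 3 yields 0x4000, and n = 10 yields 0 because only ten bits are set.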

v-1 has a zero where v has its least significant "one" bit, while all more significant bits are the same. This leads to the following function:
int ffsn(unsigned int v, int n) {
    // n is 1-based: n == 1 returns the lowest set bit
    for (int i=0; i<n-1; i++) {
        v &= v-1; // clear the least significant set bit
    }
    return v & ~(v-1); // isolate the least significant set bit
}

The version from bit-twiddling hacks adapted to this case is, for example,
unsigned int nth_bit_set(uint32_t value, unsigned int n)
{
const uint32_t pop2 = (value & 0x55555555u) + ((value >> 1) & 0x55555555u);
const uint32_t pop4 = (pop2 & 0x33333333u) + ((pop2 >> 2) & 0x33333333u);
const uint32_t pop8 = (pop4 & 0x0f0f0f0fu) + ((pop4 >> 4) & 0x0f0f0f0fu);
const uint32_t pop16 = (pop8 & 0x00ff00ffu) + ((pop8 >> 8) & 0x00ff00ffu);
const uint32_t pop32 = (pop16 & 0x000000ffu) + ((pop16 >>16) & 0x000000ffu);
unsigned int rank = 0;
unsigned int temp;
if (n++ >= pop32)
return 32;
temp = pop16 & 0xffu;
/* if (n > temp) { n -= temp; rank += 16; } */
rank += ((temp - n) & 256) >> 4;
n -= temp & ((temp - n) >> 8);
temp = (pop8 >> rank) & 0xffu;
/* if (n > temp) { n -= temp; rank += 8; } */
rank += ((temp - n) & 256) >> 5;
n -= temp & ((temp - n) >> 8);
temp = (pop4 >> rank) & 0x0fu;
/* if (n > temp) { n -= temp; rank += 4; } */
rank += ((temp - n) & 256) >> 6;
n -= temp & ((temp - n) >> 8);
temp = (pop2 >> rank) & 0x03u;
/* if (n > temp) { n -= temp; rank += 2; } */
rank += ((temp - n) & 256) >> 7;
n -= temp & ((temp - n) >> 8);
temp = (value >> rank) & 0x01u;
/* if (n > temp) rank += 1; */
rank += ((temp - n) & 256) >> 8;
return rank;
}
which, when compiled in a separate compilation unit, on gcc-5.4.0 using -Wall -O3 -march=native -mtune=native on Intel Core i5-4200u, yields
00400a40 <nth_bit_set>:
400a40: 89 f9 mov %edi,%ecx
400a42: 89 f8 mov %edi,%eax
400a44: 55 push %rbp
400a45: 40 0f b6 f6 movzbl %sil,%esi
400a49: d1 e9 shr %ecx
400a4b: 25 55 55 55 55 and $0x55555555,%eax
400a50: 53 push %rbx
400a51: 81 e1 55 55 55 55 and $0x55555555,%ecx
400a57: 01 c1 add %eax,%ecx
400a59: 41 89 c8 mov %ecx,%r8d
400a5c: 89 c8 mov %ecx,%eax
400a5e: 41 c1 e8 02 shr $0x2,%r8d
400a62: 25 33 33 33 33 and $0x33333333,%eax
400a67: 41 81 e0 33 33 33 33 and $0x33333333,%r8d
400a6e: 41 01 c0 add %eax,%r8d
400a71: 45 89 c1 mov %r8d,%r9d
400a74: 44 89 c0 mov %r8d,%eax
400a77: 41 c1 e9 04 shr $0x4,%r9d
400a7b: 25 0f 0f 0f 0f and $0xf0f0f0f,%eax
400a80: 41 81 e1 0f 0f 0f 0f and $0xf0f0f0f,%r9d
400a87: 41 01 c1 add %eax,%r9d
400a8a: 44 89 c8 mov %r9d,%eax
400a8d: 44 89 ca mov %r9d,%edx
400a90: c1 e8 08 shr $0x8,%eax
400a93: 81 e2 ff 00 ff 00 and $0xff00ff,%edx
400a99: 25 ff 00 ff 00 and $0xff00ff,%eax
400a9e: 01 d0 add %edx,%eax
400aa0: 0f b6 d8 movzbl %al,%ebx
400aa3: c1 e8 10 shr $0x10,%eax
400aa6: 0f b6 d0 movzbl %al,%edx
400aa9: b8 20 00 00 00 mov $0x20,%eax
400aae: 01 da add %ebx,%edx
400ab0: 39 f2 cmp %esi,%edx
400ab2: 77 0c ja 400ac0 <nth_bit_set+0x80>
400ab4: 5b pop %rbx
400ab5: 5d pop %rbp
400ab6: c3 retq
400ac0: 83 c6 01 add $0x1,%esi
400ac3: 89 dd mov %ebx,%ebp
400ac5: 29 f5 sub %esi,%ebp
400ac7: 41 89 ea mov %ebp,%r10d
400aca: c1 ed 08 shr $0x8,%ebp
400acd: 41 81 e2 00 01 00 00 and $0x100,%r10d
400ad4: 21 eb and %ebp,%ebx
400ad6: 41 c1 ea 04 shr $0x4,%r10d
400ada: 29 de sub %ebx,%esi
400adc: c4 42 2b f7 c9 shrx %r10d,%r9d,%r9d
400ae1: 41 0f b6 d9 movzbl %r9b,%ebx
400ae5: 89 dd mov %ebx,%ebp
400ae7: 29 f5 sub %esi,%ebp
400ae9: 41 89 e9 mov %ebp,%r9d
400aec: 41 81 e1 00 01 00 00 and $0x100,%r9d
400af3: 41 c1 e9 05 shr $0x5,%r9d
400af7: 47 8d 14 11 lea (%r9,%r10,1),%r10d
400afb: 41 89 e9 mov %ebp,%r9d
400afe: 41 c1 e9 08 shr $0x8,%r9d
400b02: c4 42 2b f7 c0 shrx %r10d,%r8d,%r8d
400b07: 41 83 e0 0f and $0xf,%r8d
400b0b: 44 21 cb and %r9d,%ebx
400b0e: 45 89 c3 mov %r8d,%r11d
400b11: 29 de sub %ebx,%esi
400b13: 5b pop %rbx
400b14: 41 29 f3 sub %esi,%r11d
400b17: 5d pop %rbp
400b18: 44 89 da mov %r11d,%edx
400b1b: 41 c1 eb 08 shr $0x8,%r11d
400b1f: 81 e2 00 01 00 00 and $0x100,%edx
400b25: 45 21 d8 and %r11d,%r8d
400b28: c1 ea 06 shr $0x6,%edx
400b2b: 44 29 c6 sub %r8d,%esi
400b2e: 46 8d 0c 12 lea (%rdx,%r10,1),%r9d
400b32: c4 e2 33 f7 c9 shrx %r9d,%ecx,%ecx
400b37: 83 e1 03 and $0x3,%ecx
400b3a: 41 89 c8 mov %ecx,%r8d
400b3d: 41 29 f0 sub %esi,%r8d
400b40: 44 89 c0 mov %r8d,%eax
400b43: 41 c1 e8 08 shr $0x8,%r8d
400b47: 25 00 01 00 00 and $0x100,%eax
400b4c: 44 21 c1 and %r8d,%ecx
400b4f: c1 e8 07 shr $0x7,%eax
400b52: 29 ce sub %ecx,%esi
400b54: 42 8d 14 08 lea (%rax,%r9,1),%edx
400b58: c4 e2 6b f7 c7 shrx %edx,%edi,%eax
400b5d: 83 e0 01 and $0x1,%eax
400b60: 29 f0 sub %esi,%eax
400b62: 25 00 01 00 00 and $0x100,%eax
400b67: c1 e8 08 shr $0x8,%eax
400b6a: 01 d0 add %edx,%eax
400b6c: c3 retq
When compiled as a separate compilation unit, timing on this machine is difficult, because the actual operation is as fast as calling a do-nothing function (also compiled in a separate compilation unit); essentially, the calculation is done during the latencies associated with the function call.
It seems to be slightly faster than my suggestion of a binary search,
unsigned int nth_bit_set(uint32_t value, unsigned int n)
{
uint32_t mask = 0x0000FFFFu;
unsigned int size = 16u;
unsigned int base = 0u;
if (n++ >= __builtin_popcount(value))
return 32;
while (size > 0) {
const unsigned int count = __builtin_popcount(value & mask);
if (n > count) {
base += size;
size >>= 1;
mask |= mask << size;
} else {
size >>= 1;
mask >>= size;
}
}
return base;
}
where the loop is executed exactly five times, compiling to
00400ba0 <nth_bit_set>:
400ba0: 83 c6 01 add $0x1,%esi
400ba3: 31 c0 xor %eax,%eax
400ba5: b9 10 00 00 00 mov $0x10,%ecx
400baa: ba ff ff 00 00 mov $0xffff,%edx
400baf: 45 31 db xor %r11d,%r11d
400bb2: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
400bb8: 41 89 c9 mov %ecx,%r9d
400bbb: 41 89 f8 mov %edi,%r8d
400bbe: 41 d0 e9 shr %r9b
400bc1: 41 21 d0 and %edx,%r8d
400bc4: c4 62 31 f7 d2 shlx %r9d,%edx,%r10d
400bc9: f3 45 0f b8 c0 popcnt %r8d,%r8d
400bce: 41 09 d2 or %edx,%r10d
400bd1: 44 38 c6 cmp %r8b,%sil
400bd4: 41 0f 46 cb cmovbe %r11d,%ecx
400bd8: c4 e2 33 f7 d2 shrx %r9d,%edx,%edx
400bdd: 41 0f 47 d2 cmova %r10d,%edx
400be1: 01 c8 add %ecx,%eax
400be3: 44 89 c9 mov %r9d,%ecx
400be6: 45 84 c9 test %r9b,%r9b
400be9: 75 cd jne 400bb8 <nth_bit_set+0x18>
400beb: c3 retq
as in, not more than 31 cycles in 95% of calls to the binary search version, compared to not more than 28 cycles in 95% of calls to the bit-hack version; both run within 28 cycles in 50% of the cases. (The loop version takes up to 56 cycles in 95% of calls, up to 37 cycles median.)
To determine which one is better in actual real-world code, one would have to do a proper benchmark within the real-world task; at least with current x86-64 architecture processors, the work done is easily hidden in latencies incurred elsewhere (like function calls).

My answer is mostly based on this implementation of a 64bit word select method (Hint: Look only at the MARISA_USE_POPCNT, MARISA_X64, MARISA_USE_SSE3 codepaths):
It works in two steps, first selecting the byte containing the n-th set bit and then using a lookup table inside the byte:
1. Extract the lower and higher nibbles for every byte (bitmasks 0xF, 0xF0, shift the higher nibbles down)
2. Replace the nibble values by their popcount (_mm_shuffle_epi8 with the popcount sequence, OEIS A000120)
3. Sum the popcounts of the lower and upper nibbles (normal SSE addition) to get byte popcounts
4. Compute the prefix sum over all byte popcounts (multiplication with 0x01010101...)
5. Propagate the position n to all bytes (SSE broadcast or again multiplication with 0x01010101...)
6. Do a bytewise comparison (_mm_cmpgt_epi8 leaves 0xFF in every byte smaller than n)
7. Compute the byte offset by doing a popcount on the result
Now we know which byte contains the bit and a simple byte lookup table like in grek40's answer suffices to get the result.
Note however that I have not really benchmarked this result against other implementations, only that I have seen it to be quite efficient (and branchless)
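The same two-step idea can be illustrated without SSE. The C sketch below is not the MARISA code: it swaps the nibble shuffle for ordinary SWAR arithmetic, the SIMD comparison for a short loop over the eight prefix-sum bytes, and the final table lookup for a bit scan. The name select64 and the convention of returning 64 when there is no such bit are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Index (0-based) of the n-th (0-based) set bit of x, or 64 if none. */
static unsigned select64(uint64_t x, unsigned n) {
    /* Per-byte popcounts, computed with SWAR arithmetic. */
    uint64_t counts = x - ((x >> 1) & 0x5555555555555555ULL);
    counts = (counts & 0x3333333333333333ULL) +
             ((counts >> 2) & 0x3333333333333333ULL);
    counts = (counts + (counts >> 4)) & 0x0F0F0F0F0F0F0F0FULL;

    /* Prefix sums: byte k now holds the popcount of bytes 0..k. */
    uint64_t prefix = counts * 0x0101010101010101ULL;

    /* Count the bytes whose prefix sum is still <= n. */
    unsigned byte = 0;
    for (unsigned k = 0; k < 8; ++k) {
        byte += ((prefix >> (8 * k)) & 0xFF) <= n;
    }
    if (byte == 8) return 64;   /* fewer than n + 1 bits are set */

    /* Rank of the wanted bit inside the selected byte. */
    unsigned before = byte ? (unsigned)((prefix >> (8 * (byte - 1))) & 0xFF) : 0;
    uint64_t b = (x >> (8 * byte)) & 0xFF;
    for (unsigned i = 0; i < n - before; ++i) {
        b &= b - 1;             /* drop lower set bits */
    }
    assert(b != 0);

    unsigned pos = 0;
    while (!(b & 1)) { b >>= 1; ++pos; }
    return 8 * byte + pos;
}
```

For the example value from the question, select64(0x0D84C8A0, 3) gives bit index 14.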

I can't see a method without a loop; what springs to mind would be:
int set = 0;
int pos = 0;
while(set < n) {
    if((bits & 0x01) == 1) set++;
    bits = bits >> 1;
    pos++;
}
after which pos would hold the position of the nth lowest-value set bit, counting positions from 1.
The only other thing that I can think of would be a divide and conquer approach, which might yield O(log(n)) rather than O(n)...but probably not.
Edit: you said any behaviour, so non-termination is ok, right? :P

def bitN (l: Long, i: Int) : Long = {
def bitI (l: Long, i: Int) : Long =
if (i == 0) 1L else
2 * {
if (l % 2 == 0) bitI (l / 2, i) else bitI (l /2, i-1)
}
bitI (l, i) / 2
}
A recursive method (in Scala). Decrement i, the position, whenever l modulo 2 is 1. While returning, multiply by 2. Since the multiplication is invoked as the last operation, it is not tail recursive, but since Longs are of known size in advance, the maximum stack depth is not too big.
scala> n.toBinaryString.replaceAll ("(.{8})", "$1 ")
res117: java.lang.String = 10110011 11101110 01011110 01111110 00111101 11100101 11101011 011000
scala> bitN (n, 40) .toBinaryString.replaceAll ("(.{8})", "$1 ")
res118: java.lang.String = 10000000 00000000 00000000 00000000 00000000 00000000 00000000 000000

Edit
After giving it some thought and using the __builtin_popcount function, I figured it might be better to decide on the relevant byte and then compute the whole result instead of incrementally adding/subtracting numbers. Here is an updated version:
int GetBitAtPosition(unsigned i, unsigned n)
{
unsigned bitCount;
bitCount = __builtin_popcount(i & 0x00ffffff);
if (bitCount <= n)
{
return (24 + LUT_BitPosition[i >> 24][n - bitCount]);
}
bitCount = __builtin_popcount(i & 0x0000ffff);
if (bitCount <= n)
{
return (16 + LUT_BitPosition[(i >> 16) & 0xff][n - bitCount]);
}
bitCount = __builtin_popcount(i & 0x000000ff);
if (bitCount <= n)
{
return (8 + LUT_BitPosition[(i >> 8) & 0xff][n - bitCount]);
}
return LUT_BitPosition[i & 0xff][n];
}
I felt like creating a LUT based solution where the number is inspected in byte-chunks, however, the LUT for the n-th bit position grew quite large (256*8) and the LUT-free version that was discussed in the comments might be better.
Generally the algorithm would look like this:
unsigned i = 0x000006B5;
unsigned n = 4;
unsigned result = 0;
unsigned bitCount;
while (i)
{
bitCount = LUT_BitCount[i & 0xff];
if (n < bitCount)
{
result += LUT_BitPosition[i & 0xff][n];
break; // found
}
else
{
n -= bitCount;
result += 8;
i >>= 8;
}
}
It might be worth unrolling the loop into its (up to) 4 iterations to get the best performance on 32-bit numbers.
The LUT for bitcount (could be replaced by __builtin_popcount):
unsigned LUT_BitCount[] = {
0, 1, 1, 2, 1, 2, 2, 3, // 0-7
1, 2, 2, 3, 2, 3, 3, 4, // 8-15
1, 2, 2, 3, 2, 3, 3, 4, // 16-23
2, 3, 3, 4, 3, 4, 4, 5, // 24-31
1, 2, 2, 3, 2, 3, 3, 4, // 32-39
2, 3, 3, 4, 3, 4, 4, 5, // 40-47
2, 3, 3, 4, 3, 4, 4, 5, // 48-55
3, 4, 4, 5, 4, 5, 5, 6, // 56-63
1, 2, 2, 3, 2, 3, 3, 4, // 64-71
2, 3, 3, 4, 3, 4, 4, 5, // 72-79
2, 3, 3, 4, 3, 4, 4, 5, // 80-87
3, 4, 4, 5, 4, 5, 5, 6, // 88-95
2, 3, 3, 4, 3, 4, 4, 5, // 96-103
3, 4, 4, 5, 4, 5, 5, 6, // 104-111
3, 4, 4, 5, 4, 5, 5, 6, // 112-119
4, 5, 5, 6, 5, 6, 6, 7, // 120-127
1, 2, 2, 3, 2, 3, 3, 4, // 128-135
2, 3, 3, 4, 3, 4, 4, 5, // 136-143
2, 3, 3, 4, 3, 4, 4, 5, // 144-151
3, 4, 4, 5, 4, 5, 5, 6, // 152-159
2, 3, 3, 4, 3, 4, 4, 5, // 160-167
3, 4, 4, 5, 4, 5, 5, 6, // 168-175
3, 4, 4, 5, 4, 5, 5, 6, // 176-183
4, 5, 5, 6, 5, 6, 6, 7, // 184-191
2, 3, 3, 4, 3, 4, 4, 5, // 192-199
3, 4, 4, 5, 4, 5, 5, 6, // 200-207
3, 4, 4, 5, 4, 5, 5, 6, // 208-215
4, 5, 5, 6, 5, 6, 6, 7, // 216-223
3, 4, 4, 5, 4, 5, 5, 6, // 224-231
4, 5, 5, 6, 5, 6, 6, 7, // 232-239
4, 5, 5, 6, 5, 6, 6, 7, // 240-247
5, 6, 6, 7, 6, 7, 7, 8, // 248-255
};
The LUT for bit position within a byte:
unsigned LUT_BitPosition[][8] = {
// 0-7
{UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
// 8-15
{3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
// 16-31
{4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,4,UINT_MAX,UINT_MAX,UINT_MAX},
// 32-63
{5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,5,UINT_MAX,UINT_MAX,UINT_MAX},
{4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,4,5,UINT_MAX,UINT_MAX,UINT_MAX},
{3,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,4,5,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,4,5,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,4,5,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,4,5,UINT_MAX,UINT_MAX},
// 64-127
{6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,6,UINT_MAX,UINT_MAX,UINT_MAX},
{4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,4,6,UINT_MAX,UINT_MAX,UINT_MAX},
{3,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,4,6,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,4,6,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,4,6,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,4,6,UINT_MAX,UINT_MAX},
{5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{3,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,5,6,UINT_MAX,UINT_MAX},
{4,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{2,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,4,5,6,UINT_MAX,UINT_MAX},
{3,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,4,5,6,UINT_MAX,UINT_MAX},
{2,3,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,4,5,6,UINT_MAX,UINT_MAX},
{1,2,3,4,5,6,UINT_MAX,UINT_MAX},
{0,1,2,3,4,5,6,UINT_MAX},
// 128-255
{7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,7,UINT_MAX,UINT_MAX,UINT_MAX},
{4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,4,7,UINT_MAX,UINT_MAX,UINT_MAX},
{3,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,4,7,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,4,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,4,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,4,7,UINT_MAX,UINT_MAX},
{5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{3,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,5,7,UINT_MAX,UINT_MAX},
{4,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{2,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,4,5,7,UINT_MAX,UINT_MAX},
{3,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,4,5,7,UINT_MAX,UINT_MAX},
{2,3,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,4,5,7,UINT_MAX,UINT_MAX},
{1,2,3,4,5,7,UINT_MAX,UINT_MAX},
{0,1,2,3,4,5,7,UINT_MAX},
{6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{2,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{3,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{2,3,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,3,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,3,6,7,UINT_MAX,UINT_MAX},
{4,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{2,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,4,6,7,UINT_MAX,UINT_MAX},
{3,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,4,6,7,UINT_MAX,UINT_MAX},
{2,3,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,4,6,7,UINT_MAX,UINT_MAX},
{1,2,3,4,6,7,UINT_MAX,UINT_MAX},
{0,1,2,3,4,6,7,UINT_MAX},
{5,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{1,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{2,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,2,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,2,5,6,7,UINT_MAX,UINT_MAX},
{3,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,3,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,3,5,6,7,UINT_MAX,UINT_MAX},
{2,3,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,3,5,6,7,UINT_MAX,UINT_MAX},
{1,2,3,5,6,7,UINT_MAX,UINT_MAX},
{0,1,2,3,5,6,7,UINT_MAX},
{4,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
{0,4,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{1,4,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,1,4,5,6,7,UINT_MAX,UINT_MAX},
{2,4,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,2,4,5,6,7,UINT_MAX,UINT_MAX},
{1,2,4,5,6,7,UINT_MAX,UINT_MAX},
{0,1,2,4,5,6,7,UINT_MAX},
{3,4,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
{0,3,4,5,6,7,UINT_MAX,UINT_MAX},
{1,3,4,5,6,7,UINT_MAX,UINT_MAX},
{0,1,3,4,5,6,7,UINT_MAX},
{2,3,4,5,6,7,UINT_MAX,UINT_MAX},
{0,2,3,4,5,6,7,UINT_MAX},
{1,2,3,4,5,6,7,UINT_MAX},
{0,1,2,3,4,5,6,7},
};
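
The rows above read like a lookup table indexed by byte value: row k lists the positions of the set bits of k in ascending order, and the unused slots are padded with UINT_MAX. Assuming that reading is right, a table of this shape can be generated instead of typed by hand. The sketch below uses Python for brevity; `SENTINEL` and `build_table` are illustrative names, not taken from the original code.

```python
# Assumption: UINT_MAX is the 32-bit value 0xFFFFFFFF on the target platform.
SENTINEL = 0xFFFFFFFF

def build_table(width=8):
    """Row k holds the set-bit positions of k (ascending), padded with SENTINEL."""
    table = []
    for value in range(1 << width):
        positions = [bit for bit in range(width) if (value >> bit) & 1]
        table.append(positions + [SENTINEL] * (width - len(positions)))
    return table

table = build_table()
```

With such a table, the n-th set bit (0-based) of a byte b is simply `table[b][n]`, and the sentinel signals that b has fewer than n+1 set bits.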

My approach is to calculate the population count of each 8-bit quarter of the 32-bit integer in parallel, then find which quarter contains the nth set bit. The population counts of the quarters below the one found are summed and used as the initial value for the later calculation.
After that, set bits are counted one by one until n is reached. Without branches, and using an incomplete implementation of the population count algorithm, my example is the following:
#include <stdio.h>
#include <stdint.h>
int main() {
uint32_t n = 10, test = 3124375902u; /* 10111010001110100011000101011110 */
uint32_t index, popcnt, quarter = 0, q_popcnt;
/* count set bits of each quarter of 32-bit integer in parallel */
q_popcnt = test - ((test >> 1) & 0x55555555);
q_popcnt = (q_popcnt & 0x33333333) + ((q_popcnt >> 2) & 0x33333333);
q_popcnt = (q_popcnt + (q_popcnt >> 4)) & 0x0F0F0F0F;
/* prefix sums: byte k of popcnt = number of set bits in quarters 0..k */
popcnt = q_popcnt * 0x01010101u;
/* find which quarter holds the nth (0-based) set bit */
quarter += (n >= (popcnt & 0xff));
quarter += (n >= ((popcnt >> 8) & 0xff));
quarter += (n >= ((popcnt >> 16) & 0xff));
/* number of set bits in the quarters below the selected one */
popcnt = ((popcnt << 8) >> (8 * quarter)) & 0xff;
/* find the index of nth bit in quarter where it should be */
index = 8 * quarter;
index += ((popcnt += (test >> index) & 1) <= n);
index += ((popcnt += (test >> index) & 1) <= n);
index += ((popcnt += (test >> index) & 1) <= n);
index += ((popcnt += (test >> index) & 1) <= n);
index += ((popcnt += (test >> index) & 1) <= n);
index += ((popcnt += (test >> index) & 1) <= n);
index += ((popcnt += (test >> index) & 1) <= n);
index += ((popcnt += (test >> index) & 1) <= n);
printf("index = %u\n", index);
return 0;
}
A simpler approach that uses a loop and conditionals is the following:
#include <stdio.h>
#include <stdint.h>
int main() {
uint32_t n = 11, test = 3124375902u; /* 10111010001110100011000101011110 */
uint32_t popcnt = 0, index = 0;
while(popcnt += ((test >> index) & 1), popcnt <= n && ++index < 32);
printf("index = %u\n", index);
return 0;
}
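
To make the idea easier to follow, here is a minimal Python sketch of the same two-stage search: first pick the 8-bit quarter by its population count, then scan inside that quarter. It uses ordinary branches for readability, so it illustrates the logic rather than the branch-free C. The name `nth_set_bit_bytewise` is illustrative, and n is taken to be 0-based as in the examples above.

```python
def nth_set_bit_bytewise(value, n):
    """Index of the n-th (0-based) set bit of a 32-bit value, or 32 if there is none."""
    for quarter in range(4):
        byte = (value >> (8 * quarter)) & 0xFF
        count = bin(byte).count("1")
        if n < count:
            # The wanted bit lies in this quarter: scan its 8 bits.
            for bit in range(8):
                if (byte >> bit) & 1:
                    if n == 0:
                        return 8 * quarter + bit
                    n -= 1
        # Otherwise skip the whole quarter and discount its set bits.
        n -= count
    return 32
```

Returning 32 for a missing bit mirrors what the simple loop above produces when it runs off the end of the word.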

I know the question asks for something faster than a loop, but a complicated loop-less answer is likely to take longer than a quick loop.
If the computer has 32 bit ints and v is a random value then it might have for example 16 ones and if we are looking for a random place among the 16 ones, we might typically be looking for the 8th one. 7 or 8 times round a loop with just a couple of statements isn't too bad.
int findNthBit(unsigned int n, int v)
{
int next;
if (n > __builtin_popcount(v)) return 0;
while (next = v&v-1, --n)
{
v = next;
}
return v ^ next;
}
The loop works by removing the lowest set bit (n-1) times.
The n'th one bit that would be removed is the one bit we were looking for.
If anybody wants to test this ....
#include <stdio.h>
#include <assert.h>
// function here
int main() {
assert(findNthBit(1, 0)==0);
assert(findNthBit(1, 0xf0f)==1<<0);
assert(findNthBit(2, 0xf0f)==1<<1);
assert(findNthBit(3, 0xf0f)==1<<2);
assert(findNthBit(4, 0xf0f)==1<<3);
assert(findNthBit(5, 0xf0f)==1<<8);
assert(findNthBit(6, 0xf0f)==1<<9);
assert(findNthBit(7, 0xf0f)==1<<10);
assert(findNthBit(8, 0xf0f)==1<<11);
assert(findNthBit(9, 0xf0f)==0);
printf("looks good\n");
}
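
For readers who want to experiment with the loop outside C, the same "clear the lowest set bit n-1 times" idea can be sketched in Python. This is only an illustration under the same convention as findNthBit (n is 1-based, and the result is a mask with a single bit set, or 0 when there is no such bit); `find_nth_bit` is a made-up name. It uses `v & -v` to isolate the lowest set bit, which gives the same value as `v ^ (v & (v - 1))` in the C code.

```python
def find_nth_bit(n, v):
    """Mask holding only the n-th (1-based) lowest set bit of v, or 0 if absent."""
    if n < 1 or n > bin(v).count("1"):
        return 0
    for _ in range(n - 1):
        v &= v - 1          # clear the lowest set bit
    return v & -v           # isolate the lowest remaining set bit
```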
If there are concerns about the number of times the loop is executed, for example if the function is regularly called with large values of n, it's simple to add an extra line or two of the following form:
if (n > 8) return findNthBit(n-__builtin_popcount(v&0xff), v>>8) << 8;
or
if (n > 12) return findNthBit(n - __builtin_popcount(v&0xfff), v>>12) << 12;
The idea here is that the n'th one will never be located in the bottom n-1 bits. A better version clears not only the bottom 8 or 12 bits, but all the bottom (n-1) bits when n is large-ish and we don't want to loop that many times.
if (n > 7) return findNthBit(n - __builtin_popcount(v & ((1<<(n-1))-1)), v>>(n-1)) << (n-1);
I tested this with findNthBit(20, 0xaf5faf5f) and after clearing out the bottom 19 bits because the answer wasn't to be found there, it looked for the 5th bit in the remaining bits by looping 4 times to remove 4 ones.
So an improved version is
int findNthBit(unsigned int n, int v)
{
int next;
if (n > __builtin_popcount(v)) return 0;
if (n > 7) return findNthBit(n - __builtin_popcount(v & ((1<<(n-1))-1)), v>>(n-1)) << (n-1);
while (next = v&v-1, --n)
{
v = next;
}
return v ^ next;
}
The value 7, which limits the looping, is chosen fairly arbitrarily as a compromise between limiting looping and limiting recursion. The function could be further improved by removing the recursion and keeping track of a shift amount instead. I may try this if I get some peace from home schooling my daughter!
Here is a final version with the recursion removed by keeping track of the number of low order bits shifted out from the bottom of the bits being searched.
Final version
int findNthBit(unsigned int n, int v)
{
int shifted = 0; // running total
int nBits; // value for this iteration
// handle no solution
if (n > __builtin_popcount(v)) return 0;
while (n > 7)
{
// for large n shift out lower n-1 bits from v.
nBits = n-1;
n -= __builtin_popcount(v & ((1<<nBits)-1));
v >>= nBits;
shifted += nBits;
}
int next;
// n is now small, clear out n-1 bits and return the next bit
// v&(v-1): a well known software trick to remove the lowest set bit.
while (next = v&(v-1), --n)
{
v = next;
}
return (v ^ next) << shifted;
}
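
The skipping step relies on the observation that the n-th one can never sit in the bottom n-1 bits, so those bits can be counted and shifted out in one go. A hedged Python sketch of that final version follows; `find_nth_bit_skipping`, `popcount` and the `limit` parameter are illustrative names, and the default of 7 is just the arbitrary threshold mentioned above.

```python
def popcount(v):
    """Number of set bits in a non-negative integer."""
    return bin(v).count("1")

def find_nth_bit_skipping(n, v, limit=7):
    """Mask of the n-th (1-based) set bit of v, or 0 if v has fewer set bits."""
    if n < 1 or n > popcount(v):
        return 0
    shifted = 0                                   # low-order bits discarded so far
    while n > limit:
        n_bits = n - 1                            # the n-th one is never in these bits
        n -= popcount(v & ((1 << n_bits) - 1))    # discount the ones we skip
        v >>= n_bits
        shifted += n_bits
    for _ in range(n - 1):
        v &= v - 1                                # clear the lowest set bit
    return (v & -v) << shifted
```

Each pass of the `while` loop discards at least `limit` bits, so the loop runs only a handful of times even for wide values.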

Building on the answer given by Jukka Suomela, which uses a machine-specific instruction that may not necessarily be available, it is also possible to write a function that does exactly the same thing as _pdep_u64 without any machine dependencies. It must loop over the set bits in one of the arguments, but can still be described as a constexpr function for C++11.
constexpr inline uint64_t deposit_bits(uint64_t x, uint64_t mask, uint64_t b, uint64_t res) {
return mask != 0 ? deposit_bits(x, mask & (mask - 1), b << 1, ((x & b) ? (res | (mask & (-mask))) : res)) : res;
}
constexpr inline uint64_t nthset(uint64_t x, unsigned n) {
return deposit_bits(1ULL << n, x, 1, 0);
}
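
The recursion above is compact but hard to read, so here is the same bit-deposit idea written as a plain loop in Python. It is a sketch of a software PDEP, not a replacement for the intrinsic; the function names mirror the C++ ones, and the two-argument form of `deposit_bits` is an illustrative simplification.

```python
def deposit_bits(x, mask):
    """Software PDEP: scatter the low bits of x onto the set-bit positions of mask."""
    result = 0
    source_bit = 1
    while mask:
        lowest = mask & -mask        # lowest set bit still left in the mask
        if x & source_bit:
            result |= lowest
        mask &= mask - 1             # drop that bit from the mask
        source_bit <<= 1
    return result

def nthset(x, n):
    """Mask of the n-th (0-based) set bit of x, or 0 if x has too few set bits."""
    return deposit_bits(1 << n, x)
```

Depositing the single bit `1 << n` through the mask `x` lands it exactly on the n-th set bit of `x`, which is why the one-liner works.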

Based on a method by Juha Järvi published in the famous Bit Twiddling Hacks, I tested this implementation where n and i are used as in the question:
a = i - (i >> 1 & 0x55555555);
b = (a & 0x33333333) + (a >> 2 & 0x33333333);
c = b + (b >> 4) & 0x0f0f0f0f;
r = n + 1;
s = 0;
t = c + (c >> 8) & 0xff;
if (r > t) {
s += 16;
r -= t;
}
t = c >> s & 0xf;
if (r > t) {
s += 8;
r -= t;
}
t = b >> s & 0x7;
if (r > t) {
s += 4;
r -= t;
}
t = a >> s & 0x3;
if (r > t) {
s += 2;
r -= t;
}
t = i >> s & 0x1;
if (r > t)
s++;
return (s);
Based on my own tests, this is about as fast as the loop on x86, whereas it is 20% faster on arm64 and probably a lot faster on arm due to the fast conditional instructions, but I can't test this right now.

The PDEP solution is great, but some languages, such as Java, do not offer this intrinsic yet, even though they are efficient at the other low-level operations. So I came up with the following fallback for such cases: a branchless binary search.
// n must be using 0-based indexing.
// This method produces correct results only if n is smaller
// than the number of set bits.
public static int getNthSetBit(long mask64, int n) {
// Binary search without branching
int base = 0;
final int low32 = (int) mask64;
final int high32n = n - Integer.bitCount(low32);
final int inLow32 = high32n >>> 31;
final int inHigh32 = inLow32 ^ 1;
final int shift32 = inHigh32 << 5;
final int mask32 = (int) (mask64 >>> shift32);
n = ((-inLow32) & n) | ((-inHigh32) & high32n);
base += shift32;
final int low16 = mask32 & 0xffff;
final int high16n = n - Integer.bitCount(low16);
final int inLow16 = high16n >>> 31;
final int inHigh16 = inLow16 ^ 1;
final int shift16 = inHigh16 << 4;
final int mask16 = (mask32 >>> shift16) & 0xffff;
n = ((-inLow16) & n) | ((-inHigh16) & high16n);
base += shift16;
final int low8 = mask16 & 0xff;
final int high8n = n - Integer.bitCount(low8);
final int inLow8 = high8n >>> 31;
final int inHigh8 = inLow8 ^ 1;
final int shift8 = inHigh8 << 3;
final int mask8 = (mask16 >>> shift8) & 0xff;
n = ((-inLow8) & n) | ((-inHigh8) & high8n);
base += shift8;
final int low4 = mask8 & 0xf;
final int high4n = n - Integer.bitCount(low4);
final int inLow4 = high4n >>> 31;
final int inHigh4 = inLow4 ^ 1;
final int shift4 = inHigh4 << 2;
final int mask4 = (mask8 >>> shift4) & 0xf;
n = ((-inLow4) & n) | ((-inHigh4) & high4n);
base += shift4;
final int low2 = mask4 & 3;
final int high2n = n - (low2 >> 1) - (low2 & 1);
final int inLow2 = high2n >>> 31;
final int inHigh2 = inLow2 ^ 1;
final int shift2 = inHigh2 << 1;
final int mask2 = (mask4 >>> shift2) & 3;
n = ((-inLow2) & n) | ((-inHigh2) & high2n);
base += shift2;
// For the 2 bits remaining, we can take a shortcut
return base + (n | ((mask2 ^ 1) & 1));
}
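
The branch-free Java above hides a fairly simple idea: halve the window, count the set bits in the low half, and descend into whichever half holds the wanted bit. The Python sketch below shows that search with ordinary branches so the steps are visible. It is an illustration only; `nth_set_bit_binary_search` is a made-up name, and like the Java method it assumes n is 0-based and smaller than the number of set bits.

```python
def nth_set_bit_binary_search(mask, n, width=64):
    """Index of the n-th (0-based) set bit; width must be a power of two."""
    base = 0
    while width > 1:
        width //= 2
        low = mask & ((1 << width) - 1)
        low_count = bin(low).count("1")
        if n >= low_count:
            # The wanted bit is in the high half: skip the low half.
            n -= low_count
            mask >>= width
            base += width
        else:
            mask = low
    return base
```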

can anyone help with a possible dynamic programming algorithm

Let me first apologise for the crude manner in which I am about to phrase my question. I have been referred here by a member on another site who tells me that I am looking for a dynamic programming algorithm. My question is as follows.
I am trying to sort through some data and need to find a possible sequence in the numbers.
Both sets of data include the same numbers listed in different orders, as in the example below.
54 47 33 58 46 38 48 37 56 52 61 25 ………………first set
54 52 33 61 38 58 37 25 48 56 47 46 ………………second set
In this example, reading from left to right, the numbers 54 52 61 and 25 occur in both sets in the same order.
So other possible solutions would be…
54 52 61 25
54 33 58 46
54 33 46
54 33 38 48 56
54 48 56…. Etc.
Although this can be done by hand, I have tons of this to get through and I keep making mistakes.
Does anyone know of an existing program or script that would output all of the possible solutions?
I understand the basic structure of C++ and Visual Basic programs and should be able to cobble something together in either, but to be honest I haven't done any serious programming since the days of the ZX Spectrum, so please go easy on me. My main problem, however, is not with the programming language itself: for some reason, I am finding it impossible to catalogue the steps required to complete this task in English, let alone in any other language.
Darcy

Sounds like you are looking for 'all common subsequences (ACS)', which is a cousin of the (more common) longest common subsequence problem (LCS).
Here's a paper discussing ACS (though they focus on just counting subsequences, not enumerating).
To come up with an algorithm you should define the desired output more precisely. For the sake of argument, say you want the set of subsequences not contained in any longer subsequence. Then one algorithm would be:
1) Apply the DP algorithm for LCS, generating the alignment/backtrack matrix
2) Backtrack all possible LCS, marking the alignment positions visited.
3) Select the largest element of the matrix not yet marked (longest remaining subsequence)
4) Backtrack, recording the sequence and marking visited alignment positions.
5) While there exist unmarked alignment positions, go to (3)
Backtracking in your case is complicated because you will have to visit all possible paths (called "all longest common subsequences"). You can find example implementations of LCS here, which may help to get you started.
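
As a starting point for step 1, here is a minimal Python sketch of the standard LCS dynamic program, followed by a walk through the table that recovers one longest common subsequence. It does not enumerate all of them, which is the harder part described above. The names `longest_common_subsequence`, `first` and `second` are illustrative; the data is the example from the question.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of the lists a and b."""
    rows, cols = len(a), len(b)
    # length[i][j] = LCS length of the suffixes a[i:] and b[j:]
    length = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                length[i][j] = 1 + length[i + 1][j + 1]
            else:
                length[i][j] = max(length[i + 1][j], length[i][j + 1])
    # Walk the table from the top-left corner to rebuild one solution.
    i = j = 0
    result = []
    while i < rows and j < cols:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif length[i + 1][j] >= length[i][j + 1]:
            i += 1
        else:
            j += 1
    return result

first = [54, 47, 33, 58, 46, 38, 48, 37, 56, 52, 61, 25]
second = [54, 52, 33, 61, 38, 58, 37, 25, 48, 56, 47, 46]
longest = longest_common_subsequence(first, second)
```

The table costs O(n*m) time and memory, which is negligible for lists of this size.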

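I wrote this code and it outputs the longest common subsequence. It is not optimized, though: the recursion has no memoization and revisits the same subproblems, so the worst case grows exponentially rather than staying at the O(n*m) of the dynamic programming solution (n is the size of array1, m the size of array2):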
private void start() {
int []a = {54, 47, 33, 58, 46, 38, 48, 37, 56, 52, 61, 25};
int []b = {54, 52, 33, 61, 38, 58, 37, 25, 48, 56, 47, 46};
System.out.println(search(a,b));
}
private String search(int[] a, int[] b)
{
return search(a, b, 0, 0).toString();
}
private Vector<Integer> search(int[] a, int[] b, int s1, int s2) {
Vector<Vector<Integer>> v = new Vector<Vector<Integer>>();
for ( int i = s1; i < a.length; i++ )
{
int newS2 = find(b, a[i], s2);
if ( newS2 != -1 )
{
Vector<Integer> temp = new Vector<Integer>();
temp.add(a[i]);
Vector<Integer> others = search(a, b, i+1, newS2 + 1);
for ( int k = 0; k < others.size(); k++)
temp.add( others.get(k));
v.add(temp);
}
}
int maxSize = 0;
Vector<Integer> ret = new Vector<Integer>();
for ( int i = 0; i < v.size(); i++)
if ( v.get(i).size() > maxSize )
{
maxSize = v.get(i).size();
ret = v.get(i);
}
return ret;
}
private int find(int[] b, int elemToFind, int s2) {
for ( int j = s2; j < b.length; j++)
if ( b[j] == elemToFind)
return j;
return -1;
}

Code Golf: Gray Code

The Challenge
The shortest program by character count that outputs the n-bit Gray code. n will be an arbitrary number smaller than 100000 (raised from 1000 due to user suggestions) that is taken from standard input. The Gray code will be printed to standard output, as in the example.
Note: I don't expect the program to print the gray code in a reasonable time (n=100000 is overkill); I do expect it to start printing though.
Example
Input:
4
Expected Output:
0000
0001
0011
0010
0110
0111
0101
0100
1100
1101
1111
1110
1010
1011
1001
1000

Python - 53 chars
n=1<<input()
for x in range(n):print bin(n+x^x/2)[3:]
This 54 char version overcomes the limitation of range in Python2 so n=100000 works!
x,n=0,1<<input()
while n>x:print bin(n+x^x/2)[3:];x+=1
69 chars
G=lambda n:n and[x+y for x in'01'for y in G(n-1)[::1-2*int(x)]]or['']
75 chars
G=lambda n:n and['0'+x for x in G(n-1)]+['1'+x for x in G(n-1)[::-1]]or['']
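
The short versions above rely on the identity that the x-th Gray code is x XOR (x shifted right by one). Adding `n = 1 << bits` before calling `bin` sets a leading 1 so that the leading zeros survive, and `[3:]` then strips the `0b1` prefix. An ungolfed Python 3 sketch of the same idea, with the illustrative name `gray_code_lines`:

```python
def gray_code_lines(n):
    """Yield the n-bit reflected Gray code as zero-padded binary strings."""
    for x in range(1 << n):
        yield format(x ^ (x >> 1), "0{}b".format(n))

four_bit = list(gray_code_lines(4))
```

Because this is a generator over a lazy `range`, it starts producing output immediately even for very large n.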

APL (29 chars)
With the function F as (⌽ is the 'rotate' char)
z←x F y
z←(0,¨y),1,¨⌽y
This produces the Gray Code with 5 digits (⍴ is now the 'rho' char)
F/5⍴⊂0,1
The number '5' can be changed or be a variable.
(Sorry about the non-printable APL chars. SO won't let me post images as a new user)

Impossible! language (54 chars, previously 58)
#l{'0,'1}1[;#l<][%;~['1%+].>.%['0%+].>.+//%1+]<>%[^].>
Test run:
./impossible gray.i! 5
Impossible v0.1.28
00000
00001
00011
00010
00110
00111
00101
00100
01100
01101
01111
01110
01010
01011
01001
01000
11000
11001
11011
11010
11110
11111
11101
11100
10100
10101
10111
10110
10010
10011
10001
10000
(actually I don't know if personal languages are allowed, since Impossible! is still under development, but I wanted to post it anyway..)

Golfscript - 27 chars
Reads from stdin, writes to stdout
~2\?:),{.2/^)+2base''*1>n}%
Sample run
$ echo 4 | ruby golfscript.rb gray.gs
0000
0001
0011
0010
0110
0111
0101
0100
1100
1101
1111
1110
1010
1011
1001
1000

Ruby - 49 chars
(1<<n=gets.to_i).times{|x|puts"%.#{n}b"%(x^x/2)}
This works for n=100000 with no problem

C++, 168 characters, not including whitespaces:
#include <iostream>
#include <string>
int r;
void x(std::string p, char f=48)
{
if(!r--)std::cout<<p<<'\n';else
{x(p+f);x(p+char(f^1),49);}
r++;
}
int main() {
std::cin>>r;
x("");
return 0;
}

Haskell, 82 characters:
f a=map('0':)a++map('1':)(reverse a)
main=interact$unlines.(iterate f[""]!!).read
Point-free style for teh win! (or at least 4 fewer strokes). Kudos to FUZxxl.
previous: 86 characters:
f a=map('0':)a++map('1':)(reverse a)
main=interact$ \s->unlines$iterate f[""]!!read s
Cut two strokes with interact, one with unlines.
older: 89 characters:
f a=map('0':)a++map('1':)(reverse a)
main=readLn>>= \s->putStr$concat$iterate f["\n"]!!s
Note that the laziness gets you your immediate output for free.

Mathematica 50 Chars
Nest[Join["0"<>#&/@#,"1"<>#&/@Reverse@#]&,{""},#]&
Thanks to A. Rex for suggestions!
Previous attempts
Here is my attempt in Mathematica (140 characters). I know that it isn't the shortest, but I think it is the easiest to follow if you are familiar with functional programming (though that could be my language bias showing). The addbit function takes an n-bit Gray code and returns an (n+1)-bit Gray code using the logic from the Wikipedia page. The MakeGray function applies addbit in a nested manner to the 1-bit Gray code, {{0}, {1}}, until an n-bit version is created. FromCharacterCode prints just the digits, without the braces and commas that appear in the output of addbit.
addbit[set_] :=
Join[Map[Prepend[#, 0] &, set], Map[Prepend[#, 1] &, Reverse[set]]]
MakeGray[n_] :=
Map[FromCharacterCode, Nest[addbit, {{0}, {1}}, n - 1] + 48]
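
The reflect-and-prefix construction that addbit implements can be sketched in a few lines of Python for comparison: prefix the current list with 0, then append the reversed list prefixed with 1. `reflected_gray` is an illustrative name, and the sketch builds the whole list in memory, so it is only practical for small n.

```python
def reflected_gray(n):
    """Build the n-bit Gray code by repeatedly reflecting and prefixing."""
    codes = [""]
    for _ in range(n):
        codes = ["0" + c for c in codes] + ["1" + c for c in reversed(codes)]
    return codes
```

Reversing the second half is what guarantees that the two halves meet at codes differing only in the new leading bit.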

Straightforward Python implementation of what's described in Constructing an n-bit Gray code on Wikipedia:
import sys
def _gray(n):
    if n == 1:
        return [0, 1]
    else:
        p = _gray(n-1)
        pr = [x + (1<<(n-1)) for x in p[::-1]]
        return p + pr
n = int(sys.argv[1])
for i in [("0"*n + bin(a)[2:])[-n:] for a in _gray(n)]:
    print i
(233 characters)
Test:
$ python gray.py 4
0000
0001
0011
0010
0110
0111
0101
0100
1100
1101
1111
1110
1010
1011
1001
1000

C, 203 Characters
Here's a sacrificial offering, in C:
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
char s[256];
int b, i, j, m, g;
gets(s);
b = atoi(s);
for (i = 0; i < 1 << b; ++i)
{
g = i ^ (i / 2);
m = 1 << (b - 1);
for (j = 0; j < b; ++j)
{
s[j] = (g & m) ? '1' : '0';
m >>= 1;
}
s[j] = '\0';
puts(s);
}
return 0;
}

C#, 143 characters (previously 149)
C# isn't the best language for code golf, but I thought I'd go at it anyway.
static void Main(){var s=1L<<int.Parse(Console.ReadLine());for(long i=0;i<s;i++){Console.WriteLine(Convert.ToString(s+i^i/2,2).Substring(1));}}
Readable version:
static void Main()
{
var s = 1L << int.Parse(Console.ReadLine());
for (long i = 0; i < s; i++)
{
Console.WriteLine(Convert.ToString(s + i ^ i / 2, 2).Substring(1));
}
}

And here is my Fantom sacrificial offering
public static Str[]grayCode(Int i){if(i==1)return["0","1"];else{p:=grayCode(i-1);p.addAll(p.dup.reverse);p.each|s,c|{if(c<(p.size/2))p[c]="0"+s;else p[c]="1"+s;};return p}}
(177 char)
Or the expanded version:
public static Str[] grayCode(Int i)
{
if (i==1) return ["0","1"]
else{
p := grayCode(i-1);
p.addAll(p.dup.reverse);
p.each |s,c|
{
if(c<(p.size/2))
{
p[c] = "0" + s
}
else
{
p[c] = "1" + s
}
}
return p
}
}

F#, 152 characters
let m=List.map;;let rec g l=function|1->l|x->g((m((+)"0")l)@(l|>List.rev|>m((+)"1")))(x - 1);;stdin.ReadLine()|>int|>g["0";"1"]|>List.iter(printfn "%s")

F#, too many characters (previously 180, then 175)
This morning I did another version, simplifying the recursive version, but alas due to recursion it wouldn't do the 100000.
Recursive solution:
let rec g m n l =
    if (m = n) then l
    else List.map ((+)"0") l @ List.map ((+)"1") (List.rev(l)) |> g (m+1) n
List.iter (fun x -> printfn "%s" x) (g 1 (int(stdin.ReadLine())) ["0";"1"]);;
After that was done I created a working version for the "100000" requirement - it's too long to compete with the other solutions shown here and I probably re-invented the wheel several times over, but unlike many of the solutions I have seen here it will work with a very,very large number of bits and hey it was a good learning experience for an F# noob - I didn't bother to shorten it, since it's way too long anyway ;-)
Iterative solution: (working with 100000+)
let bits = stdin.ReadLine() |>int
let n = 1I <<< bits
let bitcount (n : bigint) =
let mutable m = n
let mutable c = 1
while m > 1I do
m <- m >>>1
c<-c+1
c
let rec traverseBits m (number: bigint) =
let highbit = bigint(1 <<< m)
if m > bitcount number
then number
else
let lowbit = 1 <<< m-1
if (highbit&&& number) > 0I
then
let newnum = number ^^^ bigint(lowbit)
traverseBits (m+1) newnum
else traverseBits (m+1) number
let res = seq
{
for i in 0I..n do
yield traverseBits 1 i
}
let binary n m = seq
{
for i = m-1 downto 0 do
let bit = bigint(1 <<< i)
if bit &&&n > 0I
then yield "1"
else yield "0"
}
Seq.iter (fun x -> printfn "%s" (Seq.reduce (+) (binary x bits))) res

Lua, 156 chars
This is my throw at it in Lua, as close as I can get it.
LuaJIT (or lua with lua-bitop): 156 bytes
a=io.read()n,w,b=2^a,io.write,bit;for x=0,n-1 do t=b.bxor(n+x,b.rshift(x,1))for k=a-1,0,-1 do w(t%2^k==t%n and 0 or 1)t=t%2^k==t and t or t%2^k end w'\n'end
Lua 5.2: 154 bytes
a=io.read()n,w,b=2^a,io.write,bit32;for x=0,n-1 do t=b.XOR(n+x,b.SHR(x,1))for k=a-1,0,-1 do w(t%2^k==t%n and 0 or 1)t=t%2^k==t and t or t%2^k end w'\n'end

In cut-free Prolog (138 bytes if you remove the space after '<<'; submission editor truncates the last line without it):
b(N,D):-D=0->nl;Q is D-1,H is N>>Q/\1,write(H),b(N,Q).
c(N,D):-N=0;P is N xor(N//2),b(P,D),M is N-1,c(M,D).
:-read(N),X is 1<< N-1,c(X,N).

Ruby, 50 Chars
(2**n=gets.to_i).times{|i|puts"%0#{n}d"%i.to_s(2)}
