Bit hack to generate all integers with a given number of 1s

I have forgotten a bit hack that generates all integers with a given number of 1 bits. Does anybody remember it (and, ideally, can explain it as well)?

From Bit Twiddling Hacks.
Update: test program, live on Coliru:
#include <utility>
#include <iostream>
#include <bitset>
#include <cstdint> // for uint8_t

using I = uint8_t;

auto dump(I v) { return std::bitset<sizeof(I) * __CHAR_BIT__>(v); }

I bit_twiddle_permute(I v) {
    I t = v | (v - 1); // t gets v's least significant 0 bits set to 1
    // Next set to 1 the most significant bit to change,
    // set to 0 the least significant ones, and add the necessary 1 bits.
    I w = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
    return w;
}

int main() {
    I p = 0b001001;
    std::cout << dump(p) << "\n";
    for (I n = bit_twiddle_permute(p); n > p; p = n, n = bit_twiddle_permute(p)) {
        std::cout << dump(n) << "\n";
    }
}
Prints
00001001
00001010
00001100
00010001
00010010
00010100
00011000
00100001
00100010
00100100
00101000
00110000
01000001
01000010
01000100
01001000
01010000
01100000
10000001
10000010
10000100
10001000
10010000
10100000
11000000
Compute the lexicographically next bit permutation
Suppose we have a pattern of N bits set to 1 in an integer and we want the next permutation of N 1 bits in a lexicographical sense. For example, if N is 3 and the bit pattern is 00010011, the next patterns would be 00010101, 00010110, 00011001, 00011010, 00011100, 00100011, and so forth. The following is a fast way to compute the next permutation.
unsigned int v; // current permutation of bits
unsigned int w; // next permutation of bits
unsigned int t = v | (v - 1); // t gets v's least significant 0 bits set to 1
// Next set to 1 the most significant bit to change,
// set to 0 the least significant ones, and add the necessary 1 bits.
w = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
The __builtin_ctz(v) GNU C compiler intrinsic for x86 CPUs returns the number of trailing zeros. If you are using Microsoft compilers for x86, the intrinsic is _BitScanForward. These both emit a bsf instruction, but equivalents may be available for other architectures. If not, then consider using one of the methods for counting the consecutive zero bits mentioned earlier.
Here is another version that tends to be slower because of its division operator, but it does not require counting the trailing zeros.
unsigned int t = (v | (v - 1)) + 1;
w = t | ((((t & -t) / (v & -v)) >> 1) - 1);
Thanks to Dario Sneidermanis of Argentina, who provided this on November 28, 2009.

For bit hacks I like to refer to this page: Bit Twiddling Hacks.
Regarding your specific question, read the part entitled Compute the lexicographically next bit permutation.
Compute the lexicographically next bit permutation
Suppose we have a pattern of N bits set to 1 in an integer and we want the next permutation of N 1 bits in a lexicographical sense. For example, if N is 3 and the bit pattern is 00010011, the next patterns would be 00010101, 00010110, 00011001, 00011010, 00011100, 00100011, and so forth. The following is a fast way to compute the next permutation.
unsigned int v; // current permutation of bits
unsigned int w; // next permutation of bits
unsigned int t = v | (v - 1); // t gets v's least significant 0 bits set to 1
// Next set to 1 the most significant bit to change,
// set to 0 the least significant ones, and add the necessary 1 bits.
w = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
The __builtin_ctz(v) GNU C compiler intrinsic for x86 CPUs returns the number of trailing zeros. If you are using Microsoft compilers for x86, the intrinsic is _BitScanForward. These both emit a bsf instruction, but equivalents may be available for other architectures. If not, then consider using one of the methods for counting the consecutive zero bits mentioned earlier.
Here is another version that tends to be slower because of its division operator, but it does not require counting the trailing zeros.
unsigned int t = (v | (v - 1)) + 1;
w = t | ((((t & -t) / (v & -v)) >> 1) - 1);
Thanks to Dario Sneidermanis of Argentina, who provided this on November 28, 2009.

To add onto @sehe's answer, included below for reference (originally from Dario Sneidermanis, also at http://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation):
#include <utility>
#include <iostream>
#include <bitset>
#include <cstdint> // for uint8_t

using I = uint8_t;

auto dump(I v) { return std::bitset<sizeof(I) * __CHAR_BIT__>(v); }

I bit_twiddle_permute(I v) {
    I t = v | (v - 1); // t gets v's least significant 0 bits set to 1
    // Next set to 1 the most significant bit to change,
    // set to 0 the least significant ones, and add the necessary 1 bits.
    I w = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
    return w;
}

int main() {
    I p = 0b001001;
    std::cout << dump(p) << "\n";
    for (I n = bit_twiddle_permute(p); n > p; p = n, n = bit_twiddle_permute(p)) {
        std::cout << dump(n) << "\n";
    }
}
There are boundary issues with bit_twiddle_permute(I v). Whenever v is the last permutation, t is all 1s (e.g. 2^8 - 1), (~t & -~t) = 0, and w is the first permutation with one fewer 1 bit than v, except when v = 00000000, in which case w = 01111111. In particular, if you set p to 0, the loop in main will produce all permutations with seven 1s, and the following slight modification of the for loop will cycle through all permutations with 0, 7, 6, ..., 1 bits set:
for (I n = bit_twiddle_permute(p); n>p; n = bit_twiddle_permute(n))
If this is the intention, it is perhaps worth a comment. If not, it is trivial to fix, e.g.
if (t == (I)(-1)) { return v >> __builtin_ctz(v); }
So, with an additional small simplification:
I bit_twiddle_permute2(I v) {
    I t = (v | (v - 1)) + 1;
    if (t == 0) { return v >> __builtin_ctz(v); }
    I w = t | ((~t & v) >> (__builtin_ctz(v) + 1));
    return w;
}

int main() {
    I p = 0b1;
    std::cout << dump(p) << "\n";
    for (I n = bit_twiddle_permute2(p); n > p; n = bit_twiddle_permute2(n)) {
        std::cout << dump(n) << "\n";
    }
}
The following adaptation of Dario Sneidermanis's idea may be slightly easier to follow
I bit_twiddle_permute3(I v) {
    int n = __builtin_ctz(v);
    I s = v >> n;
    I t = s + 1;
    I w = (t << n) | ((~t & s) >> 1);
    return w;
}
or with a similar solution to the issue I mentioned at the beginning of this post
I bit_twiddle_permute3(I v) {
    if (v == 0) { return 0; }           // __builtin_ctz(0) is undefined, bail out early
    int n = __builtin_ctz(v);
    I s = v >> n;
    I t = s + 1;
    if ((I)(t << n) == 0) { return s; } // cast so the overflow test happens at I's width
    I w = (t << n) | ((~t & s) >> 1);
    return w;
}
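A minimal usage sketch (added for illustration, reusing I, dump and bit_twiddle_permute3 from above): starting from the smallest value with k bits set, this enumerates every 8-bit value with exactly k ones, which is what the original question asked for.
int main() {
    const int k = 3;
    I p = (I(1) << k) - 1;  // smallest value with k bits set: 00000111
    std::cout << dump(p) << "\n";
    for (I n = bit_twiddle_permute3(p); n > p; p = n, n = bit_twiddle_permute3(p)) {
        std::cout << dump(n) << "\n";
    }
}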

Related

How to generate uniform single precision floating point random number between 0 and 1 in FPGA?

I am trying to generate single-precision floating-point random numbers on an FPGA by generating numbers between 0 and 0x3F800000 (the IEEE-754 encoding of 1.0f). But since there are more discrete points near zero than near 1, I am not getting a uniform distribution. Is there any transformation I can apply to mimic uniform generation? I am using a 32-bit LFSR and xoshiro random number generation.
A standard way to generate uniformly distributed floats in [0,1) from uniformly distributed 32-bit unsigned integers is to multiply the integers by 2^-32. Obviously we wouldn't instantiate a floating-point multiplier on the FPGA just for this purpose, and we do not have to, since the multiplier is a power of two. In essence what is needed is a conversion of the integer to a floating-point number, then decrementing the exponent of the floating-point number by 32. This does not work for a zero input, which has to be handled as a special case. In the ISO-C99 code below I am assuming that float is mapped to the IEEE-754 binary32 type.
Other than for certain special cases, the significand of an IEEE-754 binary floating-point number is normalized to [1,2). To convert an integer into the significand, we need to normalize it, so the most significant bit is set. We can do this by counting the number of leading zero bits, then left shifting the number by that amount. The count of leading zeros is also needed to adjust the exponent.
The significand of a binary32 number comprises 24 bits, of which only 23 bits are stored; the most significant bit (the integer bit) is always one and therefore implicit. This means not all of the 32 bits of the integer can be incorporated into the binary32, so in converting a 32-bit unsigned integer one usually rounds to 24-bit precision. To simplify the implementation, in the code below I simply truncate by cutting off the least significant eight bits, which should have no noticeable effect on the uniform distribution. For the exponent part, we can combine the adjustments due to the normalization step with the subtraction due to the scale factor of 2^-32.
The code below is written using hardware-centric primitives. Extracting a bit is just a question of grabbing the correct wire, and shifts by fixed amounts are likewise simply wire shifts. The circuit needed to count the number of leading zeros is typically called a priority encoder.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#define USE_FP_MULTIPLY (0)
uint32_t bit (uint32_t, uint32_t);
uint32_t mux (uint32_t, uint32_t, uint32_t);
uint32_t clz (uint32_t);
float uint32_as_float (uint32_t);
/* uniform float in [0, 1) from uniformly distributed random integers */
float uniform_rand_01 (uint32_t i)
{
const uint32_t FP32_EXPO_BIAS = 127;
const uint32_t FP32_MANT_BITS = 24;
const uint32_t FP32_STORED_MANT_BITS = FP32_MANT_BITS - 1;
uint32_t lz, r;
// compute shift amount needed for normalization
lz = clz (i);
// normalize so that msb is set, except when input is zero
i = mux (bit (lz, 4), i << 16, i);
i = mux (bit (lz, 3), i << 8, i);
i = mux (bit (lz, 2), i << 4, i);
i = mux (bit (lz, 1), i << 2, i);
i = mux (bit (lz, 0), i << 1, i);
// build bit pattern for IEEE-754 binary32 floating-point number
r = (((FP32_EXPO_BIAS - 2 - lz) << FP32_STORED_MANT_BITS) +
(i >> (32 - FP32_MANT_BITS)));
// handle special case of zero input
r = mux (i == 0, i, r);
// treat bit-pattern as 'float'
return uint32_as_float (r);
}
// extract bit i from x
uint32_t bit (uint32_t x, uint32_t i)
{
return (x >> i) & 1;
}
// simulate 2-to-1 multiplexer: c ? a : b ; c must be in {0,1}
uint32_t mux (uint32_t c, uint32_t a, uint32_t b)
{
uint32_t m = c * 0xffffffff;
return (a & m) | (b & ~m);
}
// count leading zeros. A priority encoder in hardware.
uint32_t clz (uint32_t x)
{
uint32_t m, c, y, n = 32;
y = x >> 16; m = n - 16; c = (y != 0); n = mux (c, m, n); x = mux (c, y, x);
y = x >> 8; m = n - 8; c = (y != 0); n = mux (c, m, n); x = mux (c, y, x);
y = x >> 4; m = n - 4; c = (y != 0); n = mux (c, m, n); x = mux (c, y, x);
y = x >> 2; m = n - 2; c = (y != 0); n = mux (c, m, n); x = mux (c, y, x);
y = x >> 1; m = n - 2; c = (y != 0); n = mux (c, m, n - x);
return n;
}
// re-interpret bit pattern of a 32-bit integer as an IEEE-754 binary32
float uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
// George Marsaglia's KISS PRNG, period 2**123. Newsgroup sci.math, 21 Jan 1999
// Bug fix: Greg Rose, "KISS: A Bit Too Simple" http://eprint.iacr.org/2011/007
static uint32_t kiss_z=362436069, kiss_w=521288629;
static uint32_t kiss_jsr=123456789, kiss_jcong=380116160;
#define znew (kiss_z=36969*(kiss_z&65535)+(kiss_z>>16))
#define wnew (kiss_w=18000*(kiss_w&65535)+(kiss_w>>16))
#define MWC ((znew<<16)+wnew )
#define SHR3 (kiss_jsr^=(kiss_jsr<<13),kiss_jsr^=(kiss_jsr>>17), \
kiss_jsr^=(kiss_jsr<<5))
#define CONG (kiss_jcong=69069*kiss_jcong+1234567)
#define KISS ((MWC^CONG)+SHR3)
#define N 100
uint32_t bucket [N];
int main (void)
{
for (int i = 0; i < 100000; i++) {
uint32_t i = KISS;
#if USE_FP_MULTIPLY
float r = i * 0x1.0p-32f;
#else // USE_FP_MULTIPLY
float r = uniform_rand_01 (i);
#endif // USE_FP_MULTIPLY
bucket [(int)(r * N)]++;
}
for (int i = 0; i < N; i++) {
printf ("bucket [%2d]: [%.5f,%.5f): %u\n",
i, 1.0f*i/N, (i+1.0f)/N, bucket[i]);
}
return EXIT_SUCCESS;
}
Please check xoshiro128+ here: https://prng.di.unimi.it/xoshiro128plus.c
A VHDL implementation written by someone else can be found here:
https://github.com/jorisvr/vhdl_prng/tree/master/rtl
The seed value is generated from another random number generation algorithm, so don't get confused by that.
Depending on the seed value used, it should give a uniform distribution.
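For reference, a sketch of the xoshiro128+ step function in C, adapted from the linked reference implementation (treat the details as an assumption and check the original source before use):
#include <stdint.h>

static uint32_t s[4]; /* state; must be seeded so that it is not all zero */

static inline uint32_t rotl(uint32_t x, int k) {
    return (x << k) | (x >> (32 - k));
}

/* one step of xoshiro128+: returns 32 pseudo-random bits */
uint32_t xoshiro128plus_next(void) {
    uint32_t result = s[0] + s[3];
    uint32_t t = s[1] << 9;
    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];
    s[2] ^= t;
    s[3] = rotl(s[3], 11);
    return result;
}
Each 32-bit output can then be mapped to [0,1) with the conversion shown in the answer above.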

How to partly sort arrays on CUDA?

Problem
Provided I have two arrays:
const int N = 1000000;
float A[N];
myStruct *B[N];
The numbers in A can be positive or negative (e.g. A[N]={3,2,-1,0,5,-2}). How can I partly sort the array A (all positive values first, not necessarily sorted among themselves, then the negative values) (e.g. A[N]={3,2,5,0,-1,-2} or A[N]={5,2,3,0,-2,-1}) on the GPU? The array B should be rearranged along with A (A is the keys, B the values).
Since A and B can be very large, I think the sort algorithm should be implemented on the GPU (especially on CUDA, because I use this platform). Surely I know thrust::sort_by_key can do this work, but it does much extra work, since I do not need the arrays A and B to be sorted entirely.
Has anyone come across this kind of problem?
Thrust example
thrust::sort_by_key(thrust::device_ptr<float> (A),
thrust::device_ptr<float> ( A + N ),
thrust::device_ptr<myStruct> ( B ),
thrust::greater<float>() );
Thrust's documentation on GitHub is not up-to-date. As @JaredHoberock said, thrust::partition is the way to go since it now supports stencils. You may need to get a copy from the GitHub repository:
git clone git://github.com/thrust/thrust.git
Then run scons doc in the Thrust folder to get an updated documentation, and use these updated Thrust sources when compiling your code (nvcc -I/path/to/thrust ...). With the new stencil partition, you can do:
#include <thrust/partition.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
struct is_positive
{
__host__ __device__
bool operator()(const int &x)
{
return x >= 0;
}
};
thrust::partition(thrust::host, // if you want to test on the host
thrust::make_zip_iterator(thrust::make_tuple(keyVec.begin(), valVec.begin())),
thrust::make_zip_iterator(thrust::make_tuple(keyVec.end(), valVec.end())),
keyVec.begin(),
is_positive());
This returns:
Before:
keyVec = 0 -1 2 -3 4 -5 6 -7 8 -9
valVec = 0 1 2 3 4 5 6 7 8 9
After:
keyVec = 0 2 4 6 8 -5 -3 -7 -1 -9
valVec = 0 2 4 6 8 5 3 7 1 9
Note that the 2 partitions are not necessarily sorted. Also, the order may differ between the original vectors and the partitions. If this is important to you, you can use thrust::stable_partition:
stable_partition differs from partition in that stable_partition is
guaranteed to preserve relative order. That is, if x and y are
elements in [first, last), such that pred(x) == pred(y), and if x
precedes y, then it will still be true after stable_partition that x
precedes y.
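With the zip-iterator setup from above, the call is a drop-in replacement (assuming your Thrust version also provides the stencil overload of stable_partition):
thrust::stable_partition(thrust::make_zip_iterator(thrust::make_tuple(keyVec.begin(), valVec.begin())),
                         thrust::make_zip_iterator(thrust::make_tuple(keyVec.end(), valVec.end())),
                         keyVec.begin(),
                         is_positive());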
If you want a complete example, here it is:
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/partition.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
struct is_positive
{
__host__ __device__
bool operator()(const int &x)
{
return x >= 0;
}
};
void print_vec(const thrust::host_vector<int>& v)
{
for(size_t i = 0; i < v.size(); i++)
std::cout << " " << v[i];
std::cout << "\n";
}
int main ()
{
const int N = 10;
thrust::host_vector<int> keyVec(N);
thrust::host_vector<int> valVec(N);
int sign = 1;
for(int i = 0; i < N; ++i)
{
keyVec[i] = sign * i;
valVec[i] = i;
sign *= -1;
}
// Copy host to device
thrust::device_vector<int> d_keyVec = keyVec;
thrust::device_vector<int> d_valVec = valVec;
std::cout << "Before:\n keyVec = ";
print_vec(keyVec);
std::cout << " valVec = ";
print_vec(valVec);
// Partition key-val on device
thrust::partition(thrust::make_zip_iterator(thrust::make_tuple(d_keyVec.begin(), d_valVec.begin())),
thrust::make_zip_iterator(thrust::make_tuple(d_keyVec.end(), d_valVec.end())),
d_keyVec.begin(),
is_positive());
// Copy result back to host
keyVec = d_keyVec;
valVec = d_valVec;
std::cout << "After:\n keyVec = ";
print_vec(keyVec);
std::cout << " valVec = ";
print_vec(valVec);
}
UPDATE
I made a quick comparison with the thrust::sort_by_key version, and the thrust::partition implementation does seem to be faster (which is what we could naturally expect). Here is what I obtain on NVIDIA Visual Profiler, with N = 1024 * 1024, with the sort version on the left, and the partition version on the right. You may want to do the same kind of tests on your own.
How about this:
Count how many positive numbers there are, to determine the inflexion point (see the sketch at the end of this answer)
Evenly divide each side of the inflexion point into groups (the negative-side groups are all the same length as each other, but may differ in length from the positive-side groups; these groups are the memory chunks for the results)
Use one kernel call (one thread) per chunk pair
Each kernel swaps any out-of-place elements in the input groups into the desired output groups. You will need to flag any chunks that have more swaps than the maximum so that you can fix them during subsequent iterations.
Repeat until done
Memory traffic is swaps only (from original element position, to sorted position). I don't know if this algorithm sounds like anything already defined...
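For the first step, a sketch of how the inflexion point could be found with a single Thrust reduction (this assumes A already lives in device memory; the functor name is made up for the example):
#include <thrust/count.h>
#include <thrust/device_ptr.h>

struct is_nonneg {
    __host__ __device__ bool operator()(float x) const { return x >= 0.0f; }
};

// number of non-negative keys == index where the negative part starts
int num_positive = thrust::count_if(thrust::device_ptr<float>(A),
                                    thrust::device_ptr<float>(A + N),
                                    is_nonneg());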
You should be able to achieve this in thrust simply with a modification of your comparison operator:
struct my_compare
{
__device__ __host__ bool operator()(const float x, const float y) const
{
return !((x<0.0f) && (y>0.0f));
}
};
thrust::sort_by_key(thrust::device_ptr<float> (A),
thrust::device_ptr<float> ( A + N ),
thrust::device_ptr<myStruct> ( B ),
my_compare() );

Expand Right Bitwise Algorithm

Originally this post requested an inverse sheep-and-goats operation, but I realized that it was more than I really needed, so I edited the title, because I only need an expand-right algorithm, which is simpler. The example that I described below is still relevant.
Original Post:
I'm trying to figure out how to do either an inverse sheep-and-goats operation or, even better, an expand-right-flip.
According to Hacker's Delight, a sheeps-and-goats operation can be represented by:
SAG(x, m) = compress_left(x, m) | compress(x, ~m)
According to this site, the inverse can be found by:
INV_SAG(x, m, sw) = expand_left(x, ~m, sw) | expand_right(x, m, sw)
However, I can't find any code for the expand_left and expand_right functions. They are, of course, the inverse functions for compress, but compress is kind of hard to understand in itself.
Example:
To better explain what I'm looking for, consider a set of 8 bits like:
0000abcd
The variables a, b, c and d may be either ones or zeros. In addition, there is a mask which repositions the bits. So for example, if the mask were 01100101, the resulting bits would be repositioned as follows:
0ab00c0d
This can be done with an inverse sheeps-and-goats operation. However, according to this section of the site mentioned above, there is a more efficient way which he refers to as the expand-right-flip. Looking at his site, I was unable to figure out how that can be done.
Here's the expand_right from Hacker's Delight; the book just calls it expand, but it is the right-expanding version.
unsigned expand(unsigned x, unsigned m) {
unsigned m0, mk, mp, mv, t;
unsigned array[5];
int i;
m0 = m; // Save original mask.
mk = ~m << 1; // We will count 0's to right.
for (i = 0; i < 5; i++) {
mp = mk ^ (mk << 1); // Parallel suffix.
mp = mp ^ (mp << 2);
mp = mp ^ (mp << 4);
mp = mp ^ (mp << 8);
mp = mp ^ (mp << 16);
mv = mp & m; // Bits to move.
array[i] = mv;
m = (m ^ mv) | (mv >> (1 << i)); // Compress m.
mk = mk & ~mp;
}
for (i = 4; i >= 0; i--) {
mv = array[i];
t = x << (1 << i);
x = (x & ~mv) | (t & mv);
}
return x & m0; // Clear out extraneous bits.
}
You can use expand_left(x, m) == expand_right(x >> (32 - popcnt(m)), m) to make the left version, but that's probably not the best way.
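On x86 CPUs with BMI2, expand_right is a single pdep instruction. A quick check of the question's example (added for illustration, not part of the original answer):
#include <immintrin.h> // _pdep_u32, needs BMI2 (compile with -mbmi2)
#include <stdio.h>

int main(void) {
    /* question's example: scatter the low bits abcd into the set positions of mask 01100101 */
    unsigned x = 0x0B; /* 0000abcd with a=1, b=0, c=1, d=1 */
    unsigned m = 0x65; /* 01100101 */
    unsigned r = _pdep_u32(x, m); /* same result as expand(x, m) above */
    printf("%02X\n", r); /* prints 45, i.e. 01000101 == 0ab00c0d */
    return 0;
}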

clear all but the two most significant set bits in a word

Given a 32-bit int which is known to have at least 2 bits set, is there a way to efficiently clear all except the 2 most significant set bits? i.e. I want to ensure the output has exactly 2 bits set.
What if the input is guaranteed to have only 2 or 3 bits set?
Examples:
0x2040 -> 0x2040
0x0300 -> 0x0300
0x0109 -> 0x0108
0x5040 -> 0x5000
Benchmarking Results:
Code:
QueryPerformanceFrequency(&freq);
/***********/
value = (base =2)|1;
QueryPerformanceCounter(&start);
for (l=0;l<A_LOT; l++)
{
//!!value calculation goes here
junk+=value; //use result to prevent optimizer removing it.
//advance to the next 2|3 bit word
if (value&0x80000000)
{ if (base&0x80000000)
{ base=6;
}
base*=2;
value=base|1;
}
else
{ value<<=1;
}
}
QueryPerformanceCounter(&end);
time = (end.QuadPart - start.QuadPart);
time /= freq.QuadPart;
printf("--------- name\n");
printf("%ld loops took %f sec (%f additional)\n",A_LOT, time, time-baseline);
printf("words /sec = %f Million\n",A_LOT/(time-baseline)/1.0e6);
Results using VS2005 default release settings on a Core2Duo E7500 @ 2.93 GHz:
--------- BASELINE
1000000 loops took 0.001630 sec
--------- sirgedas
1000000 loops took 0.002479 sec (0.000849 additional)
words /sec = 1178.074206 Million
--------- ashelly
1000000 loops took 0.004640 sec (0.003010 additional)
words /sec = 332.230369 Million
--------- mvds
1000000 loops took 0.005250 sec (0.003620 additional)
words /sec = 276.242030 Million
--------- spender
1000000 loops took 0.009594 sec (0.007964 additional)
words /sec = 125.566361 Million
--------- schnaader
1000000 loops took 0.025680 sec (0.024050 additional)
words /sec = 41.580158 Million
If the input is guaranteed to have exactly 2 or 3 bits then the answer can be computed very quickly. We exploit the fact that the expression x&(x-1) is equal to x with the LSB cleared. Applying that expression twice to the input will produce 0, if 2 or fewer bits are set. If exactly 2 bits are set, we return the original input. Otherwise, we return the original input with the LSB cleared.
Here is the code in C++:
// assumes a has exactly 2 or 3 bits set
int topTwoBitsOf( int a )
{
int b = a&(a-1); // b = a with LSB cleared
return b&(b-1) ? b : a; // check if clearing the LSB of b produces 0
}
This can be written as a confusing single expression, if you like:
int topTwoBitsOf( int a )
{
return a&(a-1)&((a&(a-1))-1) ? a&(a-1) : a;
}
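A quick check against the examples from the question (added for illustration, using topTwoBitsOf as defined above):
#include <stdio.h>

int main(void) {
    printf("%X\n", topTwoBitsOf(0x2040)); /* 2040 (already only two bits) */
    printf("%X\n", topTwoBitsOf(0x0300)); /* 300 */
    printf("%X\n", topTwoBitsOf(0x0109)); /* 108 */
    printf("%X\n", topTwoBitsOf(0x5040)); /* 5000 */
    return 0;
}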
I'd create a mask in a loop. At the beginning, the mask is 0. Then go from the MSB to the LSB and set each corresponding bit in the mask to 1 until you found 2 set bits. Finally AND the value with this mask.
#include <stdio.h>
#include <stdlib.h>
int clear_bits(int value) {
unsigned int mask = 0;
unsigned int act_bit = 0x80000000;
unsigned int bit_set_count = 0;
do {
if ((value & act_bit) == act_bit) bit_set_count++;
mask = mask | act_bit;
act_bit >>= 1;
} while ((act_bit != 0) && (bit_set_count < 2));
return (value & mask);
}
int main() {
printf("0x2040 => %X\n", clear_bits(0x2040));
printf("0x0300 => %X\n", clear_bits(0x0300));
printf("0x0109 => %X\n", clear_bits(0x0109));
printf("0x5040 => %X\n", clear_bits(0x5040));
return 0;
}
This is quite complicated, but it should be more efficient than using a for loop over all 32 bits every time (clearing all bits except the 2 most significant set ones). Anyway, be sure to benchmark the different ways before settling on one.
Of course, if memory is not a problem, use a lookup table approach like some have recommended - this will be much faster.
How much memory is available, and at what latency? I would propose a lookup table ;-)
But seriously: if you perform this on hundreds of numbers, an 8-bit lookup table giving the top 2 set bits and another 8-bit lookup table giving the top set bit may be all you need. Depending on the processor, this might beat actually counting bits.
For speed, I would create a lookup table mapping an input byte I to
M(I) = 0 if 0 or 1 bits are set
M(I) = B' otherwise, where B' is the value of the byte with only its 2 most significant set bits kept.
Your 32-bit int is made up of 4 input bytes, I1 I2 I3 I4.
Look up M(I1); if nonzero, you're done.
Compare M(I1) == 0; if zero, repeat the previous step for I2.
Else, look up I2 in a second lookup table holding the 1 most significant set bit; if nonzero, you're done.
Else, repeat the previous step for I3.
And so on. Don't actually loop over I1-I4, but unroll it fully.
Summing up: 2 lookup tables with 256 entries, 247/256 of cases are resolved with one lookup, approx 8/256 with two lookups, etc.
Edit: the tables, for clarity (input, table giving top 2 set bits, table giving top set bit):
I table2 table1
0 00000000 00000000
1 00000000 00000001
2 00000000 00000010
3 00000011 00000010
4 00000000 00000100
5 00000101 00000100
6 00000110 00000100
7 00000110 00000100
8 00000000 00001000
9 00001001 00001000
10 00001010 00001000
11 00001010 00001000
12 00001100 00001000
13 00001100 00001000
14 00001100 00001000
15 00001100 00001000
16 00000000 00010000
17 00010001 00010000
18 00010010 00010000
19 00010010 00010000
20 00010100 00010000
..
250 11000000 10000000
251 11000000 10000000
252 11000000 10000000
253 11000000 10000000
254 11000000 10000000
255 11000000 10000000
Here's another attempt (no loops, no lookup, no conditionals). This time it works:
var orig=0x109;
var x=orig;
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x = orig & ~(x & ~(x >> 1));
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
var solution=orig & ~(x >> 1);
Console.WriteLine(solution.ToString("X")); //0x108
Could probably be shortened by someone cleverer than me.
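For the general case (any number of set bits), there is also a short route if a count-leading-zeros intrinsic is available. This sketch uses GCC's __builtin_clz and is not one of the benchmarked answers:
/* keep only the two most significant set bits; assumes v has at least 2 bits set */
unsigned top_two_bits(unsigned v) {
    unsigned hi  = 1u << (31 - __builtin_clz(v));      /* most significant set bit */
    unsigned hi2 = 1u << (31 - __builtin_clz(v ^ hi)); /* second most significant set bit */
    return hi | hi2;
}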
Following up on my previous answer, here's the complete implementation. I think it is as fast as it can get. (sorry for unrolling the whole thing ;-)
#include <stdio.h>
#include <stdlib.h> /* for atoi */
unsigned char bittable1[256];
unsigned char bittable2[256];
unsigned int lookup(unsigned int);
void gentable(void);
int main(int argc,char**argv)
{
unsigned int challenge = 0x42341223, result;
gentable();
if ( argc > 1 ) challenge = atoi(argv[1]);
result = lookup(challenge);
printf("%08x --> %08x\n",challenge,result);
}
unsigned int lookup(unsigned int i)
{
unsigned int ret;
ret = bittable2[i>>24]<<24; if ( ret ) return ret;
ret = bittable1[i>>24]<<24;
if ( !ret )
{
ret = bittable2[i>>16]<<16; if ( ret ) return ret;
ret = bittable1[i>>16]<<16;
if ( !ret )
{
ret = bittable2[i>>8]<<8; if ( ret ) return ret;
ret = bittable1[i>>8]<<8;
if ( !ret )
{
return bittable2[i] | bittable1[i];
} else {
return (ret | bittable1[i&0xff]);
}
} else {
if ( bittable1[(i>>8)&0xff] )
{
return (ret | (bittable1[(i>>8)&0xff]<<8));
} else {
return (ret | bittable1[i&0xff]);
}
}
} else {
if ( bittable1[(i>>16)&0xff] )
{
return (ret | (bittable1[(i>>16)&0xff]<<16));
} else if ( bittable1[(i>>8)&0xff] ) {
return (ret | (bittable1[(i>>8)&0xff]<<8));
} else {
return (ret | (bittable1[i&0xff]));
}
}
}
void gentable()
{
int i;
for ( i=0; i<256; i++ )
{
int bitset = 0;
int j;
for ( j=128; j; j>>=1 )
{
if ( i&j )
{
bitset++;
if ( bitset == 1 ) bittable1[i] = i&(~(j-1));
else if ( bitset == 2 ) bittable2[i] = i&(~(j-1));
}
}
//printf("%3d %02x %02x\n",i,bittable1[i],bittable2[i]);
}
}
Using a variation of this, I came up with the following:
var orig=56;
var x=orig;
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
Console.WriteLine(orig&~(x>>2));
In c# but should translate easily.
EDIT
I'm not so sure I've answered your question. This takes the highest bit and preserves it along with the bit position directly below it, e.g. 101 => 100.
Here's some python that should work:
def bit_play(num):
bits_set = 0
upper_mask = 0
bit_index = 31
while bit_index >= 0:
upper_mask |= (1 << bit_index)
if num & (1 << bit_index) != 0:
bits_set += 1
if bits_set == 2:
num &= upper_mask
break
bit_index -= 1
return num
It makes one pass over the number. It builds a mask of the bits that it crosses so it can mask off the bottom bits as soon as it hits the second-most significant one. As soon as it finds the second bit, it proceeds to clear the lower bits. You should be able to create a mask of the upper bits and &= it in instead of the second while loop. Maybe I'll hack that in and edit the post.
I'd also use a table-based approach, but I believe one table alone should be sufficient. Take the 4-bit case as an example. If your input is guaranteed to have 2 or 3 bits set, then your output can only be one of 6 values:
0011
0101
0110
1001
1010
1100
Put these possible values in an array sorted by size. Starting with the largest, find the first value which is equal to or less than your target value. This is your answer. For the 8 bit version you'll have more possible return values, but still easily less than the maximum possible permutations of 8*7.
public static final int [] MASKS = {
0x03, //0011
0x05, //0101
0x06, //0110
0x09, //1001
0x0A, //1010
0x0C, //1100
};
for (int i = 0; i < 16; ++i) {
if (countBits(i) < 2) {
continue;
}
for (int j = MASKS.length - 1; j >= 0; --j) {
if (MASKS[j] <= i) {
System.out.println(Integer.toBinaryString(i) + " " + Integer.toBinaryString(MASKS[j]));
break;
}
}
}
Here's my implementation in C#
uint OnlyMostSignificant(uint value, int count) {
uint newValue = 0;
int c = 0;
for(uint high = 0x80000000; high != 0 && c < count; high >>= 1) {
if ((value & high) != 0) {
newValue = newValue | high;
c++;
}
}
return newValue;
}
Using count, you could make it the most significant (count) bits.
My solution:
Use "The best method for counting bits in a 32-bit integer", then clear the lower bit if the answer is 3. Only works when input is limited to 2 or 3 bits set.
unsigned int c; // c is the total bits set in v
unsigned int v = value;
v = v - ((v >> 1) & 0x55555555);
v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // temp
c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count
crc += value & (value - (c - 2)); // c == 2 keeps value, c == 3 clears the lowest set bit

Multiplication of very long integers

Is there an algorithm for accurately multiplying two arbitrarily long integers together? The language I am working with is limited to 64-bit unsigned integers (maximum value 18446744073709551615). Realistically, I would like to be able to do this by breaking up each number, processing the pieces somehow using unsigned 64-bit integers, and then being able to put them back together into a string (which would solve the issue of storing the multiplied result).
Any ideas?
Most languages have functions or libraries that do this, usually called a Bignum library (GMP is a good one.)
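For example, with GMP the whole thing is a few calls (a minimal sketch, assuming the GMP C library is installed and linked with -lgmp):
#include <gmp.h>
#include <stdio.h>

int main(void) {
    mpz_t a, b, p;
    mpz_init_set_str(a, "18446744073709551615", 10); /* UINT64_MAX; its square needs 128 bits */
    mpz_init_set_str(b, "18446744073709551615", 10);
    mpz_init(p);
    mpz_mul(p, a, b); /* p = a * b, arbitrary precision */
    gmp_printf("%Zd\n", p);
    mpz_clear(a); mpz_clear(b); mpz_clear(p);
    return 0;
}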
If you want to do it yourself, I would do it the same way that people do long multiplication on paper. To do this you could either work with strings containing the number, or do it in binary using bitwise operations.
Example:
45
x67
---
315
+270
----
585
Or in binary:
101
x101
----
101
000
+101
------
11001
Edit: After doing it in binary I realized that it would be much simpler (and faster of course) to code using bitwise operations instead of strings containing the base-10 numbers. I've edited my binary multiplying example to show a pattern: for each 1-bit in the bottom number, add the top number, shifted left by that 1-bit's position, to an accumulator variable. At the end, that variable will contain the product.
To store the product, you'll have to have two 64-bit numbers and imagine one of them being the first 64 bits and the other one the second 64 bits of the product. You'll have to write code that carries the addition from bit 63 of the second number to bit 0 of the first number.
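A sketch of that idea (my addition, not from the original answer): the 128-bit product is held in two 64-bit words, and each partial sum carries from bit 63 of the low word into the high word, exactly as described above.
#include <stdint.h>

typedef struct { uint64_t hi, lo; } u128;

/* add the 128-bit value ahi:alo into acc, carrying from the low word into the high word */
static void add128(u128 *acc, uint64_t ahi, uint64_t alo) {
    uint64_t lo = acc->lo + alo;
    acc->hi += ahi + (lo < acc->lo); /* carry out of bit 63 of the low word */
    acc->lo = lo;
}

/* shift-and-add: for every 1-bit in b, add a shifted left by that bit's position */
u128 mul64x64(uint64_t a, uint64_t b) {
    u128 p = { 0, 0 };
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            uint64_t lo = a << i;                         /* low half of (a << i) */
            uint64_t hi = (i == 0) ? 0 : (a >> (64 - i)); /* bits shifted past bit 63 */
            add128(&p, hi, lo);
        }
    }
    return p;
}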
If you can't use an existing bignum library like GMP, check out Wikipedia's article on binary multiplication with computers. There are a number of good, efficient algorithms for this.
The simplest way would be to use the schoolbook mechanism, splitting your arbitrarily sized numbers into chunks of 32-bit each.
Given A B C D * E F G H (each chunk 32-bit, for a total 128 bit)
You need an output array 9 dwords wide.
Set Out[0..8] to 0
You'd start by doing: H * D + out[8] => 64 bit result.
Store the low 32-bits in out[8] and take the high 32-bits as carry
Next: (H * C) + out[7] + carry
Again, store low 32-bit in out[7], use the high 32-bits as carry
After doing H * A + out[5] + carry, you need to keep propagating the carry (into out[4], out[3], ...) until no carry remains.
Then repeat with G, F, E.
For G, you'd start at out[7] instead of out[8], and so forth.
Finally, walk through and convert the large integer into digits (which will require a "divide large number by a single word" routine)
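A sketch of that chunk-by-chunk scheme (my addition; the chunk layout and helper name are made up for the example): x and y hold the chunks A B C D and E F G H most significant first, and out[0..8] receives the result as described above.
#include <stdint.h>

void mul_chunks(const uint32_t x[4], const uint32_t y[4], uint32_t out[9]) {
    for (int i = 0; i < 9; i++) out[i] = 0;
    for (int i = 3; i >= 0; i--) {          /* walk y from H (index 3) up to E */
        uint64_t carry = 0;
        int k = i + 5;                      /* H pairs with out[8], G with out[7], ... */
        for (int j = 3; j >= 0; j--, k--) {
            uint64_t t = (uint64_t)y[i] * x[j] + out[k] + carry;
            out[k] = (uint32_t)t;           /* keep the low 32 bits here */
            carry = t >> 32;                /* high 32 bits carry into the next slot */
        }
        while (carry && k >= 0) {           /* keep propagating until no carry remains */
            uint64_t t = (uint64_t)out[k] + carry;
            out[k--] = (uint32_t)t;
            carry = t >> 32;
        }
    }
}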
Yes, you do it using a datatype that is effectively a string of digits (just like a normal 'string' is a string of characters). How you do this is highly language-dependent. For instance, Java has BigInteger for this (and BigDecimal for decimal values). What language are you using?
This is often given as a homework assignment. The algorithm you learned in grade school will work. Use a library (several are mentioned in other posts) if you need this for a real application.
Here is my code piece in C - the good old schoolbook multiply method:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* helpers so the snippet compiles everywhere: strrev is not standard C (rename it
   if your C library already provides one), and get_int converts a decimal digit
   character to its numeric value */
static void strrev(char *s) {
    for (int i = 0, j = (int)strlen(s) - 1; i < j; i++, j--) {
        char t = s[i]; s[i] = s[j]; s[j] = t;
    }
}
static int get_int(char c) { return c - '0'; }

char *multiply(char s1[], char s2[]) {
    int l1 = strlen(s1);
    int l2 = strlen(s2);
    int i, j, k = 0, c = 0;
    char *r = (char *) malloc(l1 + l2 + 1); // add one byte for the zero terminating the string
    int temp;
    strrev(s1);
    strrev(s2);
    for (i = 0; i < l1 + l2; i++) {
        r[i] = 0 + '0';
    }
    for (i = 0; i < l1; i++) {
        c = 0; k = i;
        for (j = 0; j < l2; j++) {
            temp = get_int(s1[i]) * get_int(s2[j]);
            temp = temp + c + get_int(r[k]);
            c = temp / 10;
            r[k] = temp % 10 + '0';
            k++;
        }
        if (c != 0) {
            r[k] = c + '0';
            k++;
        }
    }
    r[k] = '\0';
    strrev(r);
    return r;
}
// Here is a JavaScript version of the Karatsuba algorithm, which runs in less time than the usual multiplication method
function range(start, stop, step) {
if (typeof stop == 'undefined') {
// one param defined
stop = start;
start = 0;
}
if (typeof step == 'undefined') {
step = 1;
}
if ((step > 0 && start >= stop) || (step < 0 && start <= stop)) {
return [];
}
var result = [];
for (var i = start; step > 0 ? i < stop : i > stop; i += step) {
result.push(i);
}
return result;
};
function zeroPad(numberString, zeros, left = true) {
//Return the string with zeros added to the left or right.
for (var i in range(zeros)) {
if (left)
numberString = '0' + numberString
else
numberString = numberString + '0'
}
return numberString
}
function largeMultiplication(x, y) {
x = x.toString();
y = y.toString();
if (x.length == 1 && y.length == 1)
return parseInt(x) * parseInt(y)
if (x.length < y.length)
x = zeroPad(x, y.length - x.length);
else
y = zeroPad(y, x.length - y.length);
n = x.length
j = Math.floor(n/2);
//for odd digit integers
if ( n % 2 != 0)
j += 1
var BZeroPadding = n - j
var AZeroPadding = BZeroPadding * 2
a = parseInt(x.substring(0,j));
b = parseInt(x.substring(j));
c = parseInt(y.substring(0,j));
d = parseInt(y.substring(j));
//recursively calculate
ac = largeMultiplication(a, c)
bd = largeMultiplication(b, d)
k = largeMultiplication(a + b, c + d)
A = parseInt(zeroPad(ac.toString(), AZeroPadding, false))
B = parseInt(zeroPad((k - ac - bd).toString(), BZeroPadding, false))
return A + B + bd
}
//testing the function here
example = largeMultiplication(12, 34)
console.log(example)
