Originally this post requested an inverse sheep-and-goats operation, but I realized that it was more than I really needed, so I edited the title, because I only need an expand-right algorithm, which is simpler. The example that I described below is still relevant.
Original Post:
I'm trying to figure out how to do either an inverse sheep-and-goats operation or, even better, an expand-right-flip.
According to Hacker's Delight, a sheep-and-goats operation can be represented by:
SAG(x, m) = compress_left(x, m) | compress(x, ~m)
According to this site, the inverse can be found by:
INV_SAG(x, m, sw) = expand_left(x, ~m, sw) | expand_right(x, m, sw)
However, I can't find any code for the expand_left and expand_right functions. They are, of course, the inverse functions for compress, but compress is kind of hard to understand in itself.
Example:
To better explain what I'm looking for, consider a set of 8 bits like:
0000abcd
The variables a, b, c and d may be either ones or zeros. In addition, there is a mask which repositions the bits. So for example, if the mask were 01100101, the resulting bits would be repositioned as follows:
0ab00c0d
This can be done with an inverse sheep-and-goats operation. However, according to this section of the site mentioned above, there is a more efficient way, which the author refers to as the expand-right-flip. Looking at the site, I was unable to figure out how that can be done.
Here's expand_right from Hacker's Delight. The book just calls it expand, but it is the right-hand version.
unsigned expand(unsigned x, unsigned m) {
unsigned m0, mk, mp, mv, t;
unsigned array[5];
int i;
m0 = m; // Save original mask.
mk = ~m << 1; // We will count 0's to right.
for (i = 0; i < 5; i++) {
mp = mk ^ (mk << 1); // Parallel suffix.
mp = mp ^ (mp << 2);
mp = mp ^ (mp << 4);
mp = mp ^ (mp << 8);
mp = mp ^ (mp << 16);
mv = mp & m; // Bits to move.
array[i] = mv;
m = (m ^ mv) | (mv >> (1 << i)); // Compress m.
mk = mk & ~mp;
}
for (i = 4; i >= 0; i--) {
mv = array[i];
t = x << (1 << i);
x = (x & ~mv) | (t & mv);
}
return x & m0; // Clear out extraneous bits.
}
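You can use expand_left(x, m) == expand_right(x >> (32 - popcnt(m)), m) to make the left version, but that's probably not the best way. Note that when m is 0 the shift count is 32, which is undefined for a 32-bit unsigned in C, so that case needs a separate check.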
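As a cross-check for the parallel version above, here is a minimal bit-by-bit sketch of both directions. The names expand_right_ref and expand_left_ref are made up for this illustration, and __builtin_popcount assumes GCC or Clang. It is far slower than the Hacker's Delight routine, but it is easy to verify by hand.

```c
#include <stdint.h>

/* Reference expand-right: the low popcount(m) bits of x are scattered,
   in order, to the positions of the 1 bits of m (lowest first). */
static uint32_t expand_right_ref(uint32_t x, uint32_t m) {
    uint32_t result = 0;
    while (m != 0) {
        uint32_t lowest = m & (0u - m);  /* lowest remaining mask bit */
        if (x & 1u) {
            result |= lowest;            /* place the next source bit there */
        }
        x >>= 1;                         /* advance to the next source bit */
        m &= m - 1;                      /* clear the mask bit just used */
    }
    return result;
}

/* Expand-left via the identity quoted above, with the m == 0 case
   handled separately to avoid an undefined 32-bit shift. */
static uint32_t expand_left_ref(uint32_t x, uint32_t m) {
    int k = __builtin_popcount(m);       /* GCC/Clang builtin (assumption) */
    if (k == 0) {
        return 0;
    }
    return expand_right_ref(x >> (32 - k), m);
}
```

With the example from the question, x = 0000abcd and m = 01100101 give 0ab00c0d: for abcd = 1011, expand_right_ref(0x0B, 0x65) returns 0x45.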
For a noise shader I'm looking for a pseudo random number algorithm with 3d vector argument, i.e.,
for every integer vector it returns a value in [0,1].
It should be as fast as possible without producing visual artifacts and giving the same results on every GPU.
Two variants (pseudo code) I found are
rand1(vec3 (x,y,z)){
return xorshift32(x ^ xorshift32(y ^ xorshift32(z)));
}
which already uses 20 arithmetic operations and still has to be cast and normalized, and
rand2(vec3 v){
return fract(sin(dot(v, vec3(12.9898, 78.233, ?))) * 43758.5453);
};
which might be faster but uses sin, causing precision problems and different results on different GPUs.
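To make the first variant concrete, here is a minimal C sketch of rand1 including the cast and normalization step. The shift triple (13, 17, 5) is an assumption (the pseudocode does not say which xorshift32 parameters are meant), and keeping only the top 24 bits is one possible way to get an exactly representable float in [0, 1).

```c
#include <stdint.h>

/* One xorshift32 step with the commonly used (13, 17, 5) shifts (assumption). */
static uint32_t xorshift32(uint32_t s) {
    s ^= s << 13;
    s ^= s >> 17;
    s ^= s << 5;
    return s;
}

/* Hash an integer lattice point to a float in [0, 1). */
static float rand1(int32_t x, int32_t y, int32_t z) {
    uint32_t h = xorshift32((uint32_t)x ^
                 xorshift32((uint32_t)y ^
                 xorshift32((uint32_t)z)));
    /* Keep the top 24 bits so the value converts to float without rounding. */
    return (float)(h >> 8) * (1.0f / 16777216.0f);
}
```

One thing to watch for: xorshift32 maps 0 to 0, so the point (0, 0, 0) always hashes to 0 unless a constant is mixed in first.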
Do you know any other algorithms requiring fewer arithmetic operations?
Thanks in advance.
Edit: Another standard PRNG is XORWOW as implemented in C as
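uint32_t xorwow(void) {
    // The state must persist between calls, hence static. Unsigned 32-bit
    // values (uint32_t from <stdint.h>) keep the shifts and wrap-around well defined.
    static uint32_t x = 123456789, y = 362436069, z = 521288629, w = 88675123, v = 5783321, d = 6615241;
    uint32_t t = (x ^ (x >> 2));
    x = y;
    y = z;
    z = w;
    w = v;
    v = (v ^ (v << 4)) ^ (t ^ (t << 1));
    return (d += 362437) + v;
}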
Can we rewrite it to fit in our context?
So I need to solve the linear system (A + i * mu * I) x = b, where A is a dense Hermitian matrix (6x6 complex numbers), mu is a real scalar, and I is the identity matrix.
Obviously if mu=0, I should just use Cholesky and be done with it. With non-zero mu though, the matrix ceases to be Hermitian and Cholesky fails.
Possible solutions:
Solve normal operator using Cholesky and multiply by the conjugate
Solve directly using LU decomposition
This is in a time-critical performance routine, where I need the most efficient method. Any thoughts on the optimum approach, or if there is a specific method for solving the above shifted Hermitian system?
This is to be deployed in a CUDA kernel, where I'll be solving many linear systems in parallel, e.g., one per thread. This means that I need a solution that minimizes thread divergence. Given the small system size, pivoting can be ignored without too much issue: this removes a possible source of thread divergence. I've already implemented an in-place Cholesky normal method, and while it's working ok, the performance isn't great in double precision.
I can't vouch for the stability of the method below, but if your matrix is reasonably well conditioned, it might be worth a try.
We want to solve
A*X = B
If we pick out the first row and column, say
A = ( a  y  )
    ( z  A_ )
X = ( x  )
    ( X_ )
B = ( b  )
    ( B_ )
The requirement is
a*x + y*X_ = b
z*x + A_*X_ = B_
so
x = (b - y*X_ )/a
(A_ - zy/a) * X_ = B_ - (b/a)z
The solution goes in two stages. First use the second equation to transform A and B, then use the first to form the solution x.
In C:
static void nhsol( int dim, complx* A, complx* B, complx* X)
{
int i, j, k;
complx a, fb, fa;
complx* z;
complx* acol;
// update A and B
for( i=0; i<dim; ++i)
{ z = A + i*dim;
a = z[i];
// update B
fb = B[i]/a;
for( j=i+1; j<dim; ++j)
{ B[j] -= fb*z[j];
}
// update A
for( k=i+1; k<dim; ++k)
{ acol = A + k*dim;
fa = acol[i]/a;
for( j=i+1; j<dim; ++j)
{ acol[j] -= fa*z[j];
}
}
}
// compute x
i = dim-1;
X[i] = B[i] / A[i+dim*i];
while( --i>=0)
{
complx s = B[i];
for( j=i+1; j<dim; ++j)
{ s -= A[i+j*dim]*X[j];
}
X[i] = s/A[i+i*dim];
}
}
where
typedef _Complex double complx;
If code space is not at a premium it might be worth unrolling the loops. Personally I would do this by writing a program whose sole job was to write the code.
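To try the idea on the shifted system from the question, here is a small self-contained sketch. It uses a compact no-pivot elimination in the same column-major layout as nhsol, builds A + i*mu*I for a made-up 2x2 Hermitian A, and returns the squared size of the largest residual entry. The names solve_nopivot and shifted_residual_2x2, and the test matrix, are hypothetical and chosen only for illustration.

```c
#include <complex.h>

typedef double complex cplx;

/* Gaussian elimination without pivoting, column-major storage:
   element (row, col) lives at M[row + col*n]. Overwrites M and b. */
static void solve_nopivot(int n, cplx *M, cplx *b, cplx *x) {
    for (int i = 0; i < n; ++i) {
        cplx piv = M[i + i * n];
        for (int r = i + 1; r < n; ++r) {
            cplx f = M[r + i * n] / piv;
            for (int c = i + 1; c < n; ++c) {
                M[r + c * n] -= f * M[i + c * n];
            }
            b[r] -= f * b[i];
        }
    }
    for (int i = n - 1; i >= 0; --i) {   /* back substitution */
        cplx s = b[i];
        for (int c = i + 1; c < n; ++c) {
            s -= M[i + c * n] * x[c];
        }
        x[i] = s / M[i + i * n];
    }
}

/* Solve (A + i*mu*I) x = b for a sample 2x2 Hermitian A and return
   the largest squared residual entry. */
static double shifted_residual_2x2(double mu) {
    const cplx A[4] = { 2.0, 1.0 + 1.0 * I,     /* column 0: A00, A10 */
                        1.0 - 1.0 * I, 3.0 };   /* column 1: A01, A11 */
    const cplx b[2] = { 1.0, 2.0 * I };
    cplx M[4], rhs[2], x[2];
    for (int k = 0; k < 4; ++k) M[k] = A[k];
    M[0] += I * mu;                      /* add the shift on the diagonal */
    M[3] += I * mu;
    rhs[0] = b[0];
    rhs[1] = b[1];
    solve_nopivot(2, M, rhs, x);
    double worst = 0.0;
    for (int r = 0; r < 2; ++r) {
        cplx acc = -b[r];
        for (int c = 0; c < 2; ++c) {
            acc += (A[r + c * 2] + (r == c ? I * mu : 0.0)) * x[c];
        }
        double sq = creal(acc) * creal(acc) + cimag(acc) * cimag(acc);
        if (sq > worst) worst = sq;
    }
    return worst;
}
```

For a real mu the shifted diagonal entries of a Hermitian A have imaginary part mu, so the first pivot is never zero when mu is non-zero; that does not by itself guarantee stability for the later pivots.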
I forgot a bit hack to generate all integers with a given number of 1s. Does anybody remember it (and probably can explain it also)?
From Bit Twiddling Hacks
Update: test program (Live On Coliru)
#include <cstdint>
#include <iostream>
#include <bitset>
using I = uint8_t;
auto dump(I v) { return std::bitset<sizeof(I) * __CHAR_BIT__>(v); }
I bit_twiddle_permute(I v) {
I t = v | (v - 1); // t gets v's least significant 0 bits set to 1
// Next set to 1 the most significant bit to change,
// set to 0 the least significant ones, and add the necessary 1 bits.
I w = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
return w;
}
int main() {
I p = 0b001001;
std::cout << dump(p) << "\n";
for (I n = bit_twiddle_permute(p); n>p; p = n, n = bit_twiddle_permute(p)) {
std::cout << dump(n) << "\n";
}
}
Prints
00001001
00001010
00001100
00010001
00010010
00010100
00011000
00100001
00100010
00100100
00101000
00110000
01000001
01000010
01000100
01001000
01010000
01100000
10000001
10000010
10000100
10001000
10010000
10100000
11000000
Compute the lexicographically next bit permutation
Suppose we have a pattern of N bits set to 1 in an integer and we want the next permutation of N 1 bits in a lexicographical sense. For example, if N is 3 and the bit pattern is 00010011, the next patterns would be 00010101, 00010110, 00011001, 00011010, 00011100, 00100011, and so forth. The following is a fast way to compute the next permutation.
unsigned int v; // current permutation of bits
unsigned int w; // next permutation of bits
unsigned int t = v | (v - 1); // t gets v's least significant 0 bits set to 1
// Next set to 1 the most significant bit to change,
// set to 0 the least significant ones, and add the necessary 1 bits.
w = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
The __builtin_ctz(v) GNU C compiler intrinsic for x86 CPUs returns the number of trailing zeros. If you are using Microsoft compilers for x86, the intrinsic is _BitScanForward. These both emit a bsf instruction, but equivalents may be available for other architectures. If not, then consider using one of the methods for counting the consecutive zero bits mentioned earlier.
Here is another version that tends to be slower because of its division operator, but it does not require counting the trailing zeros.
unsigned int t = (v | (v - 1)) + 1;
w = t | ((((t & -t) / (v & -v)) >> 1) - 1);
Thanks to Dario Sneidermanis of Argentina, who provided this on November 28, 2009.
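For reference, here is a minimal C sketch in the spirit of the division-based variant, written for 32-bit values with explicit handling of the two boundary cases (v == 0 and the last pattern). The function name next_bit_permutation and the choice to return 0 when there is no next pattern are assumptions made for this example.

```c
#include <stdint.h>

/* Next larger 32-bit value with the same number of 1 bits.
   Returns 0 if v is 0 or already the largest such value. */
static uint32_t next_bit_permutation(uint32_t v) {
    if (v == 0) {
        return 0;
    }
    uint32_t low = v & (0u - v);      /* lowest set bit of v */
    uint32_t ripple = v + low;        /* clears the lowest run of 1s and
                                         sets the bit just above it */
    if (ripple == 0) {
        return 0;                     /* the run reached bit 31: no successor */
    }
    /* v ^ ripple covers the old run plus the new bit. Dividing by low
       right-aligns it, and >> 2 drops the two bits already accounted for. */
    uint32_t ones = ((v ^ ripple) / low) >> 2;
    return ripple | ones;
}
```

Starting from 00010011 this produces 00010101, 00010110, 00011001, 00011010, 00011100, 00100011, matching the sequence in the quoted text.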
For bit hacks I like to refer to this page: Bit Twiddling Hacks.
Regarding your specific question, read the part entitled Compute the lexicographically next bit permutation.
To add onto @sehe's answer, included below (originally from Dario Sneidermanis, also at http://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation):
#include <cstdint>
#include <iostream>
#include <bitset>
using I = uint8_t;
auto dump(I v) { return std::bitset<sizeof(I) * __CHAR_BIT__>(v); }
I bit_twiddle_permute(I v) {
I t = v | (v - 1); // t gets v's least significant 0 bits set to 1
// Next set to 1 the most significant bit to change,
// set to 0 the least significant ones, and add the necessary 1 bits.
I w = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
return w;
}
int main() {
I p = 0b001001;
std::cout << dump(p) << "\n";
for (I n = bit_twiddle_permute(p); n>p; p = n, n = bit_twiddle_permute(p))
{
std::cout << dump(n) << "\n";
}
}
There are boundary issues with bit_twiddle_permute(I v). Whenever v is the last permutation, t is all 1's (e.g. 2^8 - 1), (~t & -~t) = 0, and w is the first permutation of bits with one fewer 1s than v, except when v = 00000000, in which case w = 01111111. In particular, if you set p to 0, the loop in main will produce all permutations with seven 1's, and the following slight modification of the for loop will cycle through all permutations with 0, 7, 6, ..., 1 bits set:
for (I n = bit_twiddle_permute(p); n>p; n = bit_twiddle_permute(n))
If this is the intention, it is perhaps worth a comment. If not it is trivial to fix, e.g.
if (t == (I)(-1)) { return v >> __builtin_ctz(v); }
So with an additional small simplification
I bit_twiddle_permute2(I v) {
I t = (v | (v - 1)) + 1;
if (t == 0) { return v >> __builtin_ctz(v); }
I w = t | ((~t & v) >> (__builtin_ctz(v) + 1));
return w;
}
int main() {
I p = 0b1;
std::cout << dump(p) << "\n";
for (I n = bit_twiddle_permute2(p); n>p; n = bit_twiddle_permute2(n)) {
std::cout << dump(n) << "\n";
}
}
The following adaptation of Dario Sneidermanis's idea may be slightly easier to follow
I bit_twiddle_permute3(I v) {
int n = __builtin_ctz(v);
I s = v >> n;
I t = s + 1;
I w = (t << n) | ((~t & s) >> 1);
return w;
}
or with a similar solution to the issue I mentioned at the beginning of this post
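I bit_twiddle_permute3(I v) {
    if (v == 0) { return 0; } // __builtin_ctz(0) is undefined, so test first
    int n = __builtin_ctz(v);
    I s = v >> n;
    I t = s + 1;
    // t << n is computed as int, so cast back to I to detect the overflow
    if ((I)(t << n) == 0) { return s; }
    I w = (t << n) | ((~t & s) >> 1);
    return w;
}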
Is there an algorithm for accurately multiplying two arbitrarily long integers together? The language I am working with is limited to 64-bit unsigned integer length (maximum integer size of 18446744073709551615). Realistically, I would like to be able to do this by breaking up each number, processing them somehow using the unsigned 64-bit integers, and then being able to put them back together into a string (which would solve the issue of multiplied result storage).
Any ideas?
Most languages have functions or libraries that do this, usually called a Bignum library (GMP is a good one.)
If you want to do it yourself, I would do it the same way that people do long multiplication on paper. To do this you could either work with strings containing the number, or do it in binary using bitwise operations.
Example:
   45
  x67
 ----
  315
+2700
-----
 3015
Or in binary:
   101
  x101
  ----
   101
  000
+101
------
 11001
Edit: After doing it in binary I realized that it would be much simpler (and faster, of course) to code using bitwise operations instead of strings containing the base-10 numbers. I've edited my binary multiplying example to show a pattern: for each 1-bit in the bottom number, shift the top number left by that bit's position and add the result to an accumulator variable. At the end, the accumulator contains the product.
To store the product of two 64-bit numbers, you'll need two 64-bit numbers: treat one as the upper 64 bits of the product and the other as the lower 64 bits. You'll have to write code that carries the addition from bit 63 of the lower number into bit 0 of the upper number.
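The shift-and-add scheme with a two-word result might look like the following C sketch. The u128 struct and the name mul_shift_add are invented for this example; the carry out of the low word is detected by checking whether the sum wrapped around.

```c
#include <stdint.h>

/* 128-bit result held as two 64-bit halves (hypothetical helper type). */
typedef struct {
    uint64_t hi;
    uint64_t lo;
} u128;

/* Multiply two 64-bit values by shift-and-add, one bit of b at a time. */
static u128 mul_shift_add(uint64_t a, uint64_t b) {
    u128 acc = { 0, 0 };
    for (int i = 0; i < 64; ++i) {
        if ((b >> i) & 1u) {
            /* a shifted left by i, split across the two halves */
            uint64_t add_lo = a << i;
            uint64_t add_hi = (i == 0) ? 0 : (a >> (64 - i));
            uint64_t lo = acc.lo + add_lo;
            /* if the low sum wrapped, carry 1 into the high half */
            acc.hi += add_hi + (lo < acc.lo ? 1u : 0u);
            acc.lo = lo;
        }
    }
    return acc;
}
```

For inputs longer than 64 bits, the same idea extends to arrays of words, though working a whole word at a time is much faster than one bit at a time.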
If you can't use an existing bignum library like GMP, check out Wikipedia's article on binary multiplication with computers. There are a number of good, efficient algorithms for this.
The simplest way would be to use the schoolbook mechanism, splitting your arbitrarily sized numbers into chunks of 32-bit each.
Given A B C D * E F G H (each chunk 32-bit, for a total 128 bit)
You need an output array 9 dwords wide.
Set Out[0..8] to 0
You'd start by doing: H * D + out[8] => 64 bit result.
Store the low 32-bits in out[8] and take the high 32-bits as carry
Next: (H * C) + out[7] + carry
Again, store low 32-bit in out[7], use the high 32-bits as carry
After doing (H * A) + out[5] + carry, store the remaining carry in out[4]; in general, keep propagating upward until there is no carry left.
Then repeat with G, F, E.
For G, you'd start at out[7] instead of out[8], and so forth.
Finally, walk through and convert the large integer into digits (which will require a "divide large number by a single word" routine)
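The multiplication part of these steps can be sketched in C as follows. Unlike the description above, this sketch stores limbs little-endian (index 0 is the least significant 32-bit chunk), which is an assumption made to keep the indexing simple; mul_limbs is a made-up name. The conversion to decimal digits is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Schoolbook multiplication on 32-bit limbs, least significant limb first.
   out must have room for na + nb limbs. */
static void mul_limbs(const uint32_t *a, size_t na,
                      const uint32_t *b, size_t nb,
                      uint32_t *out) {
    for (size_t k = 0; k < na + nb; ++k) {
        out[k] = 0;
    }
    for (size_t j = 0; j < nb; ++j) {
        uint64_t carry = 0;
        for (size_t i = 0; i < na; ++i) {
            /* 32x32 product plus two 32-bit addends always fits in 64 bits */
            uint64_t t = (uint64_t)a[i] * b[j] + out[i + j] + carry;
            out[i + j] = (uint32_t)t;   /* low 32 bits stay in place */
            carry = t >> 32;            /* high 32 bits move to the next limb */
        }
        out[j + na] = (uint32_t)carry;  /* this limb is still zero at this point */
    }
}
```

Because the top limb of each row has not been written by earlier rows, the final carry can simply be stored rather than added.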
Yes, you do it using a datatype that is effectively a string of digits (just like a normal 'string' is a string of characters). How you do this is highly language-dependent. For instance, Java provides BigInteger for this. What language are you using?
This is often given as a homework assignment. The algorithm you learned in grade school will work. Use a library (several are mentioned in other posts) if you need this for a real application.
Here is my code piece in C. Good old multiply method
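#include <stdlib.h>
#include <string.h>
// Convert a digit character to its integer value.
static int get_int(char ch) { return ch - '0'; }
// strrev is not standard C, so reverse the string in place by hand.
static void reverse(char s[]) {
    size_t i, n = strlen(s);
    for (i = 0; i < n / 2; i++) {
        char t = s[i];
        s[i] = s[n - 1 - i];
        s[n - 1 - i] = t;
    }
}
// s1 and s2 must be writable strings of decimal digits. The caller frees the result.
char *multiply(char s1[], char s2[]) {
    int l1 = strlen(s1);
    int l2 = strlen(s2);
    int i, j, k = 0, c = 0;
    char *r = (char *) malloc (l1+l2+1); // add one byte for the zero terminating string
    int temp;
    if (r == NULL) return NULL;
    reverse(s1); // work from the least significant digit
    reverse(s2);
    for (i = 0; i < l1+l2; i++) {
        r[i] = 0 + '0';
    }
    for (i = 0; i < l1; i++) {
        c = 0; k = i;
        for (j = 0; j < l2; j++) {
            temp = get_int(s1[i]) * get_int(s2[j]);
            temp = temp + c + get_int(r[k]);
            c = temp / 10;
            r[k] = temp % 10 + '0';
            k++;
        }
        if (c != 0) {
            r[k] = c + '0';
            k++;
        }
    }
    r[k] = '\0';
    reverse(r);
    reverse(s1); // restore the inputs
    reverse(s2);
    return r;
}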
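Here is a JavaScript version of the Karatsuba algorithm, which takes less time than the usual multiplication method. Note that it works on ordinary JavaScript numbers via parseInt, so results above 2^53 lose precision.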
function range(start, stop, step) {
if (typeof stop == 'undefined') {
// one param defined
stop = start;
start = 0;
}
if (typeof step == 'undefined') {
step = 1;
}
if ((step > 0 && start >= stop) || (step < 0 && start <= stop)) {
return [];
}
var result = [];
for (var i = start; step > 0 ? i < stop : i > stop; i += step) {
result.push(i);
}
return result;
};
function zeroPad(numberString, zeros, left = true) {
//Return the string with zeros added to the left or right.
for (var i in range(zeros)) {
if (left)
numberString = '0' + numberString
else
numberString = numberString + '0'
}
return numberString
}
function largeMultiplication(x, y) {
x = x.toString();
y = y.toString();
if (x.length == 1 && y.length == 1)
return parseInt(x) * parseInt(y)
if (x.length < y.length)
x = zeroPad(x, y.length - x.length);
else
y = zeroPad(y, x.length - y.length);
// Declare all locals with var: as implicit globals they would be
// overwritten by the recursive calls below.
var n = x.length;
var j = Math.floor(n/2);
//for odd digit integers
if ( n % 2 != 0)
j += 1;
var BZeroPadding = n - j;
var AZeroPadding = BZeroPadding * 2;
var a = parseInt(x.substring(0,j));
var b = parseInt(x.substring(j));
var c = parseInt(y.substring(0,j));
var d = parseInt(y.substring(j));
//recursively calculate
var ac = largeMultiplication(a, c);
var bd = largeMultiplication(b, d);
var k = largeMultiplication(a + b, c + d);
var A = parseInt(zeroPad(ac.toString(), AZeroPadding, false));
var B = parseInt(zeroPad((k - ac - bd).toString(), BZeroPadding, false));
return A + B + bd;
}
//testing the function here
example = largeMultiplication(12, 34)
console.log(example)
I have an application where a Hilbert R-Tree (wikipedia) (citeseer) would seem to be an appropriate data structure. Specifically, it requires reasonably fast spatial queries over a data set that will experience a lot of updates.
However, as far as I can see, none of the descriptions of the algorithms for this data structure even mention how to actually calculate the requisite Hilbert Value, which is the distance along a Hilbert Curve to the point.
So any suggestions for how to go about calculating this?
Fun question!
I did a bit of googling, and the good news is, I've found an implementation of Hilbert Value.
The potentially bad news is, it's in Haskell...
http://www.serpentine.com/blog/2007/01/11/two-dimensional-spatial-hashing-with-space-filling-curves/
It also proposes a Lebesgue distance metric you might be able to compute more easily.
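For a 2D grid, the widely circulated iterative conversion from cell coordinates to Hilbert distance can be sketched in C as below. The names hilbert_rot and hilbert_xy2d are chosen for this example, and n is assumed to be a power of two giving the side length of the grid.

```c
/* Rotate/flip a quadrant so the sub-curve has the standard orientation. */
static void hilbert_rot(int n, int *x, int *y, int rx, int ry) {
    if (ry == 0) {
        if (rx == 1) {
            *x = n - 1 - *x;
            *y = n - 1 - *y;
        }
        int t = *x;   /* swap x and y */
        *x = *y;
        *y = t;
    }
}

/* Distance along the Hilbert curve of cell (x, y) in an n-by-n grid,
   where n is a power of two and 0 <= x, y < n. */
static int hilbert_xy2d(int n, int x, int y) {
    int d = 0;
    for (int s = n / 2; s > 0; s /= 2) {
        int rx = (x & s) > 0;   /* which half in x at this scale */
        int ry = (y & s) > 0;   /* which half in y at this scale */
        d += s * s * ((3 * rx) ^ ry);
        hilbert_rot(n, &x, &y, rx, ry);
    }
    return d;
}
```

On a 2x2 grid this visits (0,0), (0,1), (1,1), (1,0) in that order, and on any n-by-n grid the curve ends at (n-1, 0).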
Below is my Java code adapted from C code in the paper "Encoding and decoding the Hilbert order" by Xian Liu and Gunther Schrack, published in Software: Practice and Experience Vol. 26 pp 1335-46 (1996).
Hope this helps. Improvements welcome!
Michael
/**
* Find the Hilbert order (=vertex index) for the given grid cell
* coordinates.
* @param x cell column (from 0)
* @param y cell row (from 0)
* @param r resolution of Hilbert curve (grid will have Math.pow(2,r)
* rows and cols)
* @return Hilbert order
*/
public static int encode(int x, int y, int r) {
int mask = (1 << r) - 1;
int hodd = 0;
int heven = x ^ y;
int notx = ~x & mask;
int noty = ~y & mask;
int temp = notx ^ y;
int v0 = 0, v1 = 0;
for (int k = 1; k < r; k++) {
v1 = ((v1 & heven) | ((v0 ^ noty) & temp)) >> 1;
v0 = ((v0 & (v1 ^ notx)) | (~v0 & (v1 ^ noty))) >> 1;
}
hodd = (~v0 & (v1 ^ x)) | (v0 & (v1 ^ noty));
return interleaveBits(hodd, heven);
}
/**
* Interleave the bits from two input integer values
* @param odd integer holding bit values for odd bit positions
* @param even integer holding bit values for even bit positions
* @return the integer that results from interleaving the input bits
*
* @todo: I'm sure there's a more elegant way of doing this !
*/
private static int interleaveBits(int odd, int even) {
int val = 0;
// Replaced this line with the improved code provided by Tuska
// int n = Math.max(Integer.highestOneBit(odd), Integer.highestOneBit(even));
int max = Math.max(odd, even);
int n = 0;
while (max > 0) {
n++;
max >>= 1;
}
for (int i = 0; i < n; i++) {
int bitMask = 1 << i;
int a = (even & bitMask) > 0 ? (1 << (2*i)) : 0;
int b = (odd & bitMask) > 0 ? (1 << (2*i+1)) : 0;
val += a + b;
}
return val;
}
See uzaygezen.
The code and Java code above are fine for 2D data points. But for higher dimensions you may need to look at Jonathan Lawder's paper: J.K.Lawder. Calculation of Mappings Between One and n-dimensional Values Using the Hilbert Space-filling Curve.
I figured out a slightly more efficient way to interleave bits. It can be found at the Stanford Graphics Website. I included a version that I created that can interleave two 32 bit integers into one 64 bit long.
public static long spreadBits32(int y) {
long[] B = new long[] {
0x5555555555555555L,
0x3333333333333333L,
0x0f0f0f0f0f0f0f0fL,
0x00ff00ff00ff00ffL,
0x0000ffff0000ffffL,
0x00000000ffffffffL
};
int[] S = new int[] { 1, 2, 4, 8, 16, 32 };
long x = y;
x = (x | (x << S[5])) & B[5];
x = (x | (x << S[4])) & B[4];
x = (x | (x << S[3])) & B[3];
x = (x | (x << S[2])) & B[2];
x = (x | (x << S[1])) & B[1];
x = (x | (x << S[0])) & B[0];
return x;
}
public static long interleave64(int x, int y) {
return spreadBits32(x) | (spreadBits32(y) << 1);
}
Obviously, the B and S local variables should be class constants but it was left this way for simplicity.
Michael,
thanks for your Java code! I tested it and it seems to work fine, but I noticed that the bit-interleaving function overflows at recursion level 7 (at least in my tests, but I used long values), because the "n"-value is calculated using highestOneBit()-function, which returns the value and not the position of the highest one bit; so the loop does unnecessarily many interleavings.
I just changed it to the following snippet, and after that it worked fine.
int max = Math.max(odd, even);
int n = 0;
while (max > 0) {
n++;
max >>= 1;
}
If you need a spatial index with fast delete/insert capabilities, have a look at the PH-tree. It is partly based on quadtrees but is faster and more space efficient. Internally it uses a Z-curve, which has slightly worse spatial properties than an H-curve but is much easier to calculate.
Paper: http://www.globis.ethz.ch/script/publication/download?docid=699
Java implementation: http://globis.ethz.ch/files/2014/11/ph-tree-2014-11-10.zip
Another option is the X-tree, which is also available here:
https://code.google.com/p/xxl/
Suggestion: A good simple efficient data structure for spatial queries is a multidimensional binary tree.
In a traditional binary tree, there is one "discriminant"; the value that's used to determine whether you take the left branch or the right branch. This can be considered to be the one-dimensional case.
In a multidimensional binary tree, you have multiple discriminants; consecutive levels use different discriminants. For example, for two-dimensional spatial data, you could use the X and Y coordinates as discriminants. Consecutive levels would use X, Y, X, Y...
For spatial queries (for example finding all nodes within a rectangle) you do a depth-first search of the tree starting at the root, and you use the discriminant at each level to avoid searching down branches that contain no nodes in the given rectangle.
This allows you to potentially cut the search space in half at each level, making it very efficient for finding small regions in a massive data set. (BTW, this data structure is also useful for partial-match queries, i.e. queries that omit one or more discriminants. You just search down both branches at levels with an omitted discriminant.)
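As a rough illustration of the rectangle query described above, here is a minimal 2D sketch in C. The KdNode type and the functions kd_insert and kd_count_in_rect are hypothetical names for this example, the tree is not balanced, and it assumes points with a smaller discriminant go left while equal or larger ones go right.

```c
#include <stddef.h>

typedef struct KdNode {
    double pt[2];                 /* pt[0] = X, pt[1] = Y */
    struct KdNode *left;
    struct KdNode *right;
} KdNode;

/* Insert a caller-allocated node; the discriminant alternates X, Y, X, ... */
static KdNode *kd_insert(KdNode *root, KdNode *fresh, int depth) {
    if (root == NULL) {
        fresh->left = NULL;
        fresh->right = NULL;
        return fresh;
    }
    int axis = depth % 2;
    if (fresh->pt[axis] < root->pt[axis]) {
        root->left = kd_insert(root->left, fresh, depth + 1);
    } else {
        root->right = kd_insert(root->right, fresh, depth + 1);
    }
    return root;
}

/* Count the points inside the closed rectangle [lo, hi], skipping
   subtrees that the discriminant shows cannot overlap it. */
static int kd_count_in_rect(const KdNode *node, int depth,
                            const double lo[2], const double hi[2]) {
    if (node == NULL) {
        return 0;
    }
    int axis = depth % 2;
    int count = 0;
    if (node->pt[0] >= lo[0] && node->pt[0] <= hi[0] &&
        node->pt[1] >= lo[1] && node->pt[1] <= hi[1]) {
        count = 1;
    }
    if (lo[axis] <= node->pt[axis]) {   /* rectangle reaches the left side */
        count += kd_count_in_rect(node->left, depth + 1, lo, hi);
    }
    if (hi[axis] >= node->pt[axis]) {   /* rectangle reaches the right side */
        count += kd_count_in_rect(node->right, depth + 1, lo, hi);
    }
    return count;
}
```

A partial-match query falls out of the same code by setting lo and hi to the full range on the omitted coordinate, so both branches are searched at those levels.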
A good paper on this data structure: http://portal.acm.org/citation.cfm?id=361007
This article has good diagrams and algorithm descriptions: http://en.wikipedia.org/wiki/Kd-tree