How can I swap two numbers in place without using any additional space?
You can do it using the XOR operator:
if( x != y) { // this check is very important.
x ^= y;
y ^= x;
x ^= y;
}
EDIT:
Without the additional check the above logic fails to swap the number with itself.
Example:
int x = 10;
If I apply the above logic to swap x with itself without the check, I end up with x=0, which is incorrect.
Similarly if I put the logic without the check in a function and call the function to swap two references to the same variable, it fails.
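A minimal C sketch of that failure mode. The function names xor_swap_unchecked and xor_swap are made up for illustration; the guarded version compares addresses, which is what actually matters when two pointers may refer to the same variable.

```c
/* Unchecked XOR swap: if x and y point to the same object,
   the first statement computes v ^ v and zeroes it. */
static void xor_swap_unchecked(int *x, int *y) {
    *x ^= *y;
    *y ^= *x;
    *x ^= *y;
}

/* Guarded XOR swap: skips the work when both pointers alias. */
static void xor_swap(int *x, int *y) {
    if (x != y) {
        *x ^= *y;
        *y ^= *x;
        *x ^= *y;
    }
}
```

Calling xor_swap_unchecked(&x, &x) with x == 10 leaves x == 0, while xor_swap(&x, &x) leaves it untouched. Two distinct variables holding the same value are swapped correctly by either version.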
If you have 2 variables a and b: (each variable occupies its own memory address)
a = a xor b
b = a xor b
a = a xor b
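There are also some other variations to this problem, one additive and one multiplicative, but they will fail if there is overflow (and the multiplicative one also fails if either value is zero):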
a=a+b
b=a-b
a=a-b
a=a*b
b=a/b
a=a/b
The plus and minus variation may work if you have custom types that have + and - operators that make sense.
Note: To avoid confusion, if you have only 1 variable, and 2 references or pointers to it, then all of the above will fail. A check should be made to avoid this.
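As a hedged illustration of the additive variation: in C, signed overflow is undefined, but unsigned arithmetic wraps modulo 2^N, so a version restricted to unsigned operands still round-trips even when the intermediate sum wraps. The name add_swap is hypothetical, and the address check covers the aliasing case from the note above.

```c
/* Additive swap on unsigned values. Wraparound in the sum is harmless
   because unsigned arithmetic in C is defined modulo 2^N. */
static void add_swap(unsigned int *a, unsigned int *b) {
    if (a != b) {          /* same object: nothing to do, and the trick would zero it */
        *a += *b;          /* a = a + b (mod 2^N) */
        *b = *a - *b;      /* b = original a */
        *a -= *b;          /* a = original b */
    }
}
```

This does not rescue the signed or the multiplicative forms; it only shows why the choice of type matters for the plus/minus version.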
Contrary to what a lot of people are saying, it does not matter whether you have 2 different numbers. It only matters that you have 2 distinct variables, i.e. that the numbers live at 2 different memory addresses.
I.e. this is perfectly valid:
int a = 3;
int b = 3;
a = a ^ b;
b = a ^ b;
a = a ^ b;
assert(a == b);
assert(a == 3);
The xor trick is the standard answer:
int x, y;
x ^= y;
y ^= x;
x ^= y;
XORing is considerably less clear than just using a temp, though, and it fails if x and y are the same location.
Since no language was mentioned, in Python:
y, x = x, y
For a noise shader I'm looking for a pseudo random number algorithm with 3d vector argument, i.e.,
for every integer vector it returns a value in [0,1].
It should be as fast as possible without producing visual artifacts and giving the same results on every GPU.
Two variants (pseudo code) I found are
rand1(vec3 (x,y,z)){
return xorshift32(x ^ xorshift32(y ^ xorshift32(z)));
}
which already uses 20 arithmetic operations and still has to be cast and normalized, and
rand2(vec3 v){
return fract(sin(dot(v, vec3(12.9898, 78.233, ?))) * 43758.5453);
};
which might be faster but uses sin, causing precision problems and different results on different GPUs.
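For reference, here is how the first variant might look when written out in plain C, so the operation count and the normalization step are explicit. This is only a sketch: it assumes the usual 13/17/5 xorshift32 step and maps the top 24 bits of the hash to a float in [0, 1). The 24-bit normalization is my assumption, not part of the pseudocode above.

```c
#include <stdint.h>

/* One xorshift32 step (shift constants 13, 17, 5). Note that 0 maps to 0. */
static uint32_t xorshift32(uint32_t x) {
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

/* Nested hash of an integer 3D coordinate, normalized to [0, 1).
   Only integer operations are used before the final conversion,
   so the hash itself is bit-identical on any conforming platform. */
static float rand1(int32_t x, int32_t y, int32_t z) {
    uint32_t h = xorshift32((uint32_t)x ^
                 xorshift32((uint32_t)y ^
                 xorshift32((uint32_t)z)));
    return (float)(h >> 8) * (1.0f / 16777216.0f);  /* 24 bits / 2^24 */
}
```

Because xorshift32 maps 0 to 0, the cell (0, 0, 0) always hashes to 0 in this form; adding a nonzero constant to one of the inputs would be one way around that.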
Do you know any other algorithms requiring less arithmetic operations?
Thanks in advance.
Edit: Another standard PRNG is XORWOW as implemented in C as
xorwow() {
int x = 123456789, y = 362436069, z = 521288629, w = 88675123, v = 5783321, d = 6615241;
int t = (x ^ (x >> 2));
x = y;
y = z;
z = w;
w = v;
v = (v ^ (v << 4)) ^ (t ^ (t << 1));
return (d += 362437) + v;
}
Can we rewrite it to fit in our context?
I need to solve the linear system (A + i * mu * I) x = b, where A is a dense Hermitian matrix (6x6, complex entries), mu is a real scalar, and I is the identity matrix.
Obviously if mu=0, I should just use Cholesky and be done with it. With non-zero mu though, the matrix ceases to be Hermitian and Cholesky fails.
Possible solutions:
Solve normal operator using Cholesky and multiply by the conjugate
Solve directly using LU decomposition
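To spell out the first option: writing M for the shifted matrix, the normal operator simplifies because A is Hermitian and commutes with the identity, so the cross terms cancel. The symbol M is introduced here only for the derivation.

```latex
\begin{aligned}
M &= A + i\mu I, \qquad A^{H} = A,\ \mu \in \mathbb{R} \\
M^{H} M &= (A - i\mu I)(A + i\mu I) = A^{2} + \mu^{2} I \\
x &= \left(A^{2} + \mu^{2} I\right)^{-1} (A - i\mu I)\, b
\end{aligned}
```

For nonzero mu the matrix A^2 + mu^2 I is Hermitian positive definite, so Cholesky applies. Its condition number is the square of that of M, though, which may be part of why this route needs double precision.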
This is in a time-critical performance routine, where I need the most efficient method. Any thoughts on the optimum approach, or if there is a specific method for solving the above shifted Hermitian system?
This is to be deployed in a CUDA kernel, where I'll be solving many linear systems in parallel, e.g., one per thread. This means that I need a solution that minimizes thread divergence. Given the small system size, pivoting can be ignored without too much issue: this removes a possible source of thread divergence. I've already implemented an in-place Cholesky normal method, and while it's working ok, the performance isn't great in double precision.
I can't vouch for the stability of the method below, but if your matrix is reasonably well conditioned, it might be worth a try.
We want to solve
A*X = B
If we pick out the first row and column, say
A = ( a y )
( z A_ )
X = ( x )
( X_)
B = ( b )
( B_ )
The requirement is
a*x + y*X_ = b
z*x + A_*X_ = B_
so
x = (b - y*X_ )/a
(A_ - zy/a) * X_ = B_ - (b/a)z
The solution goes in two stages. First use the second equation to transform A and B (forward elimination), then use the first to form the solution X (back substitution).
In C:
static void nhsol( int dim, complx* A, complx* B, complx* X)
{
    int i, j, k;
    complx a, fb, fa;
    complx* z;
    complx* acol;

    // update A and B (forward elimination, no pivoting)
    // A is stored column by column: element (row, col) is A[row + col*dim]
    for( i=0; i<dim; ++i)
    {
        z = A + i*dim;   // column i
        a = z[i];        // pivot
        // update B
        fb = B[i]/a;
        for( j=i+1; j<dim; ++j)
        {
            B[j] -= fb*z[j];
        }
        // update A
        for( k=i+1; k<dim; ++k)
        {
            acol = A + k*dim;
            fa = acol[i]/a;
            for( j=i+1; j<dim; ++j)
            {
                acol[j] -= fa*z[j];
            }
        }
    }
    // compute x (back substitution)
    i = dim-1;
    X[i] = B[i] / A[i+dim*i];
    while( --i>=0)
    {
        complx s = B[i];
        for( j=i+1; j<dim; ++j)
        {
            s -= A[i+j*dim]*X[j];
        }
        X[i] = s/A[i+i*dim];
    }
}
where
typedef _Complex double complx;
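Since I can't vouch for the stability, a residual check is a cheap way to see how well a solve went on your own matrices. The helper below is a sketch with a made-up name, max_residual_sq. It assumes the same column-major layout as the routine above, and it must be given untouched copies of A and B, because the routine overwrites both in place.

```c
#include <complex.h>

typedef double _Complex complx;

/* Largest squared magnitude of (A*x - b)_i over all rows i.
   A is column-major: element (i, j) lives at A[i + j*dim]. */
static double max_residual_sq(int dim, const complx *A,
                              const complx *x, const complx *b) {
    double worst = 0.0;
    for (int i = 0; i < dim; ++i) {
        complx r = -b[i];
        for (int j = 0; j < dim; ++j) {
            r += A[i + j*dim] * x[j];
        }
        double m = creal(r)*creal(r) + cimag(r)*cimag(r);
        if (m > worst) {
            worst = m;
        }
    }
    return worst;
}
```

Comparing this against a tolerance scaled by the magnitude of b gives a rough go/no-go signal without needing a second solver.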
If code space is not at a premium, it might be worth unrolling the loops. Personally, I would do this by writing a program whose sole job is to write the code.
I'm interested in a fast method for "expanding bits," which can be defined as the following:
Let B be a binary number with n bits, i.e. B \in {0,1}^n
Let P be the positions of all 1/true bits in B, i.e. ((B >> p[i]) & 1) == 1, and |P|=k
For another given number, A \in {0,1}^k, let Ap be the bit-expanded form of A given B, such that Ap[j] == A[j] << p[j].
The result of the "bit expansion" is Ap.
A couple examples:
Given B: 0010 1110, A: 0110, then Ap should be 0000 1100
Given B: 1001 1001, A: 1101, then Ap should be 1001 0001
The following is a straightforward algorithm, but I can't shake the feeling that there's a faster/easier way to do this.
unsigned int expand_bits(unsigned int A, unsigned int B, int n) {
int k = __popc(B); // CUDA intrinsic, but there are good portable methods for this
unsigned int Ap = 0;
int j = k-1;
// Starting at the most significant bit,
for (int i = n - 1; i >= 0; --i) {
Ap <<= 1;
// if B is 1, add the value at A[j] to Ap, decrement j.
if (B & (1 << i)) {
Ap += (A >> j--) & 1;
}
}
return Ap;
}
The question appears to be asking for a CUDA emulation of the BMI2 instruction PDEP, which takes a source operand a, and deposits its bits based on the positions of the 1-bits of a mask b. There is no hardware support for an identical, or a similar, operation on currently shipping GPUs; that is, up to and including the Maxwell architecture.
I am assuming, based on the two examples given, that the mask b in general is sparse, and that we can minimize work by only iterating over the 1-bits of b. This could cause divergent branches on the GPU, but the exact trade-off in performance is unknown without knowledge of a specific use case. For now, I am assuming that the exploitation of sparsity in the mask b has a stronger positive influence on performance compared to the negative impact of divergence.
In the emulation code below, I have reduced the use of potentially "expensive" shift operations, instead relying mostly on simple ALU instructions. On various GPUs, shift instructions are executed with lower throughput than simple integer arithmetic. I have retained a single shift, off the critical path through the code, to avoid becoming execution limited by the arithmetic units. If desired, the expression 1U << i can be replaced by addition: introduce a variable m that is initialized to 1 before the loop and doubled each time through the loop.
The basic idea is to isolate each 1-bit of mask b in turn (starting at the least significant end), AND it with the value of the i-th bit of a, and incorporate the result into the expanded destination. After a 1-bit from b has been used, we remove it from the mask, and iterate until the mask becomes zero.
In order to avoid shifting the i-th bit of a into place, we simply isolate it and then replicate its value to all more significant bits by simple negation, taking advantage of the two's complement representation of integers.
/* Emulate PDEP: deposit the bits of 'a' (starting with the least significant
bit) at the positions indicated by the set bits of the mask stored in 'b'.
*/
__device__ unsigned int my_pdep (unsigned int a, unsigned int b)
{
unsigned int l, s, r = 0;
int i;
for (i = 0; b; i++) { // iterate over 1-bits in mask, until mask becomes 0
l = b & (0 - b); // extract mask's least significant 1-bit
b = b ^ l; // clear mask's least significant 1-bit
s = 0 - (a & (1U << i)); // spread i-th bit of 'a' to more signif. bits
r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
}
return r;
}
The variant without any shift operations alluded to above looks as follows:
/* Emulate PDEP: deposit the bits of 'a' (starting with the least significant
bit) at the positions indicated by the set bits of the mask stored in 'b'.
*/
__device__ unsigned int my_pdep (unsigned int a, unsigned int b)
{
unsigned int l, s, r = 0, m = 1;
while (b) { // iterate over 1-bits in mask, until mask becomes 0
l = b & (0 - b); // extract mask's least significant 1-bit
b = b ^ l; // clear mask's least significant 1-bit
s = 0 - (a & m); // spread i-th bit of 'a' to more significant bits
r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
m = m + m; // mask for next bit of 'a'
}
return r;
}
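To sanity-check the emulation on the host, the shift-free variant can be ported to plain C by dropping the __device__ qualifier and compared against a naive loop. This is a sketch: pdep_emul and pdep_ref are names chosen here, and 32-bit unsigned int is assumed.

```c
/* Host-side port of the shift-free emulation above (same logic). */
static unsigned int pdep_emul(unsigned int a, unsigned int b) {
    unsigned int l, s, r = 0, m = 1;
    while (b) {
        l = b & (0 - b);   /* lowest set bit of the mask */
        b ^= l;            /* remove it from the mask */
        s = 0 - (a & m);   /* all ones from bit m upward if that bit of a is set */
        r |= l & s;        /* deposit at the mask position */
        m += m;            /* next bit of a */
    }
    return r;
}

/* Naive reference: walk the mask from the least significant bit,
   consuming one bit of a for every set mask bit. */
static unsigned int pdep_ref(unsigned int a, unsigned int b) {
    unsigned int r = 0;
    for (unsigned int bit = 1; bit != 0; bit <<= 1) {
        if (b & bit) {
            if (a & 1u) {
                r |= bit;
            }
            a >>= 1;
        }
    }
    return r;
}
```

With the two examples from the question, pdep_emul(0x6, 0x2E) gives 0x0C and pdep_emul(0xD, 0x99) gives 0x91, matching the expected 0000 1100 and 1001 0001.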
In comments below, @Evgeny Kluev pointed to a shift-free PDEP emulation at the chessprogramming website that looks potentially faster than either of my two implementations above; it seems worth a try.
Simply,
X = Integer
Y = Another Integer
Z ( If used ,Integer Temp )
What's the most efficient method ?
Method I :
Z = X
X = Y
Y = Z
Method II :
X ^= Y
Y ^= X
X ^= Y
Edit I [ Assembly View ]
Method I :
MOV
MOV
MOV
Method II :
TEST ( AND )
JZ
XOR
XOR
XOR
Notes :
MOV is slower than XOR
TEST, JZ are used to make the XOR swap equality-safe
Method I uses an extra register
In most cases, using a temporary variable (usually a register at assembly level) is the best choice, and the one that a compiler will tend to generate.
In most practical scenarios, the trivial swap algorithm using a
temporary register is more efficient. Limited situations in which XOR
swapping may be practical include: On a processor where the
instruction set encoding permits the XOR swap to be encoded in a
smaller number of bytes; In a region with high register pressure, it
may allow the register allocator to avoid spilling a register. In
microcontrollers where available RAM is very limited. Because these
situations are rare, most optimizing compilers do not generate XOR
swap code.
http://en.wikipedia.org/wiki/XOR_swap_algorithm
Also, your XOR Swap implementation fails if the same variable is passed as both arguments. A correct implementation (from the same link) would be:
void xorSwap (int *x, int *y) {
if (x != y) {
*x ^= *y;
*y ^= *x;
*x ^= *y;
}
}
Note that the code does not swap the integers passed immediately, but
first checks if their addresses are distinct. This is because, if the
addresses are equal, the algorithm will fold to a triple *x ^= *x
resulting in zero.
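Try this way of swapping numbers. Note that in C and C++ the expression below reads b and assigns to it without any sequencing between the two, so its behaviour is undefined there; it is only well defined in languages that guarantee left-to-right evaluation, such as Java.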
int a,b;
a=a+b-(b=a);
I have an application where a Hilbert R-Tree (wikipedia) (citeseer) would seem to be an appropriate data structure. Specifically, it requires reasonably fast spatial queries over a data set that will experience a lot of updates.
However, as far as I can see, none of the descriptions of the algorithms for this data structure even mention how to actually calculate the requisite Hilbert Value, which is the distance along a Hilbert Curve to the point.
So any suggestions for how to go about calculating this?
Fun question!
I did a bit of googling, and the good news is, I've found an implementation of Hilbert Value.
The potentially bad news is, it's in Haskell...
http://www.serpentine.com/blog/2007/01/11/two-dimensional-spatial-hashing-with-space-filling-curves/
It also proposes a Lebesgue distance metric you might be able to compute more easily.
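If Haskell is a hurdle, the 2D case is short enough to write directly. Below is a C sketch of the common iterative scheme (the same rotate-and-accumulate approach shown in the Wikipedia article on Hilbert curves), not a translation of the linked code. The names hilbert_rot and hilbert_xy2d are mine; n is the grid side length and is assumed to be a power of two no larger than 2^15, so the result fits in a 32-bit int.

```c
/* Rotate/flip the coordinates so the next sub-square is seen
   in the base orientation of the curve. */
static void hilbert_rot(int n, int *x, int *y, int rx, int ry) {
    if (ry == 0) {
        if (rx == 1) {
            *x = n - 1 - *x;
            *y = n - 1 - *y;
        }
        int t = *x;   /* swap x and y */
        *x = *y;
        *y = t;
    }
}

/* Distance along the Hilbert curve of cell (x, y) in an n-by-n grid. */
static int hilbert_xy2d(int n, int x, int y) {
    int d = 0;
    for (int s = n / 2; s > 0; s /= 2) {
        int rx = (x & s) > 0;              /* which half in x */
        int ry = (y & s) > 0;              /* which half in y */
        d += s * s * ((3 * rx) ^ ry);      /* quadrant index times quadrant size */
        hilbert_rot(n, &x, &y, rx, ry);
    }
    return d;
}
```

On a 2x2 grid this visits (0,0), (0,1), (1,1), (1,0) in that order; other curve orientations are equally valid for indexing as long as one is used consistently.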
Below is my Java code adapted from C code in the paper "Encoding and decoding the Hilbert order" by Xian Lu and Gunther Schrack, published in Software: Practice and Experience Vol. 26 pp 1335-46 (1996).
Hope this helps. Improvements welcome !
Michael
/**
* Find the Hilbert order (=vertex index) for the given grid cell
* coordinates.
* @param x cell column (from 0)
* @param y cell row (from 0)
* @param r resolution of Hilbert curve (grid will have Math.pow(2,r)
* rows and cols)
* @return Hilbert order
*/
public static int encode(int x, int y, int r) {
int mask = (1 << r) - 1;
int hodd = 0;
int heven = x ^ y;
int notx = ~x & mask;
int noty = ~y & mask;
int temp = notx ^ y;
int v0 = 0, v1 = 0;
for (int k = 1; k < r; k++) {
v1 = ((v1 & heven) | ((v0 ^ noty) & temp)) >> 1;
v0 = ((v0 & (v1 ^ notx)) | (~v0 & (v1 ^ noty))) >> 1;
}
hodd = (~v0 & (v1 ^ x)) | (v0 & (v1 ^ noty));
return interleaveBits(hodd, heven);
}
/**
* Interleave the bits from two input integer values
* @param odd integer holding bit values for odd bit positions
* @param even integer holding bit values for even bit positions
* @return the integer that results from interleaving the input bits
*
* @todo: I'm sure there's a more elegant way of doing this !
*/
private static int interleaveBits(int odd, int even) {
int val = 0;
// Replaced this line with the improved code provided by Tuska
// int n = Math.max(Integer.highestOneBit(odd), Integer.highestOneBit(even));
int max = Math.max(odd, even);
int n = 0;
while (max > 0) {
n++;
max >>= 1;
}
for (int i = 0; i < n; i++) {
int bitMask = 1 << i;
int a = (even & bitMask) > 0 ? (1 << (2*i)) : 0;
int b = (odd & bitMask) > 0 ? (1 << (2*i+1)) : 0;
val += a + b;
}
return val;
}
See uzaygezen.
The code and Java code above are fine for 2D data points. For higher dimensions, though, you may need to look at Jonathan Lawder's paper: J. K. Lawder, Calculation of Mappings Between One and n-dimensional Values Using the Hilbert Space-filling Curve.
I figured out a slightly more efficient way to interleave bits. It can be found at the Stanford Graphics Website. I included a version that I created that can interleave two 32 bit integers into one 64 bit long.
public static long spreadBits32(int y) {
long[] B = new long[] {
0x5555555555555555L,
0x3333333333333333L,
0x0f0f0f0f0f0f0f0fL,
0x00ff00ff00ff00ffL,
0x0000ffff0000ffffL,
0x00000000ffffffffL
};
int[] S = new int[] { 1, 2, 4, 8, 16, 32 };
long x = y;
x = (x | (x << S[5])) & B[5];
x = (x | (x << S[4])) & B[4];
x = (x | (x << S[3])) & B[3];
x = (x | (x << S[2])) & B[2];
x = (x | (x << S[1])) & B[1];
x = (x | (x << S[0])) & B[0];
return x;
}
public static long interleave64(int x, int y) {
return spreadBits32(x) | (spreadBits32(y) << 1);
}
Obviously, the B and S local variables should be class constants but it was left this way for simplicity.
Michael,
thanks for your Java code! I tested it and it seems to work fine, but I noticed that the bit-interleaving function overflows at recursion level 7 (at least in my tests, but I used long values), because the "n"-value is calculated using highestOneBit()-function, which returns the value and not the position of the highest one bit; so the loop does unnecessarily many interleavings.
I just changed it to the following snippet, and after that it worked fine.
int max = Math.max(odd, even);
int n = 0;
while (max > 0) {
n++;
max >>= 1;
}
If you need a spatial index with fast delete/insert capabilities, have a look at the PH-tree. It is partly based on quadtrees but is faster and more space efficient. Internally it uses a Z-curve, which has slightly worse spatial properties than an H-curve but is much easier to calculate.
Paper: http://www.globis.ethz.ch/script/publication/download?docid=699
Java implementation: http://globis.ethz.ch/files/2014/11/ph-tree-2014-11-10.zip
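To show how little work the Z-curve needs, here is a C sketch of a 2D Z-order (Morton) key built by bit interleaving with the usual magic-mask spreading. It is an illustration only and is not taken from the PH-tree sources; the names spread_bits16 and morton2d are made up.

```c
#include <stdint.h>

/* Spread the low 16 bits of v so they occupy the even bit positions. */
static uint32_t spread_bits16(uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Z-order key: x bits in even positions, y bits in odd positions. */
static uint32_t morton2d(uint16_t x, uint16_t y) {
    return spread_bits16(x) | (spread_bits16(y) << 1);
}
```

There is no per-level rotation state as in the Hilbert encoding, which is where the simplicity comes from.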
Another option is the X-tree, which is also available here:
https://code.google.com/p/xxl/
Suggestion: A good simple efficient data structure for spatial queries is a multidimensional binary tree.
In a traditional binary tree, there is one "discriminant"; the value that's used to determine whether you take the left branch or the right branch. This can be considered to be the one-dimensional case.
In a multidimensional binary tree, you have multiple discriminants; consecutive levels use different discriminants. For example, for two-dimensional spatial data, you could use the X and Y coordinates as discriminants. Consecutive levels would use X, Y, X, Y...
For spatial queries (for example finding all nodes within a rectangle) you do a depth-first search of the tree starting at the root, and you use the discriminant at each level to avoid searching down branches that contain no nodes in the given rectangle.
This allows you to potentially cut the search space in half at each level, making it very efficient for finding small regions in a massive data set. (BTW, this data structure is also useful for partial-match queries, i.e. queries that omit one or more discriminants. You just search down both branches at levels with an omitted discriminant.)
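A compact C sketch of the idea for two dimensions, with hypothetical names (KdNode, kd_insert, kd_count). It inserts points without any balancing and counts the points inside an axis-aligned rectangle, pruning branches using the discriminant of the current level. A production version would also want balancing and deletion.

```c
#include <stdlib.h>

typedef struct KdNode {
    int pt[2];                      /* pt[0] = X, pt[1] = Y */
    struct KdNode *left, *right;
} KdNode;

/* Insert (x, y); levels alternate between X and Y as discriminant.
   Keys smaller than the node's go left, all others go right. */
static KdNode *kd_insert(KdNode *root, int x, int y, int depth) {
    if (root == NULL) {
        KdNode *n = malloc(sizeof *n);
        if (n == NULL) {
            return NULL;            /* allocation failed: nothing inserted */
        }
        n->pt[0] = x;
        n->pt[1] = y;
        n->left = n->right = NULL;
        return n;
    }
    int d = depth % 2;
    int key = (d == 0) ? x : y;
    if (key < root->pt[d]) {
        root->left = kd_insert(root->left, x, y, depth + 1);
    } else {
        root->right = kd_insert(root->right, x, y, depth + 1);
    }
    return root;
}

/* Count points with lo[0] <= X <= hi[0] and lo[1] <= Y <= hi[1]. */
static int kd_count(const KdNode *n, const int lo[2], const int hi[2], int depth) {
    if (n == NULL) {
        return 0;
    }
    int d = depth % 2;
    int count = 0;
    if (n->pt[0] >= lo[0] && n->pt[0] <= hi[0] &&
        n->pt[1] >= lo[1] && n->pt[1] <= hi[1]) {
        count = 1;
    }
    if (lo[d] < n->pt[d]) {         /* left subtree may still overlap the range */
        count += kd_count(n->left, lo, hi, depth + 1);
    }
    if (hi[d] >= n->pt[d]) {        /* right subtree may still overlap the range */
        count += kd_count(n->right, lo, hi, depth + 1);
    }
    return count;
}
```

A partial-match query falls out of the same code: leaving one dimension unconstrained (a very wide lo/hi pair) makes both branches get searched at the levels that use that dimension.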
A good paper on this data structure: http://portal.acm.org/citation.cfm?id=361007
This article has good diagrams and algorithm descriptions: http://en.wikipedia.org/wiki/Kd-tree