There is no atomic minimum operation in OpenMP, and no intrinsic for it in Intel MIC's instruction set.
#pragma omp critical performs very poorly here.
I want to know whether there is a high-performance implementation of an atomic minimum for Intel MIC.
According to the OpenMP 4.0 specification (Section 2.12.6), there are many fast atomic operations you can perform with the #pragma omp atomic construct in place of #pragma omp critical (thereby avoiding the huge overhead of its lock).
Overview of the possibilities with the #pragma omp atomic construct
Let x be your thread-shared variable:
With #pragma omp atomic read you can atomically let your shared variable x be read:
v = x;
With #pragma omp atomic write you can atomically assign a new value to your shared variable x; the new-value expression (expr) must be independent of x:
x = expr;
With #pragma omp atomic update you can atomically update your shared variable x; specifically, you can only assign a new value as a binary operation (binop) between an x-independent expression and x:
x++;
x--;
++x;
--x;
x binop= expr;
x = x binop expr;
x = expr binop x;
With #pragma omp atomic capture you can atomically read and update your shared variable x (in either order); capture is in fact a combination of the read and update constructs:
You have short forms for update and then read:
v = ++x;
v = --x;
v = x binop= expr;
v = x = x binop expr;
v = x = expr binop x;
And their structured-block analogs:
{--x; v = x;}
{x--; v = x;}
{++x; v = x;}
{x++; v = x;}
{x binop= expr; v = x;}
{x = x binop expr; v = x;}
{x = expr binop x; v = x;}
And you have a few short forms for read and then update:
v = x++;
v = x--;
And again their structured-block analogs:
{v = x; x++;}
{v = x; ++x;}
{v = x; x--;}
{v = x; --x;}
And finally you have additional read-then-update forms, which exist only as structured blocks:
{v = x; x binop= expr;}
{v = x; x = x binop expr;}
{v = x; x = expr binop x;}
{v = x; x = expr;}
In the preceding expressions:
x and v are both l-value expressions with scalar type;
expr is an expression with scalar type;
binop is one of +, *, -, /, &, ^, |, << or >>;
binop, binop=, ++ and -- are not overloaded operators.
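Note that min is not among the supported binop operators, so an atomic minimum still has to be built another way. A minimal sketch using a C++11 std::atomic compare-exchange retry loop rather than an OpenMP construct (the helper name atomic_min is my own, not a standard API):

```cpp
#include <atomic>

// Hypothetical helper: atomically set target = min(target, value)
// via a compare-exchange retry loop.
void atomic_min(std::atomic<int>& target, int value) {
    int current = target.load(std::memory_order_relaxed);
    // Stop once the stored value is already <= value; otherwise retry
    // until the smaller value is installed. On failure,
    // compare_exchange_weak reloads `current` with the latest value.
    while (value < current &&
           !target.compare_exchange_weak(current, value)) {
    }
}
```

The loop terminates quickly in practice: each failed exchange means another thread installed a value, and the retry only continues while our candidate is still smaller.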
For a noise shader I'm looking for a pseudo-random number algorithm with a 3D vector argument, i.e.,
for every integer vector it returns a value in [0,1].
It should be as fast as possible, without producing visual artifacts, and give the same results on every GPU.
Two variants (pseudo code) I found are
rand1(vec3 (x,y,z)){
return xorshift32(x ^ xorshift32(y ^ xorshift32(z)));
}
which already uses 20 arithmetic operations and still has to be cast and normalized, and
rand2(vec3 v){
return fract(sin(dot(v, vec3(12.9898, 78.233, ?))) * 43758.5453);
};
which might be faster but uses sin, causing precision problems and different results on different GPUs.
Do you know any other algorithms requiring less arithmetic operations?
Thanks in advance.
Edit: Another standard PRNG is XORWOW, implemented in C (with the state made static so it persists across calls) as:
unsigned xorwow(void) {
    static unsigned x = 123456789, y = 362436069, z = 521288629,
                    w = 88675123, v = 5783321, d = 6615241;
    unsigned t = x ^ (x >> 2);
    x = y;
    y = z;
    z = w;
    w = v;
    v = (v ^ (v << 4)) ^ (t ^ (t << 1));
    return (d += 362437) + v;
}
Can we rewrite it to fit in our context?
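One direction worth considering, sketched below in C-style code for clarity rather than GLSL: collapse the three coordinates with a stateless multiply-xorshift mixer. Because it is pure integer math, it is deterministic across GPUs, unlike the sin-based variant. The constants and the helper name hash3 are illustrative, not taken from any of the generators above:

```cpp
#include <cstdint>

// Illustrative stateless 3D integer hash: mix the coordinates with
// multiply and xorshift rounds, then scale the 32-bit result into [0, 1].
// The multiplier constants are arbitrary odd numbers, not canonical.
float hash3(uint32_t x, uint32_t y, uint32_t z) {
    uint32_t h = x * 0x8da6b343u ^ y * 0xd8163841u ^ z * 0xcb1ab31fu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h * (1.0f / 4294967296.0f);  // normalize to [0, 1]
}
```

This uses far fewer operations than three chained xorshift32 calls, needs no state, and the normalization is a single multiply.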
I am using Eigen on Google Cloud Platform, on an Ubuntu machine. I have installed gcc-7 and I am trying to run a C++ program containing a huge matrix (1000x1000) of doubles.
The code runs fine on Xcode 11.3 with clang, but on the Ubuntu machine I am having a problem with a variable type conversion.
The part of the code which is causing problems:
Matrix<double, Dynamic, Dynamic> Mat(double x, double y, double En, int Np, double Kmax){
    Matrix<double, Dynamic, Dynamic> M(Np, Np);
    double eps = (double)Kmax / (double)Np;
    double result, error;
    struct params ps;
    ps.x = x;
    ps.y = y;
    ps.En = En;
    gsl_set_error_handler_off();
    gsl_integration_workspace * w = gsl_integration_workspace_alloc (iters);
    gsl_function F;
    F.function = &intDeltFunc;
    F.params = &ps;
    gsl_integration_qags (&F, 0, y, 0, prec, iters, w, &result, &error);
    gsl_integration_workspace_free (w);
    for ( int i = 0; i < Np ; i++){
        for ( int j = 0; j < Np ; j++){
            double p = eps * i + y;
            double q = eps * j + y;
            if ( (j == 0) || (j == Np - 1) ){
                M(i , j) = Diag(p, q, x, y, En) + eps * Integral(p, q, x, y, En, result) / 2.0;
            }
            else{
                M(i , j) = Diag(p, q, x, y, En) + eps * Integral(p, q, x, y, En, result);
            }
        }
    }
    return M;
}
Here Np = 1000.
The problem encountered with ubuntu:
main.cpp:247:26: error: no viable overloaded '='
M(i , j) = DiagTerm + 0.5 * eps * IntegralTerm;
~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/include/Eigen/src/Core/MatrixBase.h:139:14: note: candidate function not viable: no known conversion
from 'double' to 'const Eigen::MatrixBase<Eigen::IndexedView<Eigen::Matrix<double, -1, -1, 0, -1, -1>,
double, double> >' for 1st argument
Derived& operator=(const MatrixBase& other);
I have tried re-downloading Eigen multiple times, pointing the compiler to its exact location when compiling, and also using clang (which Xcode uses), and I still have the problem.
Do you have any ideas what might be the problem?
final Stream<Integer> numbers = Stream.of(5, 3, 2, 7, 3, 13, 7).parallel();
Why is the output of the following line 7?
numbers.reduce(1, (a, b) -> a + b, (x, y) -> x - y);
I have not looked at that link from the comments, but the documentation is pretty clear about identity and it even provides a simple way of testing that:
The identity value must be an identity for the combiner function. This means that for all u, combiner(identity, u) is equal to u
So let's simplify your example a bit:
Stream<Integer> numbers = Stream.of(3, 1).parallel();
BiFunction<Integer, Integer, Integer> accumulator = (a, b) -> a + b;
BinaryOperator<Integer> combiner = (x, y) -> x - y;
int result = numbers.reduce(
        1,
        accumulator,
        combiner);
System.out.println(result);
Let's say that u = 3 (just a random element from the Stream); thus:
int identity = 1;
int u = 3;
int toTest = combiner.apply(identity, u);
System.out.println(toTest == u); // must be true, but is false
Even if you replaced the identity with zero so that this first condition held, the documentation imposes another requirement:
Additionally, combiner function must be compatible with the accumulator function; for all u and t, the following must hold:
combiner.apply(u, accumulator.apply(identity, t)) == accumulator.apply(u, t)
You can make the same test:
int identity = 0;
int u = 3;
int t = 1;
boolean compatibilityRespected =
        combiner.apply(u, accumulator.apply(identity, t)) == accumulator.apply(u, t);
System.out.println(compatibilityRespected); // prints false
Originally this post requested an inverse sheep-and-goats operation, but I realized that was more than I really needed, so I edited the title; I only need an expand-right algorithm, which is simpler. The example described below is still relevant.
Original Post:
I'm trying to figure out how to do either an inverse sheep-and-goats operation or, even better, an expand-right-flip.
According to Hacker's Delight, a sheep-and-goats operation can be represented by:
SAG(x, m) = compress_left(x, m) | compress(x, ~m)
According to this site, the inverse can be found by:
INV_SAG(x, m, sw) = expand_left(x, ~m, sw) | expand_right(x, m, sw)
However, I can't find any code for the expand_left and expand_right functions. They are, of course, the inverses of compress, but compress is hard enough to understand by itself.
Example:
To better explain what I'm looking for, consider a set of 8 bits like:
0000abcd
The variables a, b, c and d may be either ones or zeros. In addition, there is a mask which repositions the bits. So for example, if the mask were 01100101, the resulting bits would be repositioned as follows:
0ab00c0d
This can be done with an inverse sheep-and-goats operation. However, according to this section of the site mentioned above, there is a more efficient way, which the author refers to as the expand-right-flip. Looking at the site, I was unable to figure out how that can be done.
Here's expand_right from Hacker's Delight; the book just calls it expand, but it's the right-expanding version.
unsigned expand(unsigned x, unsigned m) {
    unsigned m0, mk, mp, mv, t;
    unsigned array[5];
    int i;

    m0 = m;                 // Save original mask.
    mk = ~m << 1;           // We will count 0's to right.
    for (i = 0; i < 5; i++) {
        mp = mk ^ (mk << 1);            // Parallel suffix.
        mp = mp ^ (mp << 2);
        mp = mp ^ (mp << 4);
        mp = mp ^ (mp << 8);
        mp = mp ^ (mp << 16);
        mv = mp & m;                    // Bits to move.
        array[i] = mv;
        m = (m ^ mv) | (mv >> (1 << i));  // Compress m.
        mk = mk & ~mp;
    }
    for (i = 4; i >= 0; i--) {
        mv = array[i];
        t = x << (1 << i);
        x = (x & ~mv) | (t & mv);
    }
    return x & m0;          // Clear out extraneous bits.
}
You can use expand_left(x, m) == expand_right(x >> (32 - popcnt(m)), m) to make the left version, but that's probably not the best way.
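To check the routine against the 0000abcd example, here is a straightforward bit-at-a-time reference version of expand-right (the name expand_right_ref is my own); it deposits the low bits of x into the set positions of the mask from the LSB upward, which is also exactly what the x86 BMI2 PDEP instruction does in hardware:

```cpp
// Reference expand-right: deposit the low bits of x into the set
// positions of mask, lowest set position first.
unsigned expand_right_ref(unsigned x, unsigned mask) {
    unsigned result = 0;
    for (unsigned bit = 1; mask != 0; bit <<= 1) {
        unsigned lowest = mask & (0u - mask);  // isolate lowest set mask bit
        if (x & bit)
            result |= lowest;                  // deposit the next bit of x
        mask &= mask - 1;                      // clear that mask bit
    }
    return result;
}
```

With x = 0b00001011 (a=1, b=0, c=1, d=1) and mask = 0b01100101, this yields 0b01000101, i.e. 0ab00c0d as in the example above.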
How do you swap two numbers in place without using any additional space?
You can do it using the XOR operator:
if( x != y) { // this check is very important.
x ^= y;
y ^= x;
x ^= y;
}
EDIT:
Without the additional check, the above logic fails when swapping a variable with itself: it zeroes the value instead of leaving it unchanged.
Example:
int x = 10;
If I apply the above logic to swap x with itself without the check, I end up with x = 0, which is incorrect.
Similarly, if I put the logic (without the check) in a function and call it to swap two references to the same variable, it fails.
If you have two variables a and b (each occupying its own memory address):
a = a xor b
b = a xor b
a = a xor b
There are also some other variations of this problem, but they fail on overflow (and the multiply/divide variant additionally fails when a value is zero or when integer division truncates):
a=a+b
b=a-b
a=a-b
a=a*b
b=a/b
a=a/b
The plus and minus variation may work if you have custom types that have + and - operators that make sense.
Note: to avoid confusion, if you have only one variable and two references or pointers to it, then all of the above will fail; a check should be made to avoid this.
Unlike what a lot of people are saying, it does not matter whether the two numbers are different. It only matters that you have two distinct variables, i.e., that the values live at two different memory addresses.
I.e. this is perfectly valid:
int a = 3;
int b = 3;
a = a ^ b;
b = a ^ b;
a = a ^ b;
assert(a == b);
assert(a == 3);
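The aliasing pitfall is easy to demonstrate in C++ when the swap goes through pointers; this sketch (the function name xor_swap is mine) adds the guard described above:

```cpp
// XOR swap through pointers. Without the a == b guard, swapping a
// variable with itself would zero it, because *a ^= *b clears the value
// when both pointers refer to the same object.
void xor_swap(int* a, int* b) {
    if (a == b)
        return;  // same memory location: nothing to do
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}
```

The guard compares addresses, not values, which is exactly the distinction made above: equal values in distinct variables are fine, while two pointers to the same variable are not.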
The XOR trick is the standard answer:
int x = 1, y = 2;
x ^= y;
y ^= x;
x ^= y;
XORing is considerably less clear than just using a temp, though, and it fails if x and y are the same location.
Since no language was mentioned, in Python:
y, x = x, y