There is no atomic minimum operation in OpenMP, and no intrinsic for it in Intel MIC's instruction set.
#pragma omp critical performs very poorly here.
I want to know whether there is a high-performance implementation of an atomic minimum for Intel MIC.
According to the OpenMP 4.0 specification (Section 2.12.6), there are many fast atomic operations you can perform with the #pragma omp atomic construct in place of #pragma omp critical (thereby avoiding the huge overhead of its lock).
Overview of the possibilities with the #pragma omp atomic construct
Let x be your thread-shared variable:
With #pragma omp atomic read you can atomically let your shared variable x be read:
v = x;
With #pragma omp atomic write you can atomically assign a new value to your shared variable x; the new-value expression (expr) must be independent of x:
x = expr;
With #pragma omp atomic update you can atomically update your shared variable x; specifically, you can only assign a new value as a binary operation (binop) between an x-independent expression and x:
x++;
x--;
++x;
--x;
x binop= expr;
x = x binop expr;
x = expr binop x;
With #pragma omp atomic capture you can atomically read and update your shared variable x (in either order); capture is in fact a combination of the read and update constructs:
You have short forms for update and then read:
v = ++x;
v = --x;
v = x binop= expr;
v = x = x binop expr;
v = x = expr binop x;
And their structured-block analogs:
{--x; v = x;}
{x--; v = x;}
{++x; v = x;}
{x++; v = x;}
{x binop= expr; v = x;}
{x = x binop expr; v = x;}
{x = expr binop x; v = x;}
And you have a few short forms for read and then update:
v = x++;
v = x--;
And again their structured-block analogs:
{v = x; x++;}
{v = x; ++x;}
{v = x; x--;}
{v = x; --x;}
And finally you have additional read-then-update forms, which exist only as structured blocks:
{v = x; x binop= expr;}
{v = x; x = x binop expr;}
{v = x; x = expr binop x;}
{v = x; x = expr;}
In the preceding expressions:
x and v are both l-value expressions with scalar type;
expr is an expression with scalar type;
binop is one of +, *, -, /, &, ^, |, << or >>;
binop, binop=, ++ and -- are not overloaded operators.
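Note that min is not among the supported binop operators, so an atomic minimum still has to be built another way. A minimal sketch using a C++11 std::atomic compare-exchange retry loop rather than an OpenMP construct (the helper name atomic_min is my own, not a standard API):

```cpp
#include <atomic>

// Hypothetical helper: atomically set target = min(target, value)
// via a compare-exchange retry loop.
void atomic_min(std::atomic<int>& target, int value) {
    int current = target.load(std::memory_order_relaxed);
    // Stop once the stored value is already <= value; otherwise retry
    // until the smaller value is installed. On failure,
    // compare_exchange_weak reloads `current` with the latest value.
    while (value < current &&
           !target.compare_exchange_weak(current, value)) {
    }
}
```

The loop terminates quickly in practice: each failed exchange means another thread installed a value, and the retry only continues while our candidate is still smaller.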
For a noise shader I'm looking for a pseudo-random number algorithm with a 3D vector argument, i.e.,
for every integer vector it returns a value in [0,1].
It should be as fast as possible, without producing visual artifacts, and give the same results on every GPU.
Two variants (pseudo code) I found are
rand1(vec3 (x,y,z)){
return xorshift32(x ^ xorshift32(y ^ xorshift32(z)));
}
which already uses 20 arithmetic operations and still has to be cast and normalized, and
rand2(vec3 v){
return fract(sin(dot(v, vec3(12.9898, 78.233, ?))) * 43758.5453);
};
which might be faster but uses sin, causing precision problems and different results on different GPUs.
Do you know any other algorithms requiring less arithmetic operations?
Thanks in advance.
Edit: Another standard PRNG is XORWOW, implemented in C (with the state made static so it persists across calls) as:
unsigned xorwow(void) {
    static unsigned x = 123456789, y = 362436069, z = 521288629,
                    w = 88675123, v = 5783321, d = 6615241;
    unsigned t = x ^ (x >> 2);
    x = y;
    y = z;
    z = w;
    w = v;
    v = (v ^ (v << 4)) ^ (t ^ (t << 1));
    return (d += 362437) + v;
}
Can we rewrite it to fit in our context?
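One direction worth considering, sketched below in C-style code for clarity rather than GLSL: collapse the three coordinates with a stateless multiply-xorshift mixer. Because it is pure integer math, it is deterministic across GPUs, unlike the sin-based variant. The constants and the helper name hash3 are illustrative, not taken from any of the generators above:

```cpp
#include <cstdint>

// Illustrative stateless 3D integer hash: mix the coordinates with
// multiply and xorshift rounds, then scale the 32-bit result into [0, 1].
// The multiplier constants are arbitrary odd numbers, not canonical.
float hash3(uint32_t x, uint32_t y, uint32_t z) {
    uint32_t h = x * 0x8da6b343u ^ y * 0xd8163841u ^ z * 0xcb1ab31fu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h * (1.0f / 4294967296.0f);  // normalize to [0, 1]
}
```

This uses far fewer operations than three chained xorshift32 calls, needs no state, and the normalization is a single multiply.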
I am using Eigen on Google Cloud Platform, on an Ubuntu machine. I have installed gcc-7 and I am trying to run a C++ program containing a huge matrix (1000x1000) of doubles.
The code runs fine on Xcode 11.3 with clang, but on the Ubuntu machine I am having a problem with a variable type conversion.
The part of the code which is causing problems:
Matrix<double, Dynamic, Dynamic> Mat(double x, double y, double En, int Np, double Kmax){
    Matrix<double, Dynamic, Dynamic> M(Np, Np);
    double eps = (double)Kmax / (double)Np;
    double result, error;
    struct params ps;
    ps.x = x;
    ps.y = y;
    ps.En = En;
    gsl_set_error_handler_off();
    gsl_integration_workspace * w = gsl_integration_workspace_alloc (iters);
    gsl_function F;
    F.function = &intDeltFunc;
    F.params = &ps;
    gsl_integration_qags (&F, 0, y, 0, prec, iters, w, &result, &error);
    gsl_integration_workspace_free (w);
    for ( int i = 0; i < Np ; i++){
        for ( int j = 0; j < Np ; j++){
            double p = eps * i + y;
            double q = eps * j + y;
            if ( (j == 0) || (j == Np - 1) ){
                M(i , j) = Diag(p, q, x, y, En) + eps * Integral(p, q, x, y, En, result) / 2.0;
            }
            else{
                M(i , j) = Diag(p, q, x, y, En) + eps * Integral(p, q, x, y, En, result);
            }
        }
    }
    return M;
}
Here Np = 1000.
The problem encountered with ubuntu:
main.cpp:247:26: error: no viable overloaded '='
M(i , j) = DiagTerm + 0.5 * eps * IntegralTerm;
~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/include/Eigen/src/Core/MatrixBase.h:139:14: note: candidate function not viable: no known conversion
from 'double' to 'const Eigen::MatrixBase<Eigen::IndexedView<Eigen::Matrix<double, -1, -1, 0, -1, -1>,
double, double> >' for 1st argument
Derived& operator=(const MatrixBase& other);
I have tried re-downloading Eigen multiple times, pointing the compiler to its exact location when compiling, and also using clang (which Xcode uses), and I still have the problem.
Do you have any ideas what might be the problem?
final Stream<Integer> numbers = Stream.of(5, 3, 2, 7, 3, 13, 7).parallel();
Why is the output of the following line 7?
numbers.reduce(1, (a, b) -> a + b, (x, y) -> x - y);
I have not looked at that link from the comments, but the documentation is pretty clear about identity and it even provides a simple way of testing that:
The identity value must be an identity for the combiner function. This means that for all u, combiner(identity, u) is equal to u
So let's simplify your example a bit:
Stream<Integer> numbers = Stream.of(3, 1).parallel();
BiFunction<Integer, Integer, Integer> accumulator = (a, b) -> a + b;
BinaryOperator<Integer> combiner = (x, y) -> x - y;
int result = numbers.reduce(
        1,
        accumulator,
        combiner);
System.out.println(result);
Let's say that u = 3 (just a random element from the Stream); thus:
int identity = 1;
int u = 3;
int toTest = combiner.apply(identity, u);
System.out.println(toTest == u); // must be true, but is false
Even if you replaced the identity with zero so that this first condition held, the documentation imposes another requirement:
Additionally, combiner function must be compatible with the accumulator function; for all u and t, the following must hold:
combiner.apply(u, accumulator.apply(identity, t)) == accumulator.apply(u, t)
You can make the same test:
int identity = 0;
int u = 3;
int t = 1;
boolean compatibilityRespected =
        combiner.apply(u, accumulator.apply(identity, t)) == accumulator.apply(u, t);
System.out.println(compatibilityRespected); // prints false
Originally this post requested an inverse sheep-and-goats operation, but I realized that was more than I really needed, so I edited the title; I only need an expand-right algorithm, which is simpler. The example described below is still relevant.
Original Post:
I'm trying to figure out how to do either an inverse sheep-and-goats operation or, even better, an expand-right-flip.
According to Hacker's Delight, a sheep-and-goats operation can be represented by:
SAG(x, m) = compress_left(x, m) | compress(x, ~m)
According to this site, the inverse can be found by:
INV_SAG(x, m, sw) = expand_left(x, ~m, sw) | expand_right(x, m, sw)
However, I can't find any code for the expand_left and expand_right functions. They are, of course, the inverses of compress, but compress is hard enough to understand by itself.
Example:
To better explain what I'm looking for, consider a set of 8 bits like:
0000abcd
The variables a, b, c and d may be either ones or zeros. In addition, there is a mask which repositions the bits. So for example, if the mask were 01100101, the resulting bits would be repositioned as follows:
0ab00c0d
This can be done with an inverse sheep-and-goats operation. However, according to this section of the site mentioned above, there is a more efficient way, which the author refers to as the expand-right-flip. Looking at the site, I was unable to figure out how that can be done.
Here's expand_right from Hacker's Delight; the book just calls it expand, but it's the right-expanding version.
unsigned expand(unsigned x, unsigned m) {
    unsigned m0, mk, mp, mv, t;
    unsigned array[5];
    int i;

    m0 = m;                 // Save original mask.
    mk = ~m << 1;           // We will count 0's to right.
    for (i = 0; i < 5; i++) {
        mp = mk ^ (mk << 1);            // Parallel suffix.
        mp = mp ^ (mp << 2);
        mp = mp ^ (mp << 4);
        mp = mp ^ (mp << 8);
        mp = mp ^ (mp << 16);
        mv = mp & m;                    // Bits to move.
        array[i] = mv;
        m = (m ^ mv) | (mv >> (1 << i));  // Compress m.
        mk = mk & ~mp;
    }
    for (i = 4; i >= 0; i--) {
        mv = array[i];
        t = x << (1 << i);
        x = (x & ~mv) | (t & mv);
    }
    return x & m0;          // Clear out extraneous bits.
}
You can use expand_left(x, m) == expand_right(x >> (32 - popcnt(m)), m) to make the left version, but that's probably not the best way.
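To check the routine against the 0000abcd example, here is a straightforward bit-at-a-time reference version of expand-right (the name expand_right_ref is my own); it deposits the low bits of x into the set positions of the mask from the LSB upward, which is also exactly what the x86 BMI2 PDEP instruction does in hardware:

```cpp
// Reference expand-right: deposit the low bits of x into the set
// positions of mask, lowest set position first.
unsigned expand_right_ref(unsigned x, unsigned mask) {
    unsigned result = 0;
    for (unsigned bit = 1; mask != 0; bit <<= 1) {
        unsigned lowest = mask & (0u - mask);  // isolate lowest set mask bit
        if (x & bit)
            result |= lowest;                  // deposit the next bit of x
        mask &= mask - 1;                      // clear that mask bit
    }
    return result;
}
```

With x = 0b00001011 (a=1, b=0, c=1, d=1) and mask = 0b01100101, this yields 0b01000101, i.e. 0ab00c0d as in the example above.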
How do you swap two numbers in place without using any additional space?
You can do it using the XOR operator:
if( x != y) { // this check is very important.
x ^= y;
y ^= x;
x ^= y;
}
EDIT:
Without the additional check, the above logic fails when swapping a variable with itself: it zeroes the value instead of leaving it unchanged.
Example:
int x = 10;
If I apply the above logic to swap x with itself without the check, I end up with x = 0, which is incorrect.
Similarly, if I put the logic (without the check) in a function and call it to swap two references to the same variable, it fails.
If you have two variables a and b (each occupying its own memory address):
a = a xor b
b = a xor b
a = a xor b
There are also some other variations of this problem, but they fail on overflow (and the multiply/divide variant additionally fails when a value is zero or when integer division truncates):
a=a+b
b=a-b
a=a-b
a=a*b
b=a/b
a=a/b
The plus and minus variation may work if you have custom types that have + and - operators that make sense.
Note: to avoid confusion, if you have only one variable and two references or pointers to it, then all of the above will fail; a check should be made to avoid this.
Unlike what a lot of people are saying, it does not matter whether the two numbers are different. It only matters that you have two distinct variables, i.e., that the values live at two different memory addresses.
I.e. this is perfectly valid:
int a = 3;
int b = 3;
a = a ^ b;
b = a ^ b;
a = a ^ b;
assert(a == b);
assert(a == 3);
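The aliasing pitfall is easy to demonstrate in C++ when the swap goes through pointers; this sketch (the function name xor_swap is mine) adds the guard described above:

```cpp
// XOR swap through pointers. Without the a == b guard, swapping a
// variable with itself would zero it, because *a ^= *b clears the value
// when both pointers refer to the same object.
void xor_swap(int* a, int* b) {
    if (a == b)
        return;  // same memory location: nothing to do
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}
```

The guard compares addresses, not values, which is exactly the distinction made above: equal values in distinct variables are fine, while two pointers to the same variable are not.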
The XOR trick is the standard answer:
int x = 1, y = 2;
x ^= y;
y ^= x;
x ^= y;
XORing is considerably less clear than just using a temp, though, and it fails if x and y are the same location.
Since no language was mentioned, in Python:
y, x = x, y