I'm just learning OpenMP and trying to parallelize the following construct:
for (r = 0; r <= (int)(N/2); r++)
{
answer = ((answer % MOD) + (2 * (solve(N, r) % MOD)) % MOD) % MOD;
}
A reduction seems like the natural choice, but the modulo operator is not supported in a reduction clause.
I also tried using private variables, but the race conditions are still there:
#pragma omp parallel for private(temp)
for (r = 0; r <= (int)(N/2); r++)
{
temp = answer;
answer = ((temp % MOD) + (2 * (solve(N, r) % MOD)) % MOD) % MOD;
}
Is there any way to get this to work?
Thanks.
You can use the property that (x%a + y%a)%a = (x+y)%a. Then
answer = ((answer % MOD) + (2 * (solve(N, r) % MOD)) % MOD) % MOD;
is the same as
answer = (answer + 2 * solve(N, r) % MOD) % MOD;
Summed over all r, you can show this is the same as
(Sum over r of 2 * solve(N, r) % MOD) % MOD
Therefore, all you need to do is this
#pragma omp parallel for reduction(+:answer)
for (r = 0; r <= (int)(N/2); r++)
{
answer += 2 * solve(N, r) % MOD;
}
answer%=MOD;
This sum could overflow, because answer is only reduced modulo MOD after the loop. In that case you can do a custom reduction like this:
#pragma omp parallel
{
int answer_private = 0;
#pragma omp for nowait
for (int r = 0; r <= (int)(N/2); r++) {
answer_private = ((answer_private % MOD) + (2 * (solve(N, r) % MOD)) % MOD) % MOD;
}
#pragma omp critical
{
answer = (answer%MOD + answer_private%MOD)%MOD;
}
}
Since you are restricted by whatever is possible with OpenMP 2.0, your way to go is writing a custom reduction. The basic scheme is:
Before the parallel region, create an array for partial results. It should have at least as many elements as there are threads in the parallel region; you can size it with the omp_get_max_threads() function, which returns an upper bound. Initialize it with zeros (the identity element for summation).
Inside the parallel region, use the omp_get_thread_num() function to obtain the number of the current thread within the parallel region, and use it as the index into the above array. Accumulate the result in the corresponding array element.
After the region, use a serial loop to reduce the partial results accumulated in the array.
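The three steps above can be sketched as follows. This is a minimal illustration, not the asker's real code: solve() here is a hypothetical stand-in, and the helpers max_threads() and thread_index() are names invented for this sketch so that it also compiles when OpenMP is disabled. Each thread accumulates in a local variable and writes its slot once, which avoids neighbouring threads repeatedly writing to adjacent array elements.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

const long long MOD = 1000000007LL;

// Hypothetical stand-in for the question's solve(N, r); any pure function works.
long long solve(int N, int r) {
    return (static_cast<long long>(N) * r + r) % MOD;
}

// Helper names invented for this sketch; they fall back to one thread without OpenMP.
int max_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

int thread_index() {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

long long modular_sum(int N) {
    assert(N >= 0);
    // Step 1: one zero-initialized slot per possible thread.
    std::vector<long long> partial(max_threads(), 0);

    #pragma omp parallel
    {
        long long local = 0;  // declared inside the region, so private to each thread
        // Step 2: every thread accumulates its share of the iterations.
        #pragma omp for
        for (int r = 0; r <= N / 2; r++) {
            local = (local + 2 * (solve(N, r) % MOD)) % MOD;
        }
        partial[thread_index()] = local;
    }

    // Step 3: serial reduction of the per-thread partial results.
    long long answer = 0;
    for (std::size_t t = 0; t < partial.size(); t++) {
        answer = (answer + partial[t]) % MOD;
    }
    return answer;
}
```

Because modular addition is associative and commutative, the order in which the partial results are combined does not change the answer.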
This is the link to the problem: https://codeforces.com/problemset/problem/615/D
My code exceeds the time limit on test 40. I have thought about it for a long time without finding a good approach. Is there a good way to optimize it?
My code:
#include <iostream>
#include <map>
#include <vector>
using namespace std;
typedef long long ll;
ll mod = 1e9 + 7;
ll fast_mod(ll a, ll n, ll Mod)
{
ll ans=1;
a%=Mod;
while(n)
{
if(n&1) ans=(ans*a)%Mod;
a=(a*a)%Mod;
n>>=1;
}
return ans;
}
int main()
{
std::ios::sync_with_stdio(false);
std::cin.tie(0); // IO
ll m;
cin >> m;
ll num = 1ll;
map<ll, ll> count;
for(int i = 0; i < m; i++)
{
ll p;
cin >> p;
count[p]++;
}
ll res = 1ll;
vector<ll> a;
vector<ll> b;
for(auto it = count.begin(); it != count.end(); it++)
{
a.push_back(it -> first);
b.push_back(it -> second);
}
for(int i = 0; i < a.size(); i++)
{
ll x = a[i]; // a kind of prime
ll y = b[i]; // the count of the prime
ll tmp = fast_mod(x, y * (y + 1) / 2, mod); // x^1 * x^2 * x^3 *...*x^y
for(int j = 0; j < b.size(); j++) // calculate ( tmp)^((b[0] + 1)*(b[1] + 1)*...*(b[b.size() - 1] + 1)), here b.size() is the number of different primes
tmp = fast_mod(tmp, i != j ? (b[j] + 1) : 1, mod) % mod;
res = (res * tmp % mod);
}
cout << res << endl;
return 0;
}
First count how many times each distinct prime occurs. Let x be one of the distinct primes and y its count; compute x^1 * x^2 * ... * x^y and call the result tmp.
Then raise tmp to the product of (count + 1) over all the other primes: with tmp as the base, the exponent is (b[0] + 1) * (b[1] + 1) * ... * (b[b.size() - 1] + 1), skipping the factor that belongs to x itself.
The inner for loop splits this exponentiation into one step per prime.
Finally, res is multiplied by tmp^((b[0] + 1) * (b[1] + 1) * ... * (b[b.size() - 1] + 1)).
Another formula for the product of the divisors of N is N ** (D / 2), where D is the number of divisors, which can be found from your map count by taking the product of entry->second + 1 over every entry.
This does raise the question of what to do when D is odd, which it would be if N is a perfect square. In that case it is easy to compute sqrt(N) (the exponents would all be even, so you can halve them all and take the product of the primes to half of their original exponents), and then raise sqrt(N) to the power of D. Essentially this changes N ** (D / 2) into (N ** (1 / 2)) ** D.
For example if N = 2 * 3 * 2 = 12 (one of the examples), then D will be (2 + 1) * (1 + 1) = 6 and the product of divisors will be 12 ** (6 / 2) = 1728.
Computing N (or its square root) should be done modulo mod. Computing D should be done modulo mod - 1 (the totient of mod; mod is prime, so its totient is just one less). mod - 1 is even, so we cannot compute a modular multiplicative inverse of 2 to "divide" D by 2 that way. When N is a square then, AFAIK, we're really stuck with computing its square root (that's not so bad, but multiplying by a half would have been easier).
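A minimal sketch of the N ** (D / 2) idea described above, assuming every prime is smaller than mod so that exponents may be reduced modulo mod - 1. The function names power_mod and product_of_divisors are made up for this illustration. When N is not a perfect square, some count is odd, so one factor (count + 1) is even and can be halved before any reduction; when N is a perfect square, the base is sqrt(N) and the exponent is D itself.

```cpp
#include <cassert>
#include <map>

typedef long long ll;
const ll MOD = 1000000007LL;

// Exponentiation by squaring: base^exp mod m.
ll power_mod(ll base, ll exp, ll m) {
    assert(m > 1);
    ll result = 1;
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = result * base % m;
        base = base * base % m;
        exp >>= 1;
    }
    return result;
}

// Product of all divisors of N, where count maps each prime to its exponent in N.
ll product_of_divisors(const std::map<ll, ll>& count) {
    // N is a perfect square exactly when every exponent is even.
    bool square = true;
    for (std::map<ll, ll>::const_iterator it = count.begin(); it != count.end(); ++it) {
        if (it->second % 2 != 0) square = false;
    }

    ll base = 1;             // N mod MOD, or sqrt(N) mod MOD when N is a square
    ll exponent = 1;         // D / 2 mod (MOD - 1), or D mod (MOD - 1) when N is a square
    bool halved = square;    // for a square there is nothing to halve
    for (std::map<ll, ll>::const_iterator it = count.begin(); it != count.end(); ++it) {
        ll factor = it->second + 1;
        if (!halved && factor % 2 == 0) {
            factor /= 2;     // divide D by 2 exactly, before reducing anything
            halved = true;
        }
        exponent = exponent * (factor % (MOD - 1)) % (MOD - 1);
        ll e = square ? it->second / 2 : it->second;
        base = base * power_mod(it->first, e, MOD) % MOD;
    }
    return power_mod(base, exponent, MOD);
}
```

For the example N = 12 (counts 2 -> 2, 3 -> 1), the factor for the prime 3 is halved, the exponent becomes 3, and the result is 12 ** 3 = 1728. The whole computation needs one pass over the map instead of a nested loop.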
I'm currently trying to get my matrix-vector multiplication function to compare favorably with BLAS by combining #pragma omp for with #pragma omp simd, but I get no speedup over using the for construct alone. How do I properly vectorize the inner loop with OpenMP's SIMD construct?
vector dot(const matrix& A, const vector& x)
{
assert(A.shape(1) == x.size());
vector y = xt::zeros<double>({A.shape(0)});
int i, j;
#pragma omp parallel shared(A, x, y) private(i, j)
{
#pragma omp for // schedule(static)
for (i = 0; i < y.size(); i++) { // row major
#pragma omp simd
for (j = 0; j < x.size(); j++) {
y(i) += A(i, j) * x(j);
}
}
}
return y;
}
Your directive is incorrect because it introduces a race condition on y(i). You should use a reduction in this case. Here is an example:
vector dot(const matrix& A, const vector& x)
{
assert(A.shape(1) == x.size());
vector y = xt::zeros<double>({A.shape(0)});
int i, j;
#pragma omp parallel shared(A, x, y) private(i, j)
{
#pragma omp for // schedule(static)
for (i = 0; i < y.size(); i++) { // row major
double sum = 0.0; // plain value type; decltype(y(0)) would be a reference
#pragma omp simd reduction(+:sum)
for (j = 0; j < x.size(); j++) {
sum += A(i, j) * x(j);
}
y(i) += sum;
}
}
return y;
}
Note that this is not necessarily faster, because some compilers (ICC, for example) are able to vectorize the code automatically. GCC and Clang often fail to perform (advanced) SIMD reductions automatically, and such a directive helps them a bit. You can inspect the assembly code to see how the code is vectorized, or enable vectorization reports (with GCC, for example, via the -fopt-info-vec option).
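For readers without xtensor, the same pattern can be sketched with plain std::vector storage. This is an illustration under assumptions, not the answer's code: the function name matvec and the flat row-major layout (rows * cols doubles) are choices made for this sketch. The outer loop is shared between threads, and the inner loop is a SIMD reduction into a scalar that is written to y once per row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix-vector product y = A * x, with A stored as rows * cols doubles.
std::vector<double> matvec(const std::vector<double>& A,
                           const std::vector<double>& x,
                           std::size_t rows) {
    const std::size_t cols = x.size();
    assert(A.size() == rows * cols);
    std::vector<double> y(rows, 0.0);

    // Rows are independent, so they can be distributed across threads.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < static_cast<long>(rows); i++) {
        const double* row = A.data() + static_cast<std::size_t>(i) * cols;
        double sum = 0.0;  // scalar accumulator, private to this iteration
        // The reduction clause tells the compiler the lanes may be summed in any order.
        #pragma omp simd reduction(+:sum)
        for (std::size_t j = 0; j < cols; j++) {
            sum += row[j] * x[j];
        }
        y[i] = sum;
    }
    return y;
}
```

Compiled without OpenMP support the pragmas are ignored and the function still returns the same product, which makes it easy to compare timings with and without the directives.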
The problem is given as:
Output the answer of (A^1+A^2+A^3+...+A^K) modulo 1,000,000,007, where 1 ≤ A, K ≤ 10^9, and A and K are integers.
I am trying to write a program that computes this. I have tried using the formula for a geometric series and then applying the modulo to the answer. Since the result must be an integer as well, finding a modular inverse should not be required.
Below is the code I have now; it's in Pascal.
Var
a,k,i:longint;
power,sum: int64;
Begin
Readln(a,k);
power := 1;
For i := 1 to k do
power := ((power mod 1000000007) * a) mod 1000000007;
sum := a * (power-1) div (a-1);
Writeln(sum mod 1000000007);
End.
This task came from my school, they do not give away their test data to the students. Hence I do not know why or where my program is wrong. I only know that my program outputs the wrong answer for their test data.
If you want to do this without calculating a modular inverse, you can calculate it recursively using:
1 + A + A^2 + A^3 + ... + A^k
= 1 + (A + A^2)(1 + A^2 + (A^2)^2 + ... + (A^2)^(k/2-1))
That's for even k. For odd k:
1 + A + A^2 + A^3 + ... + A^k
= (1 + A)(1 + A^2 + (A^2)^2 + ... + (A^2)^((k-1)/2))
Since k is divided by 2 in each recursive call, the resulting algorithm has O(log k) complexity. In Java:
static int modSumAtoAk(int A, int k, int mod)
{
return (modSum1ToAk(A, k, mod) + mod-1) % mod;
}
static int modSum1ToAk(int A, int k, int mod)
{
long sum;
if (k < 5) {
//k is small -- just iterate
sum = 0;
long x = 1;
for (int i=0; i<=k; ++i) {
sum = (sum+x) % mod;
x = (x*A) % mod;
}
return (int)sum;
}
//k is big
int A2 = (int)( ((long)A)*A % mod );
if ((k%2)==0) {
// k even
sum = modSum1ToAk(A2, (k/2)-1, mod);
sum = (sum + sum*A) % mod;
sum = ((sum * A) + 1) % mod;
} else {
// k odd
sum = modSum1ToAk(A2, (k-1)/2, mod);
sum = (sum + sum*A) % mod;
}
return (int)sum;
}
Note that I've been very careful to make sure that each product is done in 64 bits, and to reduce by the modulus after each one.
With a little math, the above can be converted to an iterative version that doesn't require any storage:
static int modSumAtoAk(int A, int k, int mod)
{
// first, we calculate the sum of all 1... A^k
// we'll refer to that as SUM1 in comments below
long fac=1;
long add=0;
//INVARIANT: SUM1 = add + fac*(sum 1...A^k)
//this will remain true as we change k
while (k > 0) {
//above INVARIANT is true here, too
long newmul, newadd;
if ((k%2)==0) {
//k is even. sum 1...A^k = 1+A*(sum 1...A^(k-1))
newmul = A;
newadd = 1;
k-=1;
} else {
//k is odd.
newmul = A+1L;
newadd = 0;
A = (int)(((long)A) * A % mod);
k = (k-1)/2;
}
//SUM1 = add + fac * (newadd + newmul*(sum 1...A^k))
// = add+fac*newadd + fac*newmul*(sum 1...A^k)
add = (add+fac*newadd) % mod;
fac = (fac*newmul) % mod;
//INVARIANT is restored
}
// k == 0
long sum1 = fac + add;
return (int)((sum1 + mod -1) % mod);
}
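For comparison with the inverse-free versions above, here is a hedged C++ sketch of the closed form A * (A^K - 1) / (A - 1) evaluated with a modular inverse. The question's Pascal code goes wrong because integer division of an already reduced residue is not division in modular arithmetic (and its loop runs K times). The sketch assumes the modulus 1,000,000,007 is prime, so that the inverse of A - 1 can be taken as (A - 1)^(MOD - 2) by Fermat's little theorem; the function names are invented for this illustration.

```cpp
#include <cassert>

const long long MOD = 1000000007LL;

// Exponentiation by squaring: base^exp mod m, in O(log exp) multiplications.
long long power_mod(long long base, long long exp, long long m) {
    long long result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = result * base % m;
        base = base * base % m;
        exp >>= 1;
    }
    return result;
}

// A + A^2 + ... + A^K modulo MOD, via the geometric series formula.
long long geometric_sum_mod(long long A, long long K) {
    assert(A >= 1 && K >= 1);
    A %= MOD;
    if (A == 1) return K % MOD;  // ratio 1: the formula would divide by zero
    long long numerator = (power_mod(A, K, MOD) - 1 + MOD) % MOD;
    // Modular inverse of (A - 1); valid because MOD is prime and A - 1 is not 0 mod MOD.
    long long inverse = power_mod((A - 1 + MOD) % MOD, MOD - 2, MOD);
    return A * numerator % MOD * inverse % MOD;
}
```

For example, geometric_sum_mod(2, 3) evaluates 2 + 4 + 8 = 14. Every intermediate product stays below MOD squared, which fits in a 64-bit signed integer.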
I want to compute the average of an image (3 channels of interest + 1 alpha channel we ignore here) for each channel using SSE2 intrinsics. I tried that:
__m128 average = _mm_setzero_ps();
#pragma omp parallel for reduction(+:average)
for(size_t k = 0; k < roi_out->height * roi_out->width * ch; k += ch)
{
float *in = ((float *)temp) + k;
average += _mm_load_ps(in);
}
But I get this error with GCC: user-defined reduction not found for average.
Is that possible with SSE2? What's wrong?
Edit
This works:
float sum[4] = { 0.0f };
#pragma omp parallel for simd reduction(+:sum[:4])
for(size_t k = 0; k < roi_out->height * roi_out->width * ch; k += ch)
{
float *in = ((float *)temp) + k;
for (int i = 0; i < ch; ++i) sum[i] += in[i];
}
const __m128 average = _mm_loadu_ps(sum) / ((float)roi_out->height * roi_out->width); // unaligned load: sum is not guaranteed to be 16-byte aligned
You can declare a user-defined reduction like this:
#pragma omp declare reduction \
(addps:__m128:omp_out+=omp_in) \
initializer(omp_priv=_mm_setzero_ps())
And then use it like:
#pragma omp parallel for reduction(addps:average)
for(size_t k = 0; k < size * ch; k += ch)
{
average += _mm_loadu_ps(data+k);
}
I think the most important part is that OpenMP needs to know how to get a neutral element (here _mm_setzero_ps()) for your reduction.
Full working example: https://godbolt.org/z/Fpqttc
Interesting link: http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-reduction.html#User-definedreductions
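The same declare reduction pattern can be tried without SSE headers. The sketch below is an assumption-laden illustration: Float4 is a hypothetical four-lane struct standing in for __m128, and add4 and channel_sums are names invented here. The combiner uses the struct's operator+=, and the initializer supplies the neutral element, exactly the two pieces the answer above provides for __m128.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 4-lane accumulator standing in for __m128, so the sketch is portable.
struct Float4 {
    float v[4];
    Float4& operator+=(const Float4& other) {
        for (int i = 0; i < 4; ++i) v[i] += other.v[i];
        return *this;
    }
};

// Combiner: how two partial results are merged. Initializer: the neutral element (all zeros).
#pragma omp declare reduction(add4 : Float4 : omp_out += omp_in) \
    initializer(omp_priv = Float4())

// Sums interleaved 4-channel pixel data channel by channel.
Float4 channel_sums(const std::vector<float>& pixels) {
    assert(pixels.size() % 4 == 0);
    Float4 total = Float4();  // value-initialized to zeros
    const long count = static_cast<long>(pixels.size() / 4);

    #pragma omp parallel for reduction(add4 : total)
    for (long k = 0; k < count; ++k) {
        Float4 px = {{pixels[4 * k], pixels[4 * k + 1],
                      pixels[4 * k + 2], pixels[4 * k + 3]}};
        total += px;
    }
    return total;
}
```

Dividing each lane of the result by the pixel count then gives the per-channel average. Note that floating-point sums can differ slightly between runs, since the order in which threads combine their partial results is not fixed.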
Can someone give me an idea for an efficient algorithm for large n (say 10^10) to find the sum of the series 1^1 + 2^2 + ... + n^n modulo m?
My code is getting killed for n = 100000 and m = 200000:
#include<stdio.h>
int main() {
int n,m,i,j,sum,t;
scanf("%d%d",&n,&m);
sum=0;
for(i=1;i<=n;i++) {
t=1;
for(j=1;j<=i;j++)
t=((long long)t*i)%m;
sum=(sum+t)%m;
}
printf("%d\n",sum);
}
Two notes:
(a + b + c) % m
is equivalent to
(a % m + b % m + c % m) % m
and
(a * b * c) % m
is equivalent to
((a % m) * (b % m) * (c % m)) % m
As a result, you can calculate each term using a recursive function in O(log p):
int expmod(int n, int p, int m) {
if (p == 0) return 1;
int nm = n % m;
long long r = expmod(nm, p / 2, m);
r = (r * r) % m;
if (p % 2 == 0) return r;
return (r * nm) % m;
}
And sum elements using a for loop:
long long r = 0;
for (int i = 1; i <= n; ++i)
r = (r + expmod(i, i, m)) % m;
This algorithm is O(n log n).
I think you can use Euler's theorem to avoid some exponentiation, as phi(200000)=80000. The Chinese remainder theorem might also help, as it reduces the modulus.
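A hedged sketch of the Euler's theorem idea, using names invented for this illustration (euler_phi, power_mod, self_power_mod). Plain Euler reduction (exponent mod phi(m)) is only valid when the base is coprime to m, which is not true for every i here, so the sketch uses the generalized form: for an exponent e >= phi(m), a^e and a^(e mod phi(m) + phi(m)) are congruent modulo m for any base a. The saving is modest when exponentiation by squaring is already in use, since it only shortens the exponent.

```cpp
#include <cassert>

// Euler's totient by trial division; fine for m around 200000.
long long euler_phi(long long m) {
    assert(m >= 1);
    long long result = m;
    for (long long p = 2; p * p <= m; ++p) {
        if (m % p == 0) {
            while (m % p == 0) m /= p;
            result -= result / p;  // multiply by (1 - 1/p)
        }
    }
    if (m > 1) result -= result / m;  // one remaining prime factor
    return result;
}

// Exponentiation by squaring: base^exp mod m.
long long power_mod(long long base, long long exp, long long m) {
    long long result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = result * base % m;
        base = base * base % m;
        exp >>= 1;
    }
    return result;
}

// i^i mod m with the exponent shortened; phi must equal euler_phi(m).
long long self_power_mod(long long i, long long m, long long phi) {
    long long e = i;
    if (e >= phi) e = e % phi + phi;  // generalized Euler reduction, valid for any base
    return power_mod(i, e, m);
}
```

With m = 200000, euler_phi returns 80000, matching the value quoted above, so every exponent is kept below 160000 regardless of how large i grows.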
You may have a look at my answer to this post. The implementation there is slightly buggy, but the idea is there. The key strategy is to find x such that n^(x-1)<m and n^x>m and repeatedly reduce n^n%m to (n^x%m)^(n/x)*n^(n%x)%m. I am sure this strategy works.
I encountered a similar question recently: my 'n' is 1435, 'm' is 10^10. Here is my solution (C#):
ulong n = 1435, s = 0, mod = 0;
mod = ulong.Parse(Math.Pow(10, 10).ToString());
for (ulong i = 1; i <= n; i++)
{
ulong summand = i;
for (ulong j = 2; j <= i; j++)
{
summand *= i;
summand = summand % mod;
}
s += summand;
s = s % mod;
}
At the end, 's' is equal to the required number.
Are you getting killed here:
for(j=1;j<=i;j++)
t=((long long)t*i)%m;
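Exponentials mod m can be implemented using the square-and-multiply method (exponentiation by squaring), which needs only O(log n) multiplications:
long long n = 10000;
long long m = 20000;
long long sqr = n % m; // n^(2^k) mod m for the current bit k
long long bit = n; // remaining bits of the exponent
long long result = 1; // accumulates n^n mod m
while(bit > 0)
{
if(bit % 2 == 1)
{
result = (result * sqr) % m;
}
sqr = (sqr * sqr) % m;
bit >>= 1;
}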
I can't add a comment, but for the Chinese remainder theorem, see http://mathworld.wolfram.com/ChineseRemainderTheorem.html formulas (4)-(6).