Parallel Bellman-Ford implementation

Can anyone point me to good pseudocode for a simple parallel shortest-path algorithm? An implementation in any language would do, it doesn't matter. I'm having trouble finding good examples =[

I eventually implemented it myself for a bitcoin bot using OpenMP:
/* Chunk size: one loop iteration per dynamically scheduled chunk */
#define CHUNKSIZE 1
/* Fork the team of threads */
#pragma omp parallel
{
    /* Work-sharing construct: the iterations over u are divided among the threads */
    #pragma omp for schedule(dynamic, CHUNKSIZE)
    for (int u = 0; u < V - 1; u++) {
        if (dist[u] != INT_MAX) {
            /* The iterator is declared inside the loop, so every thread has its own copy */
            for (list<list_node>::iterator i = adj[u].begin(); i != adj[u].end(); ++i) {
                /* dist and pre are shared: updates to the same vertex are not synchronized here */
                if (dist[i->get_vertex()] > dist[u] + i->get_weight()) {
                    dist[i->get_vertex()] = dist[u] + i->get_weight();
                    pre[i->get_vertex()] = u;
                }
            }
        }
    }
}
If you want to look at my full implementation, you can view it as a Gist on my GitHub

Related

I have an algorithm for finding maximum flow. Does it have an author or name?

I'm a beginner in programming and I'm learning algorithms for finding maximum flows.
Most of them are rather difficult, such as Ford-Fulkerson, Edmonds-Karp and Dinitz. The problem is here: https://cses.fi/problemset/task/1694
I found an algorithm that finds the maximum flow with just one depth-first search, in O(n+m). What is the name or author of this algorithm? All the standard algorithms use many DFS or BFS passes, but this one uses a single DFS. I'm a bit confused.
#include<bits/stdc++.h>
using namespace std;
vector<vector<pair<int, long long>>> adj;
vector<bool> visited;
long long dfs(int to) {
long long r = 0;
visited[to] = true;
for (const auto& [from, flow]: adj[to]) {
if (from == 1 || visited[from]) {
r += flow;
} else {
r += min(flow, dfs(from));
}
}
return r;
}
int main() {
int n, m;
cin >> n >> m;
adj.resize(n + 1);
visited.resize(n + 1, false);
for (int i = 0; i < m; i++) {
int from, to, flow;
cin >> from >> to >> flow;
adj[to].push_back({from, flow});
}
cout << dfs(n);
}
Can you help me to understand why Dinitz or Edmonds-Karp, with O(m^2 n) complexity, are needed?

It breaks on a graph in which some vertex v is reached along two different paths, since the second visit to v will incorrectly assume that there is one unit of flow available to pull.

Difference between mutual exclusion like atomic and reduction in OpenMP

I am following Tim Mattson's video lectures on OpenMP, and one exercise was to find the errors in provided code that computes the area of the Mandelbrot set. Here is the solution that was provided:
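#include <stdio.h>
#define NPOINTS 1000
#define MAXITER 1000
struct d_complex{
    double r;
    double i;
};
/* The prototype comes after the struct definition so that the parameter type is known */
void testpoint(struct d_complex);
struct d_complex c;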
int numoutside = 0;
int main(){
int i,j;
double area, error, eps = 1.0e-5;
#pragma omp parallel for default(shared) private(c,j) firstprivate(eps)
for(i = 0; i<NPOINTS; i++){
for(j=0; j < NPOINTS; j++){
c.r = -2.0+2.5*(double)(i)/(double)(NPOINTS)+eps;
c.i = 1.125*(double)(j)/(double)(NPOINTS)+eps;
testpoint(c);
}
}
area=2.0*2.5*1.125*(double)(NPOINTS*NPOINTS-numoutside)/(double)(NPOINTS*NPOINTS);
error=area/(double)NPOINTS;
printf("Area of Mandlebrot set = %12.8f +/- %12.8f\n",area,error);
printf("Correct answer should be around 1.510659\n");
}
void testpoint(struct d_complex c){
// Does the iteration z=z*z+c, until |z| > 2 when point is known to be outside set
// If loop count reaches MAXITER, point is considered to be inside the set
struct d_complex z;
int iter;
double temp;
z=c;
for (iter=0; iter<MAXITER; iter++){
temp = (z.r*z.r)-(z.i*z.i)+c.r;
z.i = z.r*z.i*2+c.i;
z.r = temp;
if ((z.r*z.r+z.i*z.i)>4.0) {
#pragma omp atomic
numoutside++;
break;
}
}
}
The question I have is: could we use a reduction on the variable numoutside in the #pragma omp parallel for, like this:
#pragma omp parallel for default(shared) private(c,j) firstprivate(eps) reduction(+:numoutside)
without the atomic construct in the testpoint function?
I tested the function without atomic, and the result was different from the one I got in the first place. Why does that happen? And while I understand the concept of mutual exclusion and why it is used to prevent race conditions, isn't reduction just another way of solving that problem with private variables?
Thank you in advance.

OpenACC bitonic sort is much slower on GPU than on CPU

I have the following bit of code to sort double values on my GPU:
void bitonic_sort(double *data, int length) {
    #pragma acc data copy(data[0:length], length)
    {
        int i,j,k;
        /* k and j loops run sequentially; only the i loop is offloaded */
        for (k = 2; k <= length; k *= 2) {
            for (j=k >> 1; j > 0; j = j >> 1) {
                #pragma acc parallel loop gang worker vector independent
                for (i = 0; i < length; i++) {
                    int ixj = i ^ j;
                    if ((ixj) > i) {
                        /* ascending half: swap if out of order */
                        if ((i & k) == 0 && data[i] > data[ixj]) {
                            double buffer = data[i];
                            data[i] = data[ixj];
                            data[ixj] = buffer;
                        }
                        /* descending half: swap if out of order */
                        if ((i & k) != 0 && data[i] < data[ixj]) {
                            double buffer = data[i];
                            data[i] = data[ixj];
                            data[ixj] = buffer;
                        }
                    }
                }
            }
        }
    }
}
This is a bit slower on my GPU than on my CPU. I'm using GCC 6.1. I can't figure out how to run the whole code on my GPU. So far, only the parallel loop is executed on the GPU, and execution switches between CPU and GPU for each iteration of the outer loops.
I'd like to run the whole content of the function on the GPU, but I can't figure out how. One major problem for me now is that the GCC implementation currently doesn't allow nested parallelism, so I can't use a parallel construct inside another parallel construct. Is there any way to get around that?
I've tried putting a kernels construct on top of the first loop but that slows it down by a factor of about 10. If I use a parallel construct above the first loop instead, the result isn't sorted any more, which makes sense. The two outer loops need to be executed sequentially for the algorithm to work.
If you have any other suggestions on how I could improve performance, I would be grateful as well.

Non-intersecting chords on a circle

I'm trying to implement the following task. We have 2*n points on a circle, so we can draw n chords between them. Print all ways to draw n non-intersecting chords.
For example, with 6 points (n = 3) we can draw (1->2, 3->4, 5->6), (1->2, 3->6, 4->5), (1->4, 2->3, 5->6), (1->6, 2->3, 4->5), (1->6, 2->5, 3->4)
I've developed a recursive algorithm that creates a chord from 1 to 2, 4 or 6 and generates the answers for the 2 remaining intervals. But I know there is a more efficient non-recursive way, maybe by implementing a NextSeq function.
Does anyone have any ideas?
UPD: I do cache intermediate results, but what I really want is a GenerateNextSeq() function that produces the next sequence from the previous one and so generates all such combinations.
This is my code, by the way:
struct SimpleHash {
size_t operator()(const std::pair<int, int>& p) const {
return p.first ^ p.second;
}
};
struct Chord {
int p1, p2;
Chord(int x, int y) : p1(x), p2(y) {};
};
void MergeResults(const vector<vector<Chord>>& res1, const vector<vector<Chord>>& res2, vector<vector<Chord>>& res) {
res.clear();
if (res2.empty()) {
res = res1;
return;
}
for (int i = 0; i < res1.size(); i++) {
for (int k = 0; k < res2.size(); k++) {
vector<Chord> cur;
for (int j = 0; j < res1[i].size(); j++) {
cur.push_back(res1[i][j]);
}
for (int j = 0; j < res2[k].size(); j++) {
cur.push_back(res2[k][j]);
}
res.emplace_back(cur);
}
}
}
int rec = 0;     // number of non-trivial calls
int cached = 0;  // number of cache hits
void allChordsH(vector<vector<Chord>>& res, int st, int end, unordered_map<pair<int, int>, vector<vector<Chord>>, SimpleHash>& cach) {
    res.clear();
    if (st >= end)
        return;
    rec++;
    auto it = cach.find({st, end});
    if (it != cach.end()) {
        cached++;
        res = it->second;
        return;
    }
    vector<vector<Chord>> res1, res2, res3, curRes;
    for (int i = st+1; i <= end; i += 2) {
        res1 = {{Chord(st, i)}};
        allChordsH(res2, st+1, i-1, cach);
        allChordsH(res3, i+1, end, cach);
        MergeResults(res1, res2, curRes);
        MergeResults(curRes, res3, res1);
        for (size_t k = 0; k < res1.size(); k++) {
            res.push_back(res1[k]);
        }
    }
    // Cache the complete result for [st, end], not only the results of the last split
    cach[{st, end}] = res;
}
void allChords(vector<vector<Chord>>& res, int n) {
res.clear();
unordered_map<pair<int, int>, vector<vector<Chord>>, SimpleHash> cach; // intrval => result
allChordsH(res, 1, n, cach);
return;
}

Use dynamic programming. That is, cache partial results.
Basically, start from 1 chord, compute all answers and add them to cache.
Then take 2 chords, compute all answers using the cache whenever you can.
Etc.
The plain recursive way is O(n!) (at least n!, I'm bad at complexity calculations).
This way takes n/2-1 operations per step over n steps, therefore O(n^2) combination steps, which is much better. However, this solution is memory-hungry, as it has to hold all the combinations in the cache: 15 chords easily use 1GB of memory (Java solution).
Example solution:
https://ideone.com/g81zP9
Completes 12 chord computation in ~306ms.
Given 1GB of RAM it computes 15 chords in ~8sec.
The cache is saved in a specific format to optimize performance: the number stored at each array position says how many positions further along the other end of the chord is. For example, [1,0,3,1,0,0] means:
1  0  3  1  0  0
|--|  |  |--|  |
      |--------|
You can transform it in a separate step to whatever format you want.

parallelizing in openMP

I have the following code that I want to parallelize using OpenMP:
for(m=0; m<r_c; m++)
{
for(n=0; n<c_c; n++)
{
double value = 0.0;
for(j=0; j<r_b; j++)
for(k=0; k<c_b; k++)
{
double a;
if((m-j)<0 || (n-k)<0 || (m-j)>=r_a || (n-k)>=c_a)
a = 0.0;
else
a = h_a[((m-j)*c_a) + (n-k)];
//printf("%lf\t", a);
value += h_b[(j*c_b) + k] * a;
}
h_c[m*c_c + n] = value;
//printf("%lf\t", h_c[m*c_c + n]);
}
//cout<<"row "<<m<<" completed"<<endl;
}
Here I want every thread to execute the "for j" and "for k" loops simultaneously.
I tried putting #pragma omp parallel for before the "for m" loop, but I'm not getting the correct result.
How can I do this in an optimized manner? Thanks in advance.

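Depending on exactly which loop you want to parallelize, you have three options. They are alternatives: enable only one of the pragmas marked Option #1 and Option #2 at a time, since work-sharing constructs may not be nested inside each other within the same parallel region.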
#pragma omp parallel
{
#pragma omp for // Option #1
for(m=0; m<r_c; m++)
{
for(n=0; n<c_c; n++)
{
double value = 0.0;
#pragma omp for // Option #2
for(j=0; j<r_b; j++)
for(k=0; k<c_b; k++)
{
double a;
if((m-j)<0 || (n-k)<0 || (m-j)>=r_a || (n-k)>=c_a)
a = 0.0;
else
a = h_a[((m-j)*c_a) + (n-k)];
//printf("%lf\t", a);
value += h_b[(j*c_b) + k] * a;
}
h_c[m*c_c + n] = value;
//printf("%lf\t", h_c[m*c_c + n]);
}
//cout<<"row "<<m<<" completed"<<endl;
}
}
//////////////////////////////////////////////////////////////////////////
// Option #3
for(m=0; m<r_c; m++)
{
    for(n=0; n<c_c; n++)
    {
        double value = 0.0;
        // All threads accumulate into value, so it needs a reduction;
        // the inner index k must be private to each thread
        #pragma omp parallel for private(k) reduction(+:value)
        for(j=0; j<r_b; j++)
            for(k=0; k<c_b; k++)
            {
                double a;
                if((m-j)<0 || (n-k)<0 || (m-j)>=r_a || (n-k)>=c_a)
                    a = 0.0;
                else
                    a = h_a[((m-j)*c_a) + (n-k)];
                //printf("%lf\t", a);
                value += h_b[(j*c_b) + k] * a;
            }
        h_c[m*c_c + n] = value;
        //printf("%lf\t", h_c[m*c_c + n]);
    }
    //cout<<"row "<<m<<" completed"<<endl;
}
Test and profile each. You might find that option #1 is fastest if there isn't a lot of work for each thread, or you may find that with optimizations on, there is no difference (or even a slowdown) when enabling OMP.
Edit
I've adapted the MCVE supplied in the comments as follows:
#include <iostream>
#include <chrono>
#include <cstdlib> // std::rand
#include <omp.h>
#include <algorithm>
#include <vector>
#define W_OMP
int main(int argc, char *argv[])
{
std::vector<double> h_a(9);
std::generate(h_a.begin(), h_a.end(), std::rand);
int r_b = 500;
int c_b = r_b;
std::vector<double> h_b(r_b * c_b);
std::generate(h_b.begin(), h_b.end(), std::rand);
int r_c = 500;
int c_c = r_c;
int r_a = 3, c_a = 3;
std::vector<double> h_c(r_c * c_c);
auto start = std::chrono::system_clock::now();
#ifdef W_OMP
#pragma omp parallel
{
#endif
int m,n,j,k;
#ifdef W_OMP
#pragma omp for
#endif
for(m=0; m<r_c; m++)
{
for(n=0; n<c_c; n++)
{
double value = 0.0,a;
for(j=0; j<r_b; j++)
{
for(k=0; k<c_b; k++)
{
if((m-j)<0 || (n-k)<0 || (m-j)>=r_a || (n-k)>=c_a)
a = 0.0;
else a = h_a[((m-j)*c_a) + (n-k)];
value += h_b[(j*c_b) + k] * a;
}
}
h_c[m*c_c + n] = value;
}
}
#ifdef W_OMP
}
#endif
auto end = std::chrono::system_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << elapsed.count() << "ms"
#ifdef W_OMP
"\t with OMP"
#else
"\t without OMP"
#endif
"\n";
return 0;
}
As a reference, I'm using VS2012 (OMP 2.0, grrr). The collapse clause only arrived in OpenMP 3.0, so I can't use it here. Optimizations were /O2, compiled in Release x64.
Benchmarks
Using the original loop sizes (7,7,5,5), and therefore array sizes, the results were 0ms without OMP and 1ms with. Verdict: the compiler optimizations did better, and the added overhead wasn't worth it. Also, the measurements are not reliable (too short).
Using slightly larger loop sizes (100, 100, 100, 100), the results were about equal at about 108ms. Verdict: still not worth the naive effort; tweaking OMP parameters might tip the scale. Definitely not the x4 speedup I would hope for.
Using even larger loop sizes (500, 500, 500, 500), OMP started to pull ahead: 74.3s without OMP, 15s with. Verdict: worth it. Weird. I got a x5 speedup with four threads and four cores on an i5. I'm not going to try and figure out how that happened.
Summary
As has been stated in countless answers here on SO, it's not always a good idea to parallelize every for loop you come across. Things that can screw up your desired xN speedup:
Not enough work per thread to justify the overhead of creating the additional threads
The work itself is memory bound. This means that the CPU can be running at 1petaHz and you still won't see a speedup.
Memory access patterns. I'm not going to go there. Feel free to edit in the relevant info if you want it.
OMP parameters. The best choice of parameters will often be a result of this entire list (not including this item, to avoid recursion issues).
SIMD operations. Depending on what and how you're doing, the compiler may vectorize your operations. I have no idea if OMP will usurp the SIMD operations, but it is possible. Check your assembly (foreign language to me) to confirm.
