I want to parallelize function S and lock every node but I keep getting core dump. I'm trying to use a lock in every node of the graph. It will work if I use a single lock on my nodes.
for (l = 0; l < n; l++)
omp_init_lock(&(lock[l]));
#pragma omp parallel for num_threads(16)default(none) private(v)shared(n,Xof,lock)
for(v = 0; v < n; v++) {
omp_set_lock(&(lock[v]));
if(Xof[v] == NYC)
{
S(v);
}
omp_unset_lock(&(lock[v]));
}
It seems like the most likely cause is that there is a data race (perhaps with some embedded state) within S.
A way to test this would be use n locks but only one thread. If this does not cause a core dump, then S most likely has a data race.
If it does cause a core dump, then your initial hypothesis that is due to the locks might be true. From the code visible, I don't see anything clearly wrong. I might delete the shared clause since it is redundant, but I don't expect that to have an impact.
Related
I'm using OpenMP target offloading do offload some nested loops to the gpu. I'm using the nowait to tun it asynchronous. This makes it a task. With the same input values the result differs from the one when not offloading (e.g. cpu: sum=0.99, offloading sum=0.5).
When removing the nowait clause it works just fine. So I think the issue is that it becomes an OpenMP task and I'm struggling getting it right.
#pragma omp target teams distribute parallel for reduction( +: sum) collapse(2) nowait depend(in: a, b) depend(out: sum)
for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= n; j++)
{
double c = 0;
for (int k = 0; k < n; k++)
{
c += /* some computation */
}
sum += fabs(c);
}
}
The OpenMP 5.2 specification states:
The target construct generates a target task. The generated task region encloses the target region. If a depend clause is present, it is associated with the target task. [...]. If the nowait clause is present, execution of the target task may be deferred. If the nowait clause is not present, the target task is an included task.
This means that your code is executed in a task with a possibly deferred execution (with nowait). Thus, it can be executed at the end of the parallel in the worst case, but always before all the dependent tasks and taskwait directives waiting for the target task (or the ones including a similar behaviour like taskgroup). Because of that, you need not to modify the working arrays (nor release them) during this time span. If you do, the behaviour is undefined.
You should especially pay attention to the correctness of synchronization points and task dependencies in your code (it is impossible for us to check that with the current incomplete provided code).
I'm using OMP to try to get some speedup in a small kernel. It's basically just querying a vector of unordered_sets for membership. I tried to make an optimization, but surprisingly I got a slowdown, and am really curious why.
My first pass was:
vector<unordered_set<uint16_t> > setList = getData();
#pragma omp parallel for default(shared) private(i, j) schedule(dynamic, 50)
for(i = 0; i < size; i++){
for(j = 0; j < 500; j++){
count = count + setList[i].count(val[j]);
}
}
Then I thought I could maybe get a speedup by moving the setList[i] sub expression up one level of nesting and save it in a temp variable, by doing the following:
#pragma omp parallel for default(shared) private(i, j, currSet) schedule(dynamic, 50)
for(i = 0; i < size; i++){
currSet = setList[i];
for(j = 0; j < 500; j++){
count = count + currSet.count(val[j]);
}
}
I had thought this would maybe save a load each iteration of the "j" for loop, and get a speedup, but it actually SLOWED DOWN by about 3x. By this I mean the entire kernel took about 3 times as long to run. Thoughts on why this would occur?
Thanks!
Adding up a few integers is really not enough work to warrant starting threads for.
If you forget to add the reduction clause, you'll suffer from true sharing - all threads want to update that count variable at the same time. This makes all cores fight for the cache line containing tha variable, which will considerably impact your performance.
I just noticed that you set the schedule to be dynamic. You shouldn't. This workload can be divided at compile time already. So don't specify a schedule.
As has already been stated inter-loop dependencies, i.e. threads waiting for data from other threads, or data being accessed by multiple threads successively, can cause a paralleled program to experience slow down and should be avoided as a rule of thumb. Built in functions like reductions can collect individual results and compile them together in an optimised fashion.
Here is a good example of reduction being used in a similar case to yours from the university of Utah
int array[8] = { 1, 1, 1, 1, 1, 1, 1, 1};
int sum = 0, i;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < 8; i++) {
sum += array[i];
}
printf("total %d\n", sum);
source: http://www.eng.utah.edu/~cs4960-01/lecture9.pdf
as an aside: private variables need only be assigned when they are local variables inside a parallel region In both cases it is not necessary for i to be declared private.
see wikipedia: https://en.wikipedia.org/wiki/OpenMP#Data_sharing_attribute_clauses
Data sharing attribute clauses
shared: the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.
private: the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.
see stack exchange answer here: OpenMP: are local variables automatically private?
I have an observation about false sharing, please look this code:
struct foo {
int x;
int y;
};
static struct foo f;
/* The two following functions are running concurrently: */
int sum_a(void)
{
int s = 0;
int i;
for (i = 0; i < 1000000; ++i)
s += f.x;
return s;
}
void inc_b(void)
{
int i;
for (i = 0; i < 1000000; ++i)
++f.y;
}
Here, sum_a may need to continually re-read x from main memory
(instead of from cache) even though inc_b's concurrent modification of
y should be irrelevant.
By Oracle's website I read this explanation:
simultaneous updates of individual elements in the same cache line
coming from different processors invalidates entire cache lines, even
though these updates are logically independent of each other. Each
update of an individual element of a cache line marks the line as
invalid. Other processors accessing a different element in the same
line see the line marked as invalid. They are forced to fetch a more
recent copy of the line from memory or elsewhere, even though the
element accessed has not been modified. This is because cache
coherency is maintained on a cache-line basis, and not for individual
elements.As a result there will be an increase in interconnect traffic and overhead. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.
My observation is simple:
if what i read is true, how can be possible that exist the problem about "dirty read" ???
I am writing some code that would definitively benefit from trying to integrate openmp some software that I am writing. I am new to openmp, and while testing some very basic test code (see below) I noticed that the execution times are extremely longer with openmp activated (#pragma line). Any insight is much appreciated.
int main()
{
int number=200;
int max = 2000000;
for(int t=1; t<max; t++)
{
double fac = 0.0;
#pragma omp parallel for reduction(+:fac)
for(int n=2; n<=number; n++)
fac += 1;
}
return 0;
}
As currently written the code encounters the parallel region max times. The overhead of entering a parallel region in an OpenMP program is small, but you incur it 2000000 times. You don't actually tell us what the run times are, but I can readily believe that this makes the them extremely longer than the serial version. I suggest you wrap the outer loop in a parallel region, not the inner loop.
Take care when you rewrite your code to ensure that the payload inside the parallel region is significant, and returns some value(s) to the program outside the parallel region. Absent these steps a crafty optimising compiler can determine that a loop returns nothing to the rest of the program and simply optimise it away.
Also insert some timing instructions (use omp_get_wtime), rerun your code and, if matters are still not satisfactory, update your question with the new information you gather.
This is an improved code that actually works as intended. It basically wraps the outer loop, rather than the inner one. When compiled without openmp support it takes 1.49s, with openmp 0.48s.
int main()
{
int number=200;
int max = 2000000;
#pragma omp parallel for
for(int t=1; t<max; t++)
{
double fac = 0.0;
for(int n=2; n<=number; n++)
fac += 1;
}
return 0;
}
I am using intel's offload pragmas in host openMP code. The code looks as follows
int s1 = f(a,b,c);
#prama offload singnal(s1) in (...) out(x:len)
{
for (int i = 0; i < len; ++i)
{
x[i] = ...
}
}
#pragma omp parallel default(shared)
{
#pragma omp for schedule(dynamic) nowait
for (int i = 0; i < count; ++i)
{
/* code */
}
#pragma omp for schedule(dynamic)
for (int j = 0; j < count2; ++j)
{
/* code */
}
}
#pragma offload wait(s1)
{
/* code */
}
The code offload calculation of $x$ to MIC. The code keeps itself busy by assining some openMP to CPU cores. The above code works as expected. However, the first offload pragma takes a lot of time and has become the bottleneck. Nevertheless overall , it pays off to offload computation of $x$ to MIC. One way to potentially overcome this latency issue I'm trying is as follows
int s1 = f(a,b,c);
#pragma omp parallel default(shared)
{
#pragma omp single nowait
{
#prama offload singnal(s1) in (...) out(x:len)
{
for (int i = 0; i < len; ++i)
{
x[i] = ...
}
}
}
#pragma omp for schedule(dynamic) nowait
for (int i = 0; i < count; ++i)
{
/* code */
}
#pragma omp for schedule(dynamic)
for (int j = 0; j < count2; ++j)
{
/* code */
}
}
#pragma offload wait(s1)
{
/* code */
}
SO this new code, assigns a thread to do the offload while other openmp threads can be used for other worksharing constructs. However this code doesn't work. I get following error message
device 1 does not have a pending signal for wait(0x1)
Offload report points that above piece of code is the main culprit. One temporary work around is using a constant as signal i.e. signal(0), which works. However, I need a more permanent solution. Can anyone shade light on what is going wrong in my code.
Thanks
Let me complement Taylor's reply a bit.
The first offload indeed takes more time than subsequent offloads, because of the initialization stuff going on. Taylor sketched some of the things going on there. You can avoid the dummy offload by using the environment variable OFFLOAD_INIT=on_start. That should let the runtime system do all the initialization ahead of time. The overhead of this does not go away, but it moves from your first offload to the application initialization.
The problem with your second code snippet seems to be that your offloads target different devices. Signalling and waiting only works if the signal and wait happen for the same target device. Since you do not explicitly use the target(mic:0) clause with your offloads, chances are high that the runtime system selects different target devices.
One recommendation i would like to make is to not use plain integers for the signalling. Usually, the signal indicates that a certain buffer is ready. In these cases, it is good practice to use the buffer pointer as the signal handle, since it will be unique for concurrent offloads working with different buffers.
Cheers,
-michael
I can't comment on the 2nd code block. I have some observations about the first.
The first offload always takes a longer period of time since it also setups the offload infrastructure. This structure includes things such as passing environmental variables, copying over the mic implementation of libomp5, setting up the thread pool, etc.
The way to avoid this is to setup a dummy offload first, meaning it doesn't really do anything and is not part of your computation block.
An excellent set of references on optimizing for the xeon phi coprocessor is under the training tab at software.intel.com/mic-developer.
Also take a look at software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture, software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization, and software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization.
Sorry about the long URLs but stackoverflow doesn't allow me to include more than two links as I'm new.