OpenMP on Intel i7

I have a problem with OpenMP on an i7 CPU.
I used OpenMP just to parallelize a 'for' loop. The algorithm has run on several different PCs without any problem. Recently we tried to run it on an i7 system and hit a problem there. The software usually runs for some time, and after several cycles it reports "not enough memory". We looked for a memory leak, but instead found that the stack usage of the process was far too big - there were a lot of 1 MB thread stacks that were never released. Somehow the threads created by OpenMP were all left behind, and memory filled up with their stacks.
Has anyone ever experienced such behavior? The code is very simple, just a
'#pragma omp parallel for'
around a loop, and it works fine on other PCs.
I am using the Microsoft Visual C++ 9.0 compiler with its built-in OpenMP library.
Thank you
Sergei

Thank you for the answers. I figured out that when OpenMP starts a parallelized loop it creates several threads, which do not stop at the end of the loop but are reused for the next parallelized loop. On the i7, however, they are not reused - a new set of threads is created for each parallelized loop, so the stack usage grows steadily by 1 MB per thread.
I also tried to write a very simple application that just uses OpenMP to parallelize several loops, and I did not observe any problem with it on the i7. It looks like there is some condition in the main software that triggers this behavior with parallelization. Trying to find out more...

You could try the Intel Threading Building Blocks (TBB) library - it is very similar to OpenMP and makes it just as easy to parallelize a for loop in the manner you described - to see if there is any difference.
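For illustration, here is a minimal sketch of the TBB equivalent of such a loop, using tbb::parallel_for with the lambda-based overload (this assumes a C++11 compiler; with Visual C++ 9.0 you would pass a functor object instead of a lambda):
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>

int main()
{
    std::vector<double> data(100000, 1.0);
    // Roughly equivalent to "#pragma omp parallel for" over the index range.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, data.size()),
        [&](const tbb::blocked_range<size_t> &r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= 2.0;
        });
    return 0;
}
TBB takes its worker threads from an internal pool, so repeated calls do not keep creating new threads.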

That sounds like an OS issue rather than an application problem. I assume the compiler generates the same assembly for the same code. If you have an old hyper-threaded CPU available, you could try your code there and see if the same problem happens.

Since I can't see your code, I'll try to guess...
To me that sounds a bit like a problem with nested loops when using #pragma omp for.
If you have nested loops, you have to declare the counter variables of the inner loops as private.
Take a look at this sample:
#pragma omp for private(j)
for (i = 0; i < 100; i++)
{
    for (j = 0; j < 10; j++)
    {
        A[i] = A[i] * 2;
    }
}
The variable j is made private so that every thread gets its own instance of it, rather than all threads sharing the same instance.
Check this in your code; maybe that's the problem.
And (your compiler should tell you this) don't use break; in your parallelized loops. That won't work.
Good luck!

Related

What happens when multiple GPU threads in a single warp/wave attempt to write to the same shared memory location?

I've been learning about parallel/GPU programming a lot recently, and I've encountered a situation that's stumped me. What happens when two threads in a warp/wave attempt to write to the same exact location in shared memory? Specifically, I'm confused as to how this can occur when warp threads each execute the exact same instruction at the same time (to my understanding).
For instance, say you dispatch a shader that runs 32 threads, the size of a normal non-AMD warp. Assuming no dynamic branching (which as I understand, will normally call up a second warp to execute the branched code? I could be very wrong about that), what happens if we have every single thread try to write to a single location in shared memory?
Though I believe my question applies to any kind of GPU code, here's a simple example in HLSL:
groupshared uint test_target;
#pragma kernel WarpWriteTest
[numthreads(32, 1, 1)]
void WarpWriteTest(uint thread_id : SV_GroupIndex) {
    test_target = thread_id;
}
I understand this is almost certainly implementation-specific, but I'm just curious what would generally happen in a situation like this. Obviously, you'd end up with an unpredictable value stored in test_target, but what I'm really curious about is what happens on a hardware level. Does the entire warp have to wait until every write is complete, at which point it will continue executing code in lockstep (and would this result in noticeable latency)? Or is there some other mechanism to GPU shared memory/cache that I'm not understanding?
Let me clarify: I'm not asking what happens when multiple threads try to access a value in global memory/DRAM - I'd be curious to know, but my question specifically concerns the shared memory within a threadgroup. I also apologize if this information is readily available somewhere else - as anyone reading might know, GPU terminology in general can be very nebulous and non-standardized, so I've had difficulty even knowing what I should be looking for.
Thank you so much!

Altera OpenCL parallel execution in FPGA

I have been looking into Altera OpenCL for a little while, to improve heavy computation programs by moving the computation part to an FPGA. I managed to run the vector addition example provided by Altera and it seems to work fine. I've looked at the documentation for Altera OpenCL and learned that it uses pipelined parallelism to improve performance.
I was wondering whether it is possible, using Altera OpenCL on an FPGA, to achieve parallel execution similar to multiple VHDL processes executing in parallel - for example, launching multiple kernels on one device that execute in parallel. Is that possible? How do I check whether it is supported? Any help would be appreciated.
Thanks!
The quick answer is YES.
According to the Altera OpenCL guides, there are generally two ways to achieve this:
1/ SIMD for vectorised data load/store
2/ replicate the compute resources on the device
For 1/, use the num_simd_work_items and reqd_work_group_size kernel attributes; multiple work-items from the same work-group will then run at the same time (see the sketch below).
For 2/, use the num_compute_units kernel attribute; multiple work-groups will then run at the same time.
Please develop a single work-item kernel first, then use 1/ to improve the kernel performance; 2/ should generally be considered last.
By doing 1/ and 2/, there will be multiple work-groups, each with multiple work-items, running at the same time on the FPGA device.
Note: depending on the nature of the problem you are solving, the above optimizations may not always be suitable.
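For illustration, a minimal sketch of option 1/ (the kernel name, arguments, and attribute values here are only examples):
__attribute__((num_simd_work_items(4)))
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    // Each work-item handles one element; num_simd_work_items(4) asks the
    // compiler to vectorise the datapath so that 4 work-items issue per cycle.
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}
Note that num_simd_work_items must evenly divide the required work-group size.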
If you're talking about replicating the kernel more than once, you can increase the number of compute units. There is an attribute that you can add before the kernel.
__attribute__((num_compute_units(N)))
__kernel void test(...){
...
}
By doing this you essentially replicate the kernel N times. However, the programming guide states that you should probably first look into using the SIMD attribute, where the same operation is performed over multiple data. That way, access to global memory becomes more efficient. If you increase the number of compute units and your kernels access global memory, there can be contention as multiple compute units compete for access to global memory.
You can also replicate operations at a fine-grained level by using loop unrolling. For example,
#pragma unroll N
for (short i = 0; i < N; i++)
    sum[i] = a[i] + b[i];
This essentially performs the element-wise summation of two vectors N elements at a time by creating hardware that does the addition N times. If the data is dependent on the previous iteration, the unrolled iterations are pipelined instead.
On the other hand, if your goal is to launch different kernels with different operations, you can do that by putting the kernels in one OpenCL file. When you compile it, the kernels in the file are mapped, placed, and routed onto the FPGA together. Afterwards, you just need to invoke each kernel from your host by calling clEnqueueNDRangeKernel or clEnqueueTask. The kernels will run side by side in parallel after you enqueue the commands.
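As a sketch of that layout (the kernel names and operations here are purely illustrative), two independent single-work-item kernels placed in the same .cl file compile into separate pipelines and can then be enqueued on separate command queues:
__kernel void scale_by_two(__global const float *in, __global float *out, int n)
{
    // Compiled into its own pipeline on the FPGA.
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * 2.0f;
}

__kernel void add_offset(__global const float *in, __global float *out, int n)
{
    // A second, independent pipeline; both kernels can run concurrently once
    // enqueued (e.g. with clEnqueueTask) from the host.
    for (int i = 0; i < n; ++i)
        out[i] = in[i] + 1.0f;
}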

Is there a way to end idle threads in GNU OpenMP?

I use OpenMP for parallel sorting at the start of my program. Once the data is loaded and sorted, the program runs as a daemon and OpenMP is not used any more. Is there a way to shut down the idle threads created by OpenMP? omp_set_num_threads() doesn't affect the idle threads that have already been created for a task.
Please look up OMP_WAIT_POLICY (https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fWAIT_005fPOLICY.html), which is new in OpenMP 4.
There are non-portable alternatives like GOMP_SPINCOUNT if your OpenMP implementation isn't recent enough. I recall from OpenMP specification discussions that at least Intel, IBM, Cray, and Oracle support their own implementation of this feature already.
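For example (the daemon binary name here is hypothetical), with libgomp you would set the policy in the environment before launching:
OMP_WAIT_POLICY=passive ./mydaemon
GOMP_SPINCOUNT=0 ./mydaemon
With a passive wait policy the idle worker threads sleep instead of spin-waiting; they still exist, but they consume essentially no CPU.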
I don't believe there is a way to trigger the threads' destruction. Modern OpenMP implementations tend to keep threads around in a pool to speed up starting future parallel sections.
In your case I would recommend a two-program solution (one parallel program to sort and one serial program for the daemon). How you communicate the data between them is up to you. You could do something simple like writing it to a file and then reading it again. This may not be as slow as it sounds, since a modern Linux distribution might keep that file in memory in the file cache.
If you really want to be sure it stays in memory, you could launch the two processes simultaneously and allow them to share memory and allow the first parallel sort process to exit when it is done.
In theory, OpenMP has an implicit synchronization at the end of its "pragma" clauses, so when the OpenMP parallel work ends, all the threads are deleted. You don't need to kill them or free them: OpenMP does that automatically.
Maybe omp_get_num_threads() is telling you the configured number of threads for the program, not the number of currently active threads. I mean: if you set the number of threads to 4, OpenMP will tell you that the configuration is "4 threads", but this does not mean that there are actually 4 threads running.

'new' and 'delete' is not as scalable as intel thread building block scalable_malloc/free

To be clear, this is not an advertisement for the TBB library, just something I found recently that quite surprised me.
I did a little googling on heap contention, and it seems that after glibc 2.3, 'new' and 'delete' were improved to support multiprocessors very well. My glibc is 2.5. For the following very simple code:
tbb::tick_count t1 = tbb::tick_count::now();
for (size_t i = 0; i < 100000; ++i)
{
    char * str = new char [100];
    delete [] str;
}
tbb::tick_count t2 = tbb::tick_count::now();
std::cout << "process time = " << (t2 - t1).seconds() << std::endl;
I have a Linux box with 16 CPU cores, and I started 1 and 8 threads, respectively, to run the code above. The first thing that surprised me is that the process time was lower when 8 threads were running. That made no sense to me - how is this even possible?
In another test, instead of the simple code above, each thread ran a quite complex algorithm that also does a lot of new and delete. As the thread count increased from 1 to 8, the processing time increased by almost 100%.
You may ask how I know it is 'new' and 'delete' that caused the time increase: after I replaced 'new' and 'delete' with TBB's scalable_malloc/scalable_free, the processing time only increased by around 5% when the thread count went from 1 to 8.
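For reference, that substitution looks roughly like this (a sketch assuming TBB's tbb/scalable_allocator.h header, which provides the C-style scalable_malloc/scalable_free interface):
#include <tbb/scalable_allocator.h>

for (size_t i = 0; i < 100000; ++i)
{
    char * str = static_cast<char *>(scalable_malloc(100));
    scalable_free(str);
}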
Here is one more mystery to me: why didn't 'new' and 'delete' scale as well as they did in the earlier simple code?
Another mystery: if I add the earlier simple code at the front of the algorithm that each thread runs, then there is no time increase at all when I increase the thread count from 1 to 8.
I was quite surprised by these results. Could anyone please give an explanation for them? Many thanks.
This is not a mystery at all. It is well known that memory allocation in multi-threaded applications suffers from increasing thread blocking time (in particular, this shows up as threads sleeping in the kernel's TASK_UNINTERRUPTIBLE state on Linux). Allocation of memory from the heap can quickly become a bottleneck, since the standard allocator deals with allocation requests from several threads by serializing them. These are the main reasons for the degraded performance you are seeing. This, of course, is what has driven the development of more efficient allocators. You cited the TBB one, but there are other freely available alternatives.
See, for instance, the ThreadAlloc library which, as stated by its author, "provides about 10 times benefit in performance comparing with standard allocator at SMP platforms for multithreaded application intensively using dynamic memory allocation".
Another option is Hoard.

strange behavior of an OpenMP program

I'm debugging an OpenMP program, and its behavior is strange.
1) If a simple program P (a while(1) loop) occupies one core at 100%, the OpenMP program pauses even though it occupies all the remaining cores. Once I terminate program P, the OpenMP program continues to execute.
2) The OpenMP program can execute successfully in situation 1 if I set OMP_NUM_THREADS to 32/16/8.
I tested on both an 8-core x64 machine and a 32-core Itanium machine. The former uses GCC and libgomp; the latter uses the proprietary aCC compiler and libraries. So it is unlikely to be related to the compiler/library.
Could you help point out any possible reasons that may cause this? Why can it be affected by another program?
Thanks.
I am afraid that you need to give more information.
What is the OS you are running on?
When you run with 16 threads, are you doing this on the 8-core or the 32-core machine?
What is the simple while(1) program doing inside its loop?
What is the OpenMP program doing (in general terms - if you can't be specific)?
Have you tried using a profiling tool to see what the OpenMP program is doing?
