What are the gcc command line statements to know the pthread calls for openmp directives? I know about the -fdump command line statements for generating IR file in assembly, gimple, rtl, trees. But I am unable to get any pthread dumps for openmp directives.
GCC does not directly convert OpenMP pragmas into Pthreads code. Rather it converts each OpenMP construct into a set of calls to the GNU OpenMP run-time library libgomp. You could get the intermediate representation by compiling with -fdump-tree-all. Look for a file (or files) with extension .ompexp.
Example:
#include <stdio.h>
int main() {
    int i;
    #pragma omp parallel for
    for(i=0; i<100; i++) {
        printf("asdf\n");
    }
}
The corresponding section of the .ompexp file that implements the parallel region:
<bb 2>:
__builtin_GOMP_parallel_start (main.omp_fn.0, 0B, 0);
main.omp_fn.0 (0B);
__builtin_GOMP_parallel_end ();
GCC implements parallel regions via code outlining and in that case main.omp_fn.0 is the function that contains the body of the parallel region. In the function itself (omitted here for brevity) the for worksharing construct is implemented by using some simple mathematical calculations that determine the range of iterations for the corresponding thread.
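As a rough illustration of those calculations, the sketch below splits an iteration space among threads the way a static schedule without a chunk size typically does. The function name static_chunk and its parameters are made up for this example; the code GCC actually emits inside main.omp_fn.0 may differ in detail.

```c
/* Hypothetical helper, not part of libgomp: computes the half-open
 * iteration range [*start, *end) that thread `tid` out of `nthreads`
 * would handle for a loop of `n` iterations under a plain static schedule. */
void static_chunk(long n, int nthreads, int tid, long *start, long *end)
{
    long q = n / nthreads;   /* base number of iterations per thread */
    long r = n % nthreads;   /* leftover iterations to distribute */

    if (tid < r) {
        /* the first r threads take one extra iteration each */
        q += 1;
        r = 0;
    }
    *start = q * tid + r;
    *end = *start + q;
}
```

For the 100-iteration loop above run on 8 threads, this gives thread 0 the range [0, 13) and thread 7 the range [88, 100).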
I have a shared variable s and private variable p inside parallel region.
How can I do the following atomically (or at least better than with critical section):
if ( p > s )
s = p;
else
p = s;
I.e., I need to update the global maximum (if local maximum is better) or read it, if it was updated by another thread.
OpenMP 5.1 introduced the compare clause which allows compare-and-swap (CAS) operations such as
#pragma omp atomic compare
if (s < p) s = p;
In combination with a capture clause, you should be able to achieve what you want:
int s_cap;
// here we capture the shared variable and also update it if p is larger
#pragma omp atomic compare capture
{
s_cap = s;
if (s < p) s = p;
}
// update p if the captured shared value is larger
if (s_cap > p) p = s_cap;
The only problem? The 5.1 spec is very new and, as of today (2020-11-27), none of the widespread compilers, i.e., those available on Godbolt, supports OpenMP 5.1. See here for a more or less up-to-date list. Adding compare is still listed as an unclaimed task on Clang's OpenMP page. GCC is still working on full OpenMP 5.0 support and the trunk build on Godbolt doesn't recognise compare. Intel's oneAPI compiler may or may not support it - it's not available on Godbolt and I can't get it to compile OpenMP code.
Your best bet for now is to use atomic capture combined with a compiler-specific CAS atomic, possibly in a loop.
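A CAS loop of that kind might look like the sketch below. It swaps in C11 <stdatomic.h> for a compiler-specific builtin, and it assumes your toolchain lets C11 atomics be shared between OpenMP threads (common in practice, but not something I'd treat as guaranteed). The name atomic_fetch_max_int is made up for this example.

```c
#include <stdatomic.h>

/* Hypothetical helper: atomically raises *s to at least p and returns
 * the value *s held before this call had any effect. */
int atomic_fetch_max_int(atomic_int *s, int p)
{
    int seen = atomic_load(s);
    /* Retry only while our p would still be an improvement. On failure,
     * atomic_compare_exchange_weak reloads the current value into seen. */
    while (seen < p && !atomic_compare_exchange_weak(s, &seen, p)) {
        /* another thread changed *s in the meantime; loop and re-check */
    }
    return seen;
}
```

The caller would then finish the original if/else with something like: old = atomic_fetch_max_int(&s, p); if (old > p) p = old;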
I'm trying to use openmp in cython. I need to do two things in cython:
i) use the #pragma omp single{} scope in my cython code.
ii) use the #pragma omp barrier{}
Does anyone know how to do this in cython?
Here are more details. I have a nogil cdef-function my_func() which I call in an omp for-loop:
from cython.parallel cimport prange
cimport openmp
cdef int i
with nogil:
    for i in prange(10, schedule='static', num_threads=10):
        my_func(i)
Inside my_func I need to place a barrier to wait for all threads to catch up, then execute a time-consuming operation only in one of the threads and with the gil acquired, and then release the barrier so all threads resume simultaneously.
cdef int my_func(...) nogil:
    ...
    # put a barrier until all threads catch up, e.g. #pragma omp barrier
    with gil:
        # execute time consuming operation in one thread only, e.g. pragma omp single{}
    # remove barrier after the above single thread has finished and continue the operation over all threads in parallel, e.g. #pragma omp barrier
    ...
Cython has some support for openmp, but it is probably easier to code in C and to wrap resulting code with Cython if openmp-pragmas are used extensively.
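A minimal sketch of what such a C function could look like is shown below; the name parallel_with_single and the toy workload are invented for illustration. Note that OpenMP does not allow a barrier inside the body of a worksharing loop (which is what prange generates), so the sketch splits the work into two loops inside one parallel region and puts the single step between them.

```c
/* Hypothetical helper: fills data[] in parallel, lets exactly one thread
 * run a "slow" step, then continues in parallel. Returns how many times
 * the single step ran (expected: 1). */
int parallel_with_single(int n, int *data)
{
    int single_runs = 0;

    #pragma omp parallel shared(single_runs, data)
    {
        /* phase 1: shared among the threads */
        #pragma omp for
        for (int i = 0; i < n; i++)
            data[i] = i;

        /* explicit barrier for clarity; the loop above already ends with one */
        #pragma omp barrier

        /* executed by one thread only; the others wait at the implicit
         * barrier at the end of the single construct */
        #pragma omp single
        {
            single_runs++;
        }

        /* phase 2: all threads resume */
        #pragma omp for
        for (int i = 0; i < n; i++)
            data[i] *= 2;
    }
    return single_runs;
}
```

Built with -fopenmp, this could then be declared in a cdef extern block and called from Cython like any other C function.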
As an alternative, you could use verbatim C code and tricks with defines to bring some of the functionality to Cython, but using pragmas in defines isn't straightforward (_Pragma is the C99 solution; MSVC does its own thing, as always, with __pragma). Here are some examples as a proof of concept for Linux/gcc:
cdef extern from *:
    """
    #define START_OMP_PARALLEL_PRAGMA() _Pragma("omp parallel") {
    #define END_OMP_PRAGMA() }
    #define START_OMP_SINGLE_PRAGMA() _Pragma("omp single") {
    #define START_OMP_CRITICAL_PRAGMA() _Pragma("omp critical") {
    """
    void START_OMP_PARALLEL_PRAGMA() nogil
    void END_OMP_PRAGMA() nogil
    void START_OMP_SINGLE_PRAGMA() nogil
    void START_OMP_CRITICAL_PRAGMA() nogil
We make Cython believe that START_OMP_PARALLEL_PRAGMA() and co. are nogil functions, so it puts them into the C code, where they get picked up by the preprocessor.
We must use the syntax
#pragma omp single
{
//do_something
}
and not
#pragma omp single
do_something
because of the way Cython generates C-code.
The usage could look as follows (I'm avoiding here from cython.parallel.parallel as it does too much magic for this simple example):
%%cython -c=-fopenmp --link-args=-fopenmp
cdef extern from *:  # as listed above
    ...
def test_omp():
    cdef int a=0
    cdef int b=0
    with nogil:
        START_OMP_PARALLEL_PRAGMA()
        START_OMP_SINGLE_PRAGMA()
        a+=1
        END_OMP_PRAGMA()
        START_OMP_CRITICAL_PRAGMA()
        b+=1
        END_OMP_PRAGMA() # CRITICAL
        END_OMP_PRAGMA() # PARALLEL
    print(a,b)
Calling test_omp prints "1 2" on my machine with 2 threads, as expected (one could change the number of threads using openmp.omp_set_num_threads(10)).
However, the above is still very brittle - some error checking by Cython can lead to invalid code (Cython uses goto for control flow and it is not possible to jump out of openmp-block). Something like this happens in your example:
cimport numpy as np
import numpy as np
def test_omp2():
    cdef np.int_t[:] a=np.zeros(1,dtype=int)
    START_OMP_SINGLE_PRAGMA()
    a[0]+=1
    END_OMP_PRAGMA()
    print(a)
Because of bounds checking, Cython will produce:
START_OMP_SINGLE_PRAGMA();
...
//check bounds:
if (unlikely(__pyx_t_6 != -1)) {
__Pyx_RaiseBufferIndexError(__pyx_t_6);
__PYX_ERR(0, 30, __pyx_L1_error) // HERE WE GO A GOTO!
}
...
END_OMP_PRAGMA();
In this special case, setting boundscheck to False, i.e.
cimport cython
@cython.boundscheck(False)
def test_omp2():
    ...
would solve the issue for the above example, but probably not in general.
Once again: using openmp in C (and wrapping the functionality with Cython) is a more enjoyable experience.
As a side note: Python threads (the ones governed by the GIL) and OpenMP threads are different and know nothing about each other. The above example would also work (compile and run) correctly without releasing the GIL: OpenMP threads do not care about the GIL, and as there are no Python objects involved, nothing can go wrong. Thus I have added nogil to the wrapped "functions", so they can also be used in nogil blocks.
However, as the code gets more complicated, it becomes less obvious that variables shared between different Python threads aren't accessed (above all because those accesses could happen in the generated C code, which isn't apparent from the Cython code). It might therefore be wiser not to release the GIL while using OpenMP.
How can I disable optimisations with the TASKING compiler? I'm using the Eclipse IDE.
I've read in the documentation that I could use #pragma, but I didn't understand how:
If you specify a certain optimization, all code in the module is subject to that optimization. Within the C
source file you can overrule the C compiler options for optimizations with #pragma optimize flag
and #pragma endoptimize. Nesting is allowed:
#pragma optimize e /* Enable expression
... simplification */
... C source ...
...
It seems the TASKING compiler is compatible with GCC with respect to optimization level flags, per this user guide (which is indeed quite old).
For disabling optimizations altogether, select None (-O0) as optimization level in the C/C++ project settings. Note that -O0 is the default optimization level of the Debug configuration.
Screenshot (Eclipse Oxygen):
If you wish to disable optimizations for a specific part of your C/C++ code, such as a specific function, then the pragma comes in handy. To do so, place #pragma optimize 0 before the start of the code and #pragma endoptimize after the end of it.
For example:
#pragma optimize 0
void myfunc()
{
// function body
}
#pragma endoptimize
This is related to How to disable OMP in a translation unit at the source file?. The patch I am working on has the following due to benchmarking results. It appears we need the ability to turn off OMP on the translation unit:
static const bool CRYPTOPP_RW_USE_OMP = true;
...
ModularArithmetic modp(m_p), modq(m_q);
#pragma omp parallel sections if(CRYPTOPP_RW_USE_OMP)
{
#pragma omp section
m_pre_2_9p = modp.Exponentiate(2, (9 * m_p - 11)/8);
#pragma omp section
m_pre_2_3q = modq.Exponentiate(2, (3 * m_q - 5)/8);
#pragma omp section
m_pre_q_p = modp.Exponentiate(m_q, m_p - 2);
}
The patch also applies to a cross platform library (Linux, Unix, Solaris, BSDs, OS X and Windows), and it supports a lot of older compilers. I need to ensure that I don't break a compile.
Question: how portable is the #pragma omp parallel sections if(CRYPTOPP_RW_USE_OMP)? Will using it break compiles that used to work with just #pragma omp parallel sections?
I tried looking at past OpenMP specifications, like 2.0, but I can't see where it's allowed in the grammar (see Appendix C). The closest I could find is the parallel-directive production (line 22), which leads to parallel-clause (line 24) and then unique-parallel-clause.
And looking at documentation for platforms I can't test on, it's not clear to me if it's available. For example, Microsoft's documentation for Visual Studio 2005 appears to only allow it on a loop.
In the very document you link (page 8, section 2.2, parallel Construct), if is among the available clauses (the first one). It is part of the standard, so it is portable across all conforming compilers.
In your MSDN link:
if applies to the following directives:
parallel
for (OpenMP)
sections (OpenMP)
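For reference, here is a small self-contained sketch of the combined construct with an if clause. The function names and the toy workload are invented for the example; it only illustrates the syntax and is not taken from the library in question.

```c
/* Stand-in for an expensive, independent computation (hypothetical). */
static long square(long x) { return x * x; }

/* Runs three independent computations, in parallel only when use_omp
 * is non-zero; with use_omp == 0 the region executes on one thread. */
long sum_of_squares3(long a, long b, long c, int use_omp)
{
    long ra = 0, rb = 0, rc = 0;

    #pragma omp parallel sections if(use_omp)
    {
        #pragma omp section
        ra = square(a);
        #pragma omp section
        rb = square(b);
        #pragma omp section
        rc = square(c);
    }
    return ra + rb + rc;
}
```

A compiler without OpenMP enabled simply ignores the pragmas and runs the three statements in order, so the result is the same either way.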
I am a complete beginner with OpenCL. I searched the internet and found some "hello world" demos for OpenCL projects. Usually, such a minimal project has a *.cl file containing the OpenCL kernels and a *.c file containing the main function. The question is: how do I compile this kind of project from the command line? I know I should use the -lOpenCL flag on Linux and -framework OpenCL on Mac, but I have no idea how to link the *.cl kernel to my main source file. Thank you for any comments or useful links.
In OpenCL, the .cl files that contain device kernel code are usually compiled and built at run time. This means that somewhere in your host OpenCL program, you have to compile and build your device program before you can use it. This feature enables maximum portability.
Let's consider an example I collected from two books. Below is a very simple OpenCL kernel adding two numbers from two global arrays and saving them in another global array. I save this code in a file named vector_add_kernel.cl.
kernel void vecadd( global int* A, global int* B, global int* C ) {
const int idx = get_global_id(0);
C[idx] = A[idx] + B[idx];
}
Below is the host code written in C++ that exploits OpenCL C++ API. I save it in a file named ocl_vector_addition.cpp beside where I saved my .cl file.
#include <iostream>
#include <fstream>
#include <iterator>
#include <string>
#include <memory>
#include <utility>
#include <vector>
#include <stdlib.h>
#define __CL_ENABLE_EXCEPTIONS
#if defined(__APPLE__) || defined(__MACOSX)
#include <OpenCL/cl.hpp>
#else
#include <CL/cl.hpp>
#endif
int main( int argc, char** argv ) {
const int N_ELEMENTS=1024*1024;
unsigned int platform_id=0, device_id=0;
try{
std::unique_ptr<int[]> A(new int[N_ELEMENTS]); // Or you can use simple dynamic arrays like: int* A = new int[N_ELEMENTS];
std::unique_ptr<int[]> B(new int[N_ELEMENTS]);
std::unique_ptr<int[]> C(new int[N_ELEMENTS]);
for( int i = 0; i < N_ELEMENTS; ++i ) {
A[i] = i;
B[i] = i;
}
// Query for platforms
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
// Get a list of devices on this platform
std::vector<cl::Device> devices;
platforms[platform_id].getDevices(CL_DEVICE_TYPE_GPU|CL_DEVICE_TYPE_CPU, &devices); // Select the platform.
// Create a context
cl::Context context(devices);
// Create a command queue
cl::CommandQueue queue = cl::CommandQueue( context, devices[device_id] ); // Select the device.
// Create the memory buffers
cl::Buffer bufferA=cl::Buffer(context, CL_MEM_READ_ONLY, N_ELEMENTS * sizeof(int));
cl::Buffer bufferB=cl::Buffer(context, CL_MEM_READ_ONLY, N_ELEMENTS * sizeof(int));
cl::Buffer bufferC=cl::Buffer(context, CL_MEM_WRITE_ONLY, N_ELEMENTS * sizeof(int));
// Copy the input data to the input buffers using the command queue.
queue.enqueueWriteBuffer( bufferA, CL_FALSE, 0, N_ELEMENTS * sizeof(int), A.get() );
queue.enqueueWriteBuffer( bufferB, CL_FALSE, 0, N_ELEMENTS * sizeof(int), B.get() );
// Read the program source
std::ifstream sourceFile("vector_add_kernel.cl");
std::string sourceCode( std::istreambuf_iterator<char>(sourceFile), (std::istreambuf_iterator<char>()));
cl::Program::Sources source(1, std::make_pair(sourceCode.c_str(), sourceCode.length()));
// Make program from the source code
cl::Program program=cl::Program(context, source);
// Build the program for the devices
program.build(devices);
// Make kernel
cl::Kernel vecadd_kernel(program, "vecadd");
// Set the kernel arguments
vecadd_kernel.setArg( 0, bufferA );
vecadd_kernel.setArg( 1, bufferB );
vecadd_kernel.setArg( 2, bufferC );
// Execute the kernel
cl::NDRange global( N_ELEMENTS );
cl::NDRange local( 256 );
queue.enqueueNDRangeKernel( vecadd_kernel, cl::NullRange, global, local );
// Copy the output data back to the host
queue.enqueueReadBuffer( bufferC, CL_TRUE, 0, N_ELEMENTS * sizeof(int), C.get() );
// Verify the result
bool result=true;
for (int i=0; i<N_ELEMENTS; i ++)
if (C[i] !=A[i]+B[i]) {
result=false;
break;
}
if (result)
std::cout<< "Success!\n";
else
std::cout<< "Failed!\n";
}
catch(cl::Error err) {
std::cout << "Error: " << err.what() << "(" << err.err() << ")" << std::endl;
return( EXIT_FAILURE );
}
std::cout << "Done.\n";
return( EXIT_SUCCESS );
}
I compile this code on a machine with Ubuntu 12.04 like this:
g++ ocl_vector_addition.cpp -lOpenCL -std=c++11 -o ocl_vector_addition.o
It produces ocl_vector_addition.o (despite the .o suffix, this is a linked executable, not an object file), which shows successful output when I run it. If you look at the compilation command, you see we have not passed anything about our .cl file. We have only used the -lOpenCL flag to link the OpenCL library into our program. Also, don't get distracted by the -std=c++11 flag. Because I used std::unique_ptr in the host code, I had to use this flag for a successful compile.
So where is this .cl file being used? If you look at the host code, you'll find four parts, which I repeat below with numbers:
// 1. Read the program source
std::ifstream sourceFile("vector_add_kernel.cl");
std::string sourceCode( std::istreambuf_iterator<char>(sourceFile), (std::istreambuf_iterator<char>()));
cl::Program::Sources source(1, std::make_pair(sourceCode.c_str(), sourceCode.length()));
// 2. Make program from the source code
cl::Program program=cl::Program(context, source);
// 3. Build the program for the devices
program.build(devices);
// 4. Make kernel
cl::Kernel vecadd_kernel(program, "vecadd");
In the 1st step, we read the content of the file that holds our device code and put it into a std::string named sourceCode. Then we make a pair of the string and its length and save it to source, which has the type cl::Program::Sources. In the 2nd step, we make a cl::Program object named program for the context and load the source code into it. The 3rd step is the one in which the OpenCL code gets compiled (and linked) for the device. Once the device code is built, the 4th step creates a cl::Kernel object named vecadd_kernel and associates the kernel named vecadd with it. This is pretty much the set of steps involved in compiling a .cl file in a program.
The program I showed and explained creates the device program from the kernel source code. Another option is to use binaries instead. Using a binary program improves application loading time and allows binary distribution of the program, but limits portability, since binaries that work fine on one device may not work on another. Creating a program from source code and from a binary are also called online and offline compilation, respectively (more information here). I skip it here since the answer is already too long.
My answer comes four years late. Nevertheless, I have something to add that complements @Farzad's answer, as follows.
Confusingly, in OpenCL practice, the verb to compile is used to mean two different, incompatible things:
In one usage, to compile means what you already think that it means. It means to build at build-time, as from *.c sources to produce *.o objects for build-time linking.
However, in another usage—and this other usage may be unfamiliar to you—to compile means to interpret at run time, as from *.cl sources, producing GPU machine code.
One happens at build-time. The other happens at run-time.
It might have been less confusing had two different verbs been introduced, but that is not how the terminology has evolved. Conventionally, the verb to compile is used for both.
If unsure, then try this experiment: rename your *.cl file so that your other source files cannot find it, then build.
See? It builds fine, doesn't it?
This is because the *.cl file is not consulted at build time. Only later, when you try to execute the binary executable, does the program fail.
If it helps, you can think of the *.cl file as though it were a data file or a configuration file or even a script. It isn't literally a data file, a configuration file or a script, perhaps, for it does eventually get compiled to a kind of machine code, but the machine code is GPU code and it is not made from the *.cl program text until run-time. Moreover, at run-time, your C compiler as such is not involved. Rather, it is your OpenCL library that does the building.
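To make the "data file" view concrete, here is a small sketch in plain C of the only thing the host program does with the *.cl file before handing it to the OpenCL library: it reads the text into memory at run time. The helper name read_text_file is made up for this example; in a real host program the returned string would then be passed to something like clCreateProgramWithSource.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: loads a whole file into a NUL-terminated buffer.
 * Returns NULL if the file cannot be opened or read, which is exactly
 * the run-time (not build-time) failure described above.
 * The caller owns the buffer and must free() it. */
char *read_text_file(const char *path, size_t *len_out)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;

    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long size = ftell(f);
    if (size < 0) { fclose(f); return NULL; }
    rewind(f);

    char *buf = malloc((size_t)size + 1);
    if (!buf) { fclose(f); return NULL; }

    size_t got = fread(buf, 1, (size_t)size, f);
    fclose(f);

    buf[got] = '\0';
    if (len_out)
        *len_out = got;
    return buf;
}
```

Nothing in this code mentions the kernel's contents, which is why renaming the *.cl file has no effect on the build.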
It took me a fairly long time to straighten these concepts in my mind, mostly because—like you—I had long been familiar with the stages of the C/C++ build cycle; and, therefore, I had thought that I knew what words like to compile meant. Once your mind has the words and concepts straight, the various OpenCL documentation begins to make sense, and you can start work.