I'm reading an article about CUDA and it says "A CUDA program is a serial program with parallel kernels". My questions are:
What does it mean for it to be a serial program? I know that serial is the opposite of parallel, but what does that mean in terms of CUDA's code being run on different processors, different cores, etc? I know the point of CUDA is that it facilitates parallel programming, so I'm interested to know which part of it is serial.
What does it mean to have multiple kernels? I've always understood the kernel to be a part of the operating system, and I think CUDA is just software that runs within the operating system, right? How does CUDA have multiple kernels and how does it use them to achieve parallelism?
A CUDA kernel is written from the standpoint of a single thread. It answers the question "what will each thread do?" A CUDA kernel gives a single definition for what every thread will do. From the standpoint of a single thread, it appears to be a serial program. However it becomes parallel at launch time, when many threads execute the same code, "in parallel".
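For concreteness, here is a minimal sketch (a hypothetical vector add, not taken from the article) showing both halves of that statement: the kernel body reads like serial code for one thread, and the launch is what makes it parallel.

    // Written from the standpoint of a single thread: "what will this thread do?"
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
        if (i < n)                                      // guard threads past the end
            c[i] = a[i] + b[i];
    }

    // The parallelism appears at launch time: this one host-side call creates
    // (n + 255) / 256 blocks of 256 threads, all executing vecAdd concurrently.
    // vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);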
I think you're misinterpreting. Saying that CUDA has "parallel kernels" means that each kernel in CUDA has the opportunity to express parallelism (according to how it is written, and to CUDA concepts such as the built-in variables) and to manifest it (at launch time, across many threads of execution). It does not mean that CUDA requires multiple kernels to express parallelism. A single CUDA kernel launch is inherently parallel.
You may wish to read the CUDA programming guide.
The vast majority of CPUs coming out nowadays contain multiple cores which can operate at the same time - in parallel.
I'm just wondering, from the point of executing a program as quickly as possible using all available CPU cores, does a programmer need to take into consideration that the software being developed will be running on a multi-core CPU? For instance, would the software being developed have to be manually configured to assign different tasks to each CPU core? Or does the OS/CPU automatically identify and choose which parts of a program can run - in parallel - on different cores?
Apologies if this may seem like a simple or silly question. I'm completely new to the topic of parallel programming and I've come across some conflicting information early on in my research - some sources state that the programmer must manually configure their software in order to utilise more than one CPU core (the more believable option in my opinion) - and other sources state that the OS/CPU automatically identifies and chooses which tasks can be run in parallel on different CPU cores (the less believable option in my opinion due to the complexity involved in automatically identifying this).
Just in case different Operating Systems, CPUs or Programming Languages perform differently in a parallel computing or multi-core environment - I will be using Windows 7 as my OS, an Intel Dual Core i7 Processor, and OpenCL as the programming language.
Any help is much appreciated.
In practice this occurs semi-automatically.
A more detailed answer will depend on the nature of your application, your preferred programming model and your target architecture.
More explanation:
In order to exploit multicore hardware efficiently (in your case, keeping as many cores busy as possible), you first of all 1) need to "parallelize" the algorithm itself, i.e. make it "concurrent", and 2) use one of the multi-threading (most often) or multi-process (more rarely) parallel programming APIs, such as OpenMP, Intel TBB, OpenCL, POSIX Threads or (for the multi-process case) MPI, in order to efficiently, and often automatically, assign different "pieces" of your concurrent program to different threads (or, more rarely, processes).
One of the simplest possible examples of this kind of parallel programming (using OpenMP) is given here.
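As a rough illustration (my own minimal sketch, not the linked example), an OpenMP parallel loop in C looks like the following; compiled with, for example, gcc -fopenmp, the iterations are divided among the available threads, and thus cores, automatically.

    void scale(float *x, int n)
    {
        /* The pragma asks the OpenMP runtime to split the loop iterations
           across the available threads (and thus CPU cores). */
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            x[i] *= 2.0f;
    }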
Now, you've said that you are using OpenCL as a programming model for the CPU. In certain cases, when you use vendor-provided OpenCL implementations (like Intel's OpenCL), you can semi-automatically assign an OpenCL kernel to be executed by various threads using "NDRange" and other OpenCL concepts, as explained here for the Intel Xeon Phi co-processor (not exactly CPU programming, but a similar idea) or here (a more general, but more advanced, article).
However, using OpenCL as a general-purpose multi-threading API for the CPU is definitely not the simplest approach, and it is not always optimal in terms of final performance. There are certain application types where OpenCL makes sense for general-purpose CPU multi-threading, but again it depends very much on the nature of your algorithm and on the target architecture.
There is one rather dated, but still reasonable, post about OpenCL vs. OpenMP/TBB on Stack Overflow. It is dated in the sense that OpenMP 4.0 now also provides solid capabilities for combined threading + SIMD programming (which may interest you in the future if you explore this topic in more detail). That's why I would say that OpenMP seems to be the number-one choice nowadays, but TBB, MPI or OpenCL might also be appropriate in certain cases.
I am using a Fortran code to run a large-scale simulation on a supercomputer. I am able to run the code in serial, but I want to improve the turnaround time. I am looking into making it parallel, and I have found that I can use auto-parallelization or MPI. The question I have is: which is more likely to improve the turnaround time?
I was able to use the Intel Fortran compiler with the flags -parallel -par-report to see which DO loops were made parallel. If I run the compiled code on 4 processors, would that actually work, or do I have to do something special?
In addition, do you know of any useful resources for me to learn MPI? I want to be able to use more processors to speed up the simulation; that is my end goal.
More than likely, MPI is going to be faster than auto-parallelization. However, auto-parallelization would take about 0.5 seconds worth of work to get a speed-up of, say, 1.2 compared to Y hours (maybe even up to Q weeks) of trial-and-error debugging to get a speed-up of, say, 1.7.
If you're interested in self-learning MPI through a book, Gropp, Lusk, & Skjellum's Using MPI is probably a good start.
The answer depends a bit on the nature of your hardware and of your application/workload. Do you use a multi-node cluster (most typical) or a big shared-memory machine? Assuming you are a cluster user, you will have to use MPI or Fortran coarrays for (more likely) distributed-memory, cross-node parallelism AND something else for intra-node shared-memory parallelism (SMP).
Shared-memory parallelism can give you a speed-up proportional to the number of cores on a node (up to 32x with Xeons), or even more with coprocessors. Distributed-memory parallelism can give you a speed-up proportional to the number of nodes. Both types (or actually all 3 types) of parallelism have to be used these days to get reasonable performance. You may think of it as a hierarchy: 1. MPI or coarrays at the top, 2. something for shared-memory threading in the middle, and 3. vectorization at the innermost level.
Well, from your question, it sounds like you are talking mostly about the SMP multicore threading level of parallelism. This is where -parallel auto-parallelization operates. Don't expect big magic from auto-parallelization. If you want more scalable parallelism, you have to try Fortran OpenMP or MPI for shared memory. I would recommend OpenMP in most cases; it is often easier to program and gives better performance.
But it's up to you, and you really should think bigger: about all 3 levels of parallelism. If you plan to address all 3 levels, then the probably optimal combination (since you are a happy Intel Fortran user) is: 1. MPI for the first level, 2. OpenMP for the SMP level, and 3. auto-vectorization guided by the OpenMP 4.0 simd directive at the third level. I'm not an expert in coarrays, but they might be a good alternative to MPI for the first level.
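To make the three levels concrete, here is a rough structural sketch. It is shown in C for brevity (the same structure applies in Fortran with mpif90 and !$omp directives), and the function and variable names are placeholders, not part of any real code base.

    /* Level 1: MPI ranks across nodes.  Level 2: OpenMP threads across the
       cores of each node.  Level 3: SIMD vectorization inside each core.
       Compile, for example, with: mpicc -fopenmp -O2 hybrid.c */
    #include <mpi.h>
    #include <omp.h>

    void compute_chunk(double *x, long n)
    {
        #pragma omp parallel for            /* level 2: threads across cores */
        for (long i = 0; i < n; ++i) {
            /* level 3: this simple loop body is a candidate for
               auto-vectorization (or an explicit inner "omp simd" loop) */
            x[i] = x[i] * x[i] + 1.0;
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);             /* level 1: typically one rank per node */
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* ... each rank works on its own slice of the global problem,
           calling compute_chunk() and exchanging boundary data via MPI ... */

        MPI_Finalize();
        return 0;
    }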
My answer makes less sense if you are not dealing with classic cluster hardware.
I need to write an application that hashes words from a dictionary to make WPA pre-shared-keys. This is my thesis for a "Networking Security" course. The application needs to be parallel for increased performance. I have some experience with MPI from my IT studies but I would like to tie it up with CUDA. The idea is to use MPI to distribute the load evenly to the nodes of the cluster and then utilize CUDA to run the individual chunks in parallel inside the GPUs of the nodes.
Distributing the load with MPI is something I can easily do and have done in the past. Also computing with CUDA is something I can learn. There is also a project (pyrit) that does more or less what I need to do (actually a lot more) and I can get ideas from there.
I would like some advice on how to make the connection between MPI and CUDA. If somebody has built anything like this, I would greatly appreciate their advice and suggestions. Also, if you happen to know of any resources on the topic, please point me to them.
Sorry for the lengthy intro but I thought it was necessary to give some background.
This question is largely open-ended, so it's hard to give a definitive answer. This one is just a summary of the comments made by High Performance Mark, Jonathan Dursi and me. I do not claim authorship and have therefore made this answer a community wiki.
MPI and CUDA are orthogonal. The former is an IPC middleware and is used to communicate between processes (possibly residing on separate nodes) while the latter provides highly data-parallel shared-memory computing to each process that uses it. You can break the task into many small subtasks and use MPI to distribute them to worker processes running on the network. The master/worker approach is suitable for this kind of application, especially if words in the dictionary vary greatly in their length and variance in processing time is to be expected. Provided with all the necessary input values, worker processes can then use CUDA to perform the necessary computations in parallel and then return results back using MPI. MPI also provides the mechanisms necessary to launch and control multinode jobs.
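To give a feel for how the pieces fit together, here is a hypothetical sketch of the worker side only (this is not pyrit's code, and the "hash" is a toy stand-in for the real WPA/PBKDF2 computation): the master at rank 0 hands out fixed-size blocks of candidate words, each worker hashes its block on the GPU with one thread per word, and the digests go back over MPI.

    #include <mpi.h>
    #include <cuda_runtime.h>

    #define WORDS_PER_BLOCK 1024
    #define WORD_LEN        64
    #define TAG_WORK 1
    #define TAG_STOP 2

    /* Toy stand-in for the real per-word hash: one GPU thread per word. */
    __global__ void hash_kernel(const char *words, unsigned int *digests)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < WORDS_PER_BLOCK) {
            unsigned int h = 2166136261u;              /* FNV-style toy hash */
            for (int j = 0; j < WORD_LEN && words[i * WORD_LEN + j]; ++j)
                h = (h ^ (unsigned char)words[i * WORD_LEN + j]) * 16777619u;
            digests[i] = h;
        }
    }

    void worker_loop(void)
    {
        char words[WORDS_PER_BLOCK * WORD_LEN];
        unsigned int digests[WORDS_PER_BLOCK];
        char *d_words;
        unsigned int *d_digests;
        cudaMalloc((void **)&d_words,   sizeof(words));
        cudaMalloc((void **)&d_digests, sizeof(digests));

        MPI_Status st;
        for (;;) {
            /* 1. Receive the next block of candidate words from rank 0. */
            MPI_Recv(words, sizeof(words), MPI_CHAR, 0, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP)
                break;

            /* 2. Hash the whole block on the GPU. */
            cudaMemcpy(d_words, words, sizeof(words), cudaMemcpyHostToDevice);
            hash_kernel<<<(WORDS_PER_BLOCK + 255) / 256, 256>>>(d_words, d_digests);
            cudaMemcpy(digests, d_digests, sizeof(digests), cudaMemcpyDeviceToHost);

            /* 3. Ship the digests back to the master. */
            MPI_Send(digests, WORDS_PER_BLOCK, MPI_UNSIGNED, 0, TAG_WORK,
                     MPI_COMM_WORLD);
        }
        cudaFree(d_words);
        cudaFree(d_digests);
    }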
Although MPI and CUDA could be used separately, modern MPI implementations provide some mechanisms that blur the boundary between the two. It could be either direct support for device pointers in MPI communication operations, which transparently call CUDA functions to copy memory when necessary, or it could even be support for RDMA to/from device memory without an intermediate copy to main memory. The former simplifies your code, while the latter can save varying amounts of time, depending on how your algorithm is structured. The latter also requires fairly new CUDA hardware and drivers as well as newer networking equipment (e.g. a newer InfiniBand HCA).
MPI libraries that support direct GPU memory operations include MVAPICH2 and the trunk SVN version of Open MPI.
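For the first style (device pointers passed straight to MPI), the usage can be as simple as the sketch below; whether it actually works depends on the MPI library having been built with CUDA support, otherwise you must cudaMemcpy into a host buffer before sending.

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* With a CUDA-aware MPI build, a device pointer can be handed directly to
       MPI_Send and the library stages (or RDMAs) the data itself. */
    void send_device_array(float *d_buf, int n, int dest)
    {
        MPI_Send(d_buf, n, MPI_FLOAT, dest, 0, MPI_COMM_WORLD);  /* device pointer */
    }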
So I've looked around online for some time to no avail. I'm new to using OpenMP and so am not sure of the terminology here, but is there a way to figure out a specific machine's mapping between the OpenMP thread number (given by omp_get_thread_num()) and the physical cores on which the threads will run?
Also, I am interested in how exactly OpenMP assigns threads; for example, is thread 0 always going to run in the same location when the same code is run on the same machine? Thanks.
Typically, the OS takes care of assigning threads to cores, including with OpenMP. This is by design, and a good thing - you normally would want the OS to be able to move a thread across cores (transparently to your application) as required, since it will interrupt your application at times.
Certain operating system APIs will allow thread affinity to be set. For example, on Windows, you can use SetThreadAffinityMask to force a thread onto a specific core.
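A minimal sketch of that call (Windows-specific; bit 0 of the mask selects the first logical core):

    #include <windows.h>

    /* Pin the calling thread to logical core 0.  Returns the thread's
       previous affinity mask, or 0 on failure. */
    DWORD_PTR pin_current_thread_to_core0(void)
    {
        return SetThreadAffinityMask(GetCurrentThread(), 1);
    }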
Most of the time Reed is correct: OpenMP itself doesn't care about the assignment of threads to cores (or processors). However, because of things like cache reuse and data locality, we have found that there are many cases where pinning threads to cores increases the performance of OpenMP. Therefore, if you look at most OpenMP implementations, you will find that there is usually some environment variable that can be set to "bind" threads to cores. The OpenMP ARB has not yet specified any "standard" way of doing this, so at this time it is left up to each OpenMP implementation to decide if and how this should be done. There has been a great deal of discussion about whether this should be included in the OpenMP spec, and if so, how it could best be done.
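If you just want to observe the current mapping on Linux, a small sketch like the one below works (sched_getcpu() is glibc/Linux-specific, and it only reports what the mapping happens to be at that instant): each OpenMP thread prints its thread number next to the core it is running on. The binding environment variables mentioned above are implementation-specific, for example GOMP_CPU_AFFINITY for GCC's libgomp or KMP_AFFINITY for Intel's runtime.

    #define _GNU_SOURCE
    #include <sched.h>      /* sched_getcpu(), Linux/glibc-specific */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* Without binding, the OS may move a thread to another core
               at any time, so this is a snapshot, not a guarantee. */
            printf("OpenMP thread %d is currently on core %d\n",
                   omp_get_thread_num(), sched_getcpu());
        }
        return 0;
    }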
The 1.0 spec for OpenCL just came out a few days ago (Spec is here) and I've just started to read through it. I want to know if it plays well with other high performance multiprocessing APIs like OpenMP (spec) and I want to know what I should learn. So, here are my basic questions:
If I am already using OpenMP, will that break OpenCL or vice-versa?
Is OpenCL more powerful than OpenMP? Or are they intended to be complementary?
Is there a standard way of connecting an OpenCL program to a standard C99 program (or any other language)? What is it?
Does anyone know if anyone is writing an OpenCL book? I'm reading the spec, but I've found books to be more helpful.
OpenMP and OpenCL are distinct, but can be made to work together. Neither of them should "break" the other.
For the sake of argument, let's assume there's a tradeoff between minimizing changes to an existing codebase and performance or computing power. OMP is "easy" in that you can apply it "magically" to embarrassingly parallel problems with a quick pragma or two.
OpenCL introduces brand-new high-level concepts beyond typical OS threading models. Khronos probably doesn't want to say it out loud, but its genesis is in NVIDIA's CUDA. If you want to see how it works today, download the CUDA SDK and start playing. If you don't have any NVIDIA GPUs, don't worry: there's a GPU-emulator software option. OpenCL is a handy abstraction of a GPU that should also apply to CPUs, DSPs and "accelerators" (Khronos' nickname for IBM's Cell BE and probably Intel's Larrabee).
OpenCL is not supposed to be "written directly in C99". It's referred to as a C99 extension since its syntax is similar/identical to C99 with some new keywords. You cannot call libc (or any other library) from a kernel.
You could use both, but theoretically OpenCL should be "better" (in that it's portable to more computing devices) if you're willing to port your code. You cannot use OpenMP pragmas in an OpenCL kernel.
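To make the "how do I connect an OpenCL kernel to a standard C99 host program" part concrete, here is a rough sketch against the OpenCL 1.x C API (error checking omitted for brevity; in a real program every cl_int return value should be checked):

    #include <stdio.h>
    #include <CL/cl.h>

    /* The kernel source lives in a string and is compiled at run time. */
    static const char *src =
        "__kernel void scale(__global float *x, float f) {\n"
        "    size_t i = get_global_id(0);\n"
        "    x[i] *= f;\n"
        "}\n";

    int main(void)
    {
        float data[1024];
        for (int i = 0; i < 1024; ++i) data[i] = (float)i;

        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

        /* Build the kernel from source and create a buffer holding the data. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", NULL);
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    sizeof(data), data, NULL);

        /* Set the arguments and launch one work-item per array element. */
        float factor = 2.0f;
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        clSetKernelArg(k, 1, sizeof(float), &factor);
        size_t global = 1024;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

        /* Read the result back into the host array. */
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
        printf("data[10] = %f\n", data[10]);   /* expect 20.0 */
        return 0;
    }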
See also:
http://wikipedia.org/wiki/OpenCL
CUDA
LLVM
For the most part OpenMP and OpenCL are independent from each other. They are both ways of giving the developer access to parallelism on their platform.
OpenMP is designed to work well with multiple (identical) processors, where work that is approximately equal can be (nearly) automatically farmed out between them.
OpenCL is a somewhat different beast, in that it really shines when working with special co-processor hardware. It will allow you to offload some of the heavy-duty number crunching to the GPU or to some other co-processor, such as the Cell. However, it was also built with the idea that it could be used to harness other main processors, as are now common in multi-core computers. I would consider this feature to be secondary, and if this is all you intend to use OpenCL for, I would not recommend it.
That said, I'd guess it would be somewhat challenging, though definitely not impossible to get OpenMP and OpenCL to work together in the same problem.
The first thing to think about is what work you're giving to OpenCL. This would definitely be a case where you would only want OpenCL to run on the GPU/co-processor, not on the other main processors/cores, since OpenMP is already using those. It wouldn't (shouldn't) cause application errors to run OpenCL and OpenMP on the same main processor, but it will cause undesirable scheduling where both the OpenMP and the OpenCL parts run slower because they spend a good chunk of their time switching back and forth between each other. This would also happen if you ran any other processor-hungry process on the same core at the same time.
The other big thing to think about is how you're going to schedule the tasks that do run on the co-processor. It's true that you can feed a lot of work into one of the modern GPUs, but there are lots of things to think about with the pipeline and memory usage. What you wouldn't want is to have 8 different OpenMP threads each trying to send their own work to the co-processor at the same time. I would recommend having only one thread that manages all the interactions with the co-processor, so it can make sure to feed it work in an efficient manner.
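A hypothetical sketch of that "one manager thread" pattern with OpenMP (submit_to_accelerator() and do_cpu_chunk() are placeholders, not real API calls): thread 0 keeps the co-processor fed while the remaining threads work through the CPU-side tasks.

    #include <omp.h>

    void submit_to_accelerator(int task);   /* placeholder: enqueue GPU/OpenCL work */
    void do_cpu_chunk(int task);            /* placeholder: CPU-side work           */

    void run(int gpu_tasks, int cpu_tasks)
    {
        #pragma omp parallel
        {
            int tid      = omp_get_thread_num();
            int nthreads = omp_get_num_threads();

            if (tid == 0) {
                /* The single dedicated thread that talks to the co-processor. */
                for (int t = 0; t < gpu_tasks; ++t)
                    submit_to_accelerator(t);
            } else {
                /* Hand-partition the CPU tasks among threads 1 .. nthreads-1. */
                for (int t = tid - 1; t < cpu_tasks; t += nthreads - 1)
                    do_cpu_chunk(t);
            }
        }
    }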
That said, I'm sure there are programs that have multiple types of tasks happening at the same time, where one type of task could always be farmed out to the Co-Processor and another kind of task could be handled by the multi-core main processor. This would be a fine example of a time to mix OpenMP and OpenCL.
Good Luck!
OpenCL is supposed to be written directly in C99 afaik? There are header files available now for it anyhow.
By the way, there is work on translating OpenMP to GPGPU code using CUDA.