I'm doing a performance evaluation using two different vanilla Linux kernels, 2.6.22 and 2.6.31, since I assume each of them uses a different scheduling mechanism: 2.6.22 uses the old O(1) scheduler, whilst 2.6.31 uses CFS. Could anybody confirm whether this assumption is correct?
CFS was introduced in linux-2.6.23 ( http://kernelnewbies.org/Linux_2_6_23#head-f3a847a5aace97932f838027c93121321a6499e7 ), so your assumption is correct. Later kernels carry additional patches to it, though. IMHO, you should try a newer kernel if you want to do profiling.
The vast majority of CPUs coming out nowadays contain multiple cores which can operate at the same time - in parallel.
I'm just wondering, from the point of executing a program as quickly as possible using all available CPU cores, does a programmer need to take into consideration that the software being developed will be running on a multi-core CPU? For instance, would the software being developed have to be manually configured to assign different tasks to each CPU core? Or does the OS/CPU automatically identify and choose which parts of a program can run - in parallel - on different cores?
Apologies if this seems like a simple or silly question. I'm completely new to the topic of parallel programming, and I've come across some conflicting information early on in my research: some sources state that the programmer must manually configure their software in order to utilise more than one CPU core (the more believable option in my opinion), while other sources state that the OS/CPU automatically identifies and chooses which tasks can be run in parallel on different CPU cores (the less believable option in my opinion, due to the complexity involved in automatically identifying this).
Just in case different Operating Systems, CPUs or Programming Languages perform differently in a parallel computing or multi-core environment - I will be using Windows 7 as my OS, an Intel Dual Core i7 Processor, and OpenCL as the programming language.
Any help is much appreciated.
In practice this occurs semi-automatically.
A more detailed answer will depend on the nature of your application, your preferred programming model, and your target architecture.
More explanation:
In order to exploit multicore hardware efficiently (in your case, keeping as many cores busy as possible), you first of all need to (1) "parallelize" the algorithm itself - make it "concurrent" - and (2) use one of the multi-threading (most often) or multi-process (rarer) parallel programming APIs, for example "OpenMP", "Intel TBB", "OpenCL", "POSIX Threads" or (for multi-process) "MPI", in order to efficiently, and often automatically, assign different "pieces" of your concurrent program to different threads (or, in the rare case, processes).
One of the simplest possible examples of this kind of parallel programming (using OpenMP) is given here.
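In case that link is unavailable, a minimal sketch of the same idea looks roughly like the following; the array, its size and the reduction are my own illustration, not taken from the linked example:

    /* Minimal OpenMP sketch: a loop over independent iterations is split
       across threads by a single pragma. Compile with e.g. gcc -fopenmp. */
    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];
        double sum = 0.0;

        /* Each thread automatically gets a chunk of the iteration space;
           the partial sums are combined by the reduction clause. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = i * 0.5;
            sum += a[i];
        }

        printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
        return 0;
    }

The point is that the loop body itself is unchanged; the concurrency comes from the algorithm having independent iterations, plus one pragma.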
Now, you've said that you are using OpenCL as a programming model for the CPU. In certain cases, when you use vendor-provided OpenCL implementations (like Intel OpenCL), you can semi-automatically assign an OpenCL kernel to be executed by various threads using "NDRange" and other OpenCL concepts, as explained here for the Intel Xeon Phi co-processor (not exactly CPU programming, but a similar idea) or here (a more general, but more advanced article).
However, using OpenCL as a general-purpose multi-threading programming API for the CPU is definitely not the simplest approach, and it is not always optimal in terms of final performance. There are certain application types where OpenCL makes some sense for general-purpose CPU multi-threading, but again it very much depends on the nature of your algorithm and on the target architecture.
There is a fairly dated, but still reasonable, post about OpenCL vs. OpenMP/TBB on Stack Overflow. It is dated in the sense that OpenMP 4.0 now also provides solid capabilities for combined threading + SIMD programming (which may interest you later if you explore this topic in more detail). That's why I would say that OpenMP seems to be the number-one choice nowadays, but TBB, MPI or OpenCL might also be appropriate in certain cases.
If I am running R on Linux or on a Mac, I can detect the number of available cores using multicore:::detectCores(). However, there's no Windows version of the multicore functions, so I can't use this technique on Windows.
How can I programmatically detect the number of cores on a Windows machine, from within R?
The parallel package now has a function to detect the number of cores: parallel:::detectCores().
This thread has a number of suggestions, including:
Sys.getenv('NUMBER_OF_PROCESSORS')
Note also the posting in that thread by Prof. Ripley, which speaks to the difficulties of doing this.
If you actually need to distinguish between physical cores, chips, and logical processors, the API to call is GetLogicalProcessorInformation.
Use GetSystemInfo if you just want to know how many logical processors are on a machine (with no differentiation for hyperthreading).
How you call this from "R" is beyond me. But I'd guess R has a facility for invoking code from native Windows DLLs.
GetSystemInfo will give you a structure that has the number of "processors", which corresponds to the total number of cores.
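For what it's worth, a minimal C sketch of that call could look like the one below; the ncores() wrapper name is mine, and wiring it up to R (for example through R's foreign-function interface) is a separate exercise:

    /* Win32 sketch: query the number of logical processors.
       The wrapper name ncores() is illustrative only. */
    #include <windows.h>
    #include <stdio.h>

    static DWORD ncores(void)
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        return si.dwNumberOfProcessors;   /* logical processors */
    }

    int main(void)
    {
        printf("logical processors: %lu\n", (unsigned long)ncores());
        return 0;
    }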
In theory, it will be the same value as the environment variable recommended in another answer, but the user can tamper with (or delete) the environment variable. That can be a bug or a feature depending on your intent.
I have a large scientific computing task that parallelizes very well with SMP, but at too fine grained a level to be easily parallelized via explicit message passing. I'd like to parallelize it across address spaces and physical machines. Is it feasible to create a scheduler that would parallelize already multithreaded code across multiple physical computers under the following conditions:
The code is already multithreaded and can scale pretty well on SMP configurations.
The fact that not all of the threads are running in the same address space or on the same physical machine must be transparent to the program, even if this comes at a significant performance penalty in some use cases.
You may assume that all of the physical machines involved are running operating systems and CPU architectures that are binary compatible.
Things like locks and atomic operations may be slow (having network latency to deal with and all) but must "just work".
Edits:
I only care about throughput, not latency.
I'm using the D programming language, and I'm almost sure there's no canned solution. I'm more interested in whether this is feasible in principle than in a particular canned solution.
My first thought is to use Apache Hadoop. It provides distributed storage and distributed computing. You can synchronize across processes by using files as locks.
It sounds like you want something like SCRAMNet, although that requires custom hardware. I don't know if there is a software-only solution. Also, it's likely that even if you got it working, you'd find your networked version was actually running slower than when it was previously on a single machine. You may just have to bite the bullet and re-design your app.
Since your point 2 suggests that you can live with some performance degradation, you might want to consider a hybrid approach: SMP within individual machines, message passing between machines. I'm not familiar with D, so I can offer no specific advice. Further, I've seen mixed reviews of the hybrid OpenMP+MPI approach, but it might suit you and your application.
EDIT: You might want to Google around for 'partitioned global address space' which seems to describe your desired approach quite accurately. As before, I have no advice on using D for this.
I want to write a code converter that takes an OpenMP based parallel program and runs it on a cluster.
How do I go about this problem? What libraries do I use? How do I set up a small cluster for this?
I'm finding it extremely hard to find good material about cluster computing on the internet.
EDIT: If it's impossible, then how does Intel do it? The Intel compiler seems to do exactly what I want. I don't have any specific application that I would like to run; I want to write the "converter/compiler", not the application. I understand that shared memory is different from distributed memory, but there has to be a way to sync memory, if not for all cases, then for some specific cases, even if it means that the application is written with custom constructs.
Intel has an implementation of OpenMP that works with their C++ and Fortran compilers for x86 64-bit clusters. You can get a 30-day eval version of these compilers for free. Other than that, Zifre is mostly right. If you are concerned with scalability, bite the bullet and write your parallel program in another programming model (MPI, CUDA, Cilk, ...) which is designed with distributed systems in mind. If you provide a little more information about your application, we may be able to provide more useful guidance on that front.
It seems to me that this is not a good idea.
The basic idea behind OpenMP is shared-data parallel execution. It works well when accessing shared data costs you nothing: every thread can access a variable in shared cache or RAM.
Cluster computations exploit message passing, because the computers in a cluster have distributed memory. When one process needs data from another, you have to manage the data transfer over the network, and that is a time-consuming operation.
So, if you want to write such a compiler, you have to implement data broadcasting operations (e.g. MPI_Bcast from MPI) for each shared data access in OpenMP. This will kill parallel performance altogether.
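To make the cost concrete, here is a rough sketch of what a single "shared" update turns into once memory is distributed; the array and its size are illustrative only:

    /* Sketch: under OpenMP the other threads would simply see the new data;
       with distributed memory the owner has to broadcast it over the network. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        double data[1024] = {0};
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < 1024; i++)   /* rank 0 produces the data */
                data[i] = i * 0.5;
        }

        /* Every rank pays network latency and bandwidth for this update. */
        MPI_Bcast(data, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        printf("rank %d sees data[10] = %f\n", rank, data[10]);
        MPI_Finalize();
        return 0;
    }

Doing something like this behind every OpenMP shared access is exactly the overhead warned about above.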
This is simply not possible. You have to structure your code in a completely different way to get it to work on a cluster (programming multiple machines is very different from programming one machine).
There is no magic pixie dust to do this.
On the other hand, if you write your program with clusters in mind, it is possible to run it on a single machine (although it will obviously be slower).
SCORE/SCASH and Omni OpenMP compiler
The 1.0 spec for OpenCL just came out a few days ago (Spec is here) and I've just started to read through it. I want to know if it plays well with other high performance multiprocessing APIs like OpenMP (spec) and I want to know what I should learn. So, here are my basic questions:
If I am already using OpenMP, will that break OpenCL or vice-versa?
Is OpenCL more powerful than OpenMP? Or are they intended to be complementary?
Is there a standard way of connecting an OpenCL program to a standard C99 program (or any other language)? What is it?
Does anyone know if anyone is writing an OpenCL book? I'm reading the spec, but I've found books to be more helpful.
OpenMP and OpenCL are distinct, but can be made to work together. Neither of them should "break" the other.
For the sake of argument, let's assume there's a tradeoff between minimizing changes to an existing codebase and performance or computing power. OMP is "easy" in that you can apply it "magically" to embarrassingly parallel problems with a quick pragma or two.
OpenCL introduces brand new high-level concepts beyond typical OS threading models. Khronos probably doesn't want to say it out loud, but its genesis is in NVIDIA's CUDA. If you want to see how it works today, download the CUDA SDK and start playing. If you don't have any NVIDIA GPUs, don't worry, there's a GPU-emulator software option. OpenCL is a handy abstraction of a GPU that should apply to CPUs, DSPs, and "accelerators" (Khronos' nickname for IBM's CellBE and probably Intel's Larrabee).
OpenCL is not supposed to be "written directly in C99". It's referred to as a C99 extension since its syntax is similar/identical to C99 with some new keywords. You cannot call libc (or any other library) from a kernel.
You could use both, but theoretically, OpenCL should be "better" (in that it's portable to more computing devices) if you're willing to port your code. You cannot use OpenMP pragmas in an OpenCL kernel.
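To illustrate the "C99 plus new keywords" point, a trivial OpenCL kernel might look like the sketch below (my own illustration); the host program, written in ordinary C99 or any language with bindings, loads this source and launches it through the OpenCL host API (clCreateProgramWithSource, clEnqueueNDRangeKernel and friends), which is how the two sides are connected:

    /* Illustrative OpenCL kernel: C99-like syntax plus OpenCL keywords
       (__kernel, __global, get_global_id). It is compiled at run time by
       the OpenCL driver and cannot call libc. */
    __kernel void vadd(__global const float *a,
                       __global const float *b,
                       __global float *c)
    {
        size_t i = get_global_id(0);   /* one work-item per element */
        c[i] = a[i] + b[i];
    }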
See also:
http://wikipedia.org/wiki/OpenCL
CUDA
LLVM
For the most part, OpenMP and OpenCL are independent of each other. They are both ways of giving the developer access to parallelism on their platform.
OpenMP is designed to work well with multiple (identical) processors, where work that is approximately equal can be (nearly) automatically farmed out between them.
OpenCL is a somewhat different beast, in that it really shines when working with special co-processor hardware. It allows you to offload some of the heavy-duty number crunching to the GPU or some other co-processor, like the Cell. However, it was also built with the idea that it could be used to harness the main processors that are now common in multi-core computers. I would consider this feature secondary, and if this is all you intend to use OpenCL for, I would not recommend using OpenCL.
That said, I'd guess it would be somewhat challenging, though definitely not impossible, to get OpenMP and OpenCL to work together on the same problem.
The first thing to think about is what work you're giving to OpenCL. This would definitely be a case where you would only want OpenCL to run on the GPU/co-processor, not on the other main processors/cores, since OpenMP is already using those. It wouldn't (shouldn't) cause application errors to run OpenCL and OpenMP on the same main processor, but it will cause undesirable scheduling, where both the OpenMP and OpenCL parts run slower because they spend a good chunk of their time switching back and forth between each other. This would also happen if you ran any other processor-hungry process on the same core at the same time.
The other big thing to think about is how you're going to schedule the tasks that do run on the co-processor. It's true that you can feed a lot of work into one of the modern GPUs, but there are lots of things to think about with the pipeline and memory usage. What you wouldn't want is to have 8 different OpenMP threads each trying to send their own work to the co-processor at the same time. I would recommend having only one thread that manages all the interactions with the co-processor, so it can make sure to feed it work in an efficient manner.
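One way to structure that, sketched below with OpenMP (submit_to_coprocessor() and do_cpu_work() are placeholders for real OpenCL host calls and CPU-side computation, not actual API functions), is to dedicate exactly one thread to the device while the rest share the CPU work:

    /* Sketch: one thread owns all co-processor interaction, the others
       (and later the manager too) share the CPU-side jobs. */
    #include <stdio.h>
    #include <omp.h>

    enum { NJOBS = 64 };

    static void submit_to_coprocessor(int job) { (void)job; /* enqueue OpenCL work here */ }
    static void do_cpu_work(int job)           { (void)job; /* CPU-side computation */ }

    int main(void)
    {
        #pragma omp parallel
        {
            /* Exactly one thread becomes the device manager; "nowait"
               lets the other threads start on CPU work immediately. */
            #pragma omp single nowait
            {
                for (int j = 0; j < NJOBS; j++)
                    submit_to_coprocessor(j);
            }

            /* All threads (the manager too, once it finishes) share these. */
            #pragma omp for schedule(dynamic)
            for (int j = 0; j < NJOBS; j++)
                do_cpu_work(j);
        }
        puts("done");
        return 0;
    }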
That said, I'm sure there are programs that have multiple types of tasks happening at the same time, where one type of task could always be farmed out to the Co-Processor and another kind of task could be handled by the multi-core main processor. This would be a fine example of a time to mix OpenMP and OpenCL.
Good Luck!
OpenCL is supposed to be written directly in C99, AFAIK? There are header files available for it now, anyhow.
By the way, there is work on translating OpenMP to GPGPU code using CUDA.