I tried the following "hello world" code, first on my own system (8 cores) and then on a server (160 cores):
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    double t;
    t = MPI_Wtime();
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);
    //printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);
    printf("%f---%d---%s\n", MPI_Wtime() - t, rank, processor_name);
    usleep(500000); // make sure each process needs a noticeable amount of time to finish
    MPI_Finalize();
    return 0;
}
I run the program with 160 processes using mpirun -np 160 ./hello
I expected the server run to be more efficient, since it has a dedicated core available for each process from the start, but the result was the opposite:
8 cores : 2.25 sec
160 cores : 5.65 sec
Please correct me if I am confused about how cores are assigned to processes.
Also, please explain how the mapping is done by default. I know there are several ways to do it manually, either with a rankfile or with options related to socket/core affinity.
I want to know how processes are treated in Open MPI and how they are given resources by default.
You're not actually measuring anything here that would benefit from scale; the only thing you're measuring is startup time. In that case you would expect that starting more processes takes more time: the launcher has to start the processes, wire up the network connections, and so on. Also, both your laptop and the server have one process per core, so that doesn't change from one to the other.
A better way to test whether having more cores is more efficient is to do some sort of sample calculation and measure the speedup you get from adding cores. You could try the traditional PI calculation.
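For what it's worth, here is a minimal sketch of such a PI benchmark (this is my illustration, not code from the question or answer; the interval count N is an arbitrary choice):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, numprocs;
    const long N = 100000000;   /* number of intervals; an arbitrary illustrative value */
    double sum = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    double t = MPI_Wtime();
    double h = 1.0 / (double)N;
    /* each rank integrates every numprocs-th slice of 4/(1+x^2) on [0,1] */
    for (long i = rank; i < N; i += numprocs) {
        double x = h * ((double)i + 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    sum *= h;

    /* combine the partial sums on rank 0 */
    MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi ~= %.12f computed in %f s on %d processes\n",
               pi, MPI_Wtime() - t, numprocs);

    MPI_Finalize();
    return 0;
}

With a compute-bound loop like this, the per-process work shrinks as you add ranks, so the 160-core run should finally show the speedup that a startup-only test cannot.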
My PC has a 10th-gen Core i7 vPro with Hyper-Threading enabled: 8 physical cores plus 8 virtual cores (i7-10875H, Comet Lake).
The logical processors come in pairs, so core 1 hosts virtual cores 0 & 1, core 2 hosts virtual cores 2 & 3, and so on. I've noticed in Task Manager that the first member of each pair seems to be the preferred one, judging by its higher usage. I do set some affinities manually for certain heavy programs, but I always set them in groups of 4 (0-3, 4-7, 8-11, or 12-15) and never mix logical processors from different groups.
I'm wondering why this behaviour happens. Do the even-numbered entries correspond to the physical cores, which could be slightly faster? If so, would I get slightly better clock speeds with Hyper-Threading disabled when running programs that don't have a high thread count?
In general (for "scheduler theory"):
If you care about performance, spread the tasks across physical cores where possible. This prevents a "2 tasks run slower because they're sharing a physical core while a whole physical core sits idle" situation.
If you care about power consumption and not performance, make tasks use logical processors in the same physical core where possible. This may allow you to put entire cores into a very power-efficient "do nothing" state.
If you care about security (and not performance or power consumption), don't let unrelated tasks use logical processors in the same physical core at all, because information (such as which kinds of instructions are currently being executed) can be leaked from one logical processor to another logical processor in the same physical core. Note that it is fine for related tasks to share a physical core, e.g. two threads of the same process that trust each other, but not threads that belong to different processes and don't trust each other.
Of course a good OS would know the preference of each task (whether it cares about performance, power consumption or security) and would make intelligent decisions when handling a mixture of tasks with different preferences. Sadly there are no such operating systems: most operating systems and APIs were designed in the 1990s or earlier (back when SMP was just starting and all CPUs were identical anyway) and lack the information about tasks that would be needed to make intelligent decisions, so they assume performance is the only thing that matters for every task. That leads to the "spread tasks across physical cores where possible, even when it's not ideal" behaviour you're seeing.
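As a concrete illustration of the "spread across physical cores" point (this sketch is mine, not part of the original answer): if you want to script the kind of manual pinning mentioned in the question instead of using Task Manager, the Win32 affinity-mask API is one option. The mask value 0x5555 assumes the pairing described above (logical processors 0/1 on the first core, 2/3 on the second, and so on) and selects the first logical processor of each of the 8 cores:

#include <stdio.h>
#include <Windows.h>

int main(void) {
    /* Bit n of the mask allows the process to run on logical processor n.
       0x5555 = binary 0101010101010101 -> logical processors 0,2,4,...,14,
       i.e. one logical processor per physical core on an 8-core/16-thread CPU
       with the pairing described in the question. */
    DWORD_PTR mask = 0x5555;
    if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
        printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    printf("Pinned to the first logical processor of each physical core\n");
    return 0;
}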
My guess is that it's due to hyperthreading.
Hyperthreading doesn't double CPU capacity (according to Intel, it adds roughly 30% on average), so it makes sense to spread the work across physical cores first and use hyperthreading as a last resort, once overall CPU demand starts exceeding 50%.
Fun fact: on a hyperthreaded system, a reported 50% overall CPU load (all physical cores busy, their second logical processors idle) already corresponds to roughly 70% of the achievable throughput, and the remaining reported 50% only buys you the remaining ~30%.
If we query the OS to see how logical processors are assigned to cores¹, we will see a situation like this:
Core 0: mask 0x3
Core 1: mask 0xc
Core 2: mask 0x30
Core 3: mask 0xc0
. . .
That means logical processors 0 and 1 are on core 0, 2 and 3 on core 1, etc.
You can disable hyperthreading in the BIOS, but since it adds performance it's a nice-to-have feature; you just need to be careful not to pin work such that it ends up competing for the same core.
¹ To check the core assignment I use the small C program below. The information might also be available via WMIC.
#include <stdio.h>
#include <stdlib.h>
#undef _WIN32_WINNT
#define _WIN32_WINNT 0x0601  /* Windows 7+, needed for GetLogicalProcessorInformationEx */
#include <Windows.h>

int main() {
    DWORD len = 65536;
    BYTE *buf = (BYTE*)malloc(len);
    if (!GetLogicalProcessorInformationEx(RelationProcessorCore,
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf, &len)) {
        return GetLastError();
    }
    /* the buffer holds variable-length records; each record advances by info->Size */
    size_t core = 0;
    for (DWORD offset = 0; offset < len; ) {
        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info =
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(buf + offset);
        if (info->Relationship == RelationProcessorCore) {
            printf("Core %zu:", core++);
            for (WORD j = 0; j < info->Processor.GroupCount; j++)
                printf(" mask 0x%llx",
                       (unsigned long long)info->Processor.GroupMask[j].Mask);
            printf("\n");
        }
        offset += info->Size;
    }
    free(buf);
    return 0;
}
Why does this program print 64 and not 5000? The count variable is only updated inside the critical section, so I would expect that only one thread has access to it at any given time; each thread should therefore increment count and the result should be 5000. Why do I get 64 instead?
#include <iostream>
#include <cstdlib> // for system()
#include <omp.h>
using namespace std;
int main()
{
int count = 0;
omp_set_num_threads(5000);
#pragma omp parallel
{
#pragma omp critical
{
count++;
}
}
cout << "count = " << count << endl;
system("pause");
return 0;
}
As Michael Dussere points out, you're getting 64 as an answer because your implementation is only launching 64 threads. It may be using an internal default to limit the maximum number of threads; try varying the environment variable OMP_THREAD_LIMIT, or calling omp_get_thread_limit(), to see if that is the case.
The reason for such a limit is that creating threads requires resources: each thread has to have its own stack space, a process table entry on Linux, and so on. These aren't lightweight stateless Erlang-style threads that are scheduled in user space. On my 8-core system, using gcc or icpc, setting the thread number to 1024 or above simply fails for lack of resources, although tuning system parameters can shift that limit around.
Between the resources the threads require and the fact that most single-image systems have far fewer than 5000 cores, it's not clear what you'd accomplish with 5000 threads on most systems.
The value you can set with omp_set_num_threads is not unlimited.
It depends on the OpenMP implementation you use, the number of cores of your computer, and so on.
You get 64 because the current thread team ends up with 64 threads; you can check with omp_get_num_threads.
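A minimal sketch combining the checks suggested above — it prints the implementation's thread limit (omp_get_thread_limit needs an OpenMP 3.0 runtime) and the team size you actually get when asking for 5000 threads:

#include <iostream>
#include <omp.h>
using namespace std;

int main()
{
    // upper bound the runtime enforces, influenced by OMP_THREAD_LIMIT
    cout << "thread limit: " << omp_get_thread_limit() << endl;

    omp_set_num_threads(5000);
    #pragma omp parallel
    {
        #pragma omp single
        cout << "threads actually in the team: " << omp_get_num_threads() << endl;
    }
    return 0;
}

Running it with different OMP_THREAD_LIMIT values shows how the request gets capped.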
I'm looking into OpenCL, and I'm a little confused about why this kernel runs so slowly compared to how I would expect it to run. Here's the kernel:
__kernel void copy(
const __global char* pSrc,
__global __write_only char* pDst,
int length)
{
const int tid = get_global_id(0);
if(tid < length) {
pDst[tid] = pSrc[tid];
}
}
I've created the buffers in the following way:
char* out = new char[2048*2048];
cl::Buffer outBuffer(
context,
CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY,
length,
out);
Ditto for the input buffer, except that I've initialized the in pointer to random values. Finally, I run the kernel this way:
cl::Event event;
queue.enqueueNDRangeKernel(
kernel,
cl::NullRange,
cl::NDRange(length),
cl::NDRange(1),
NULL,
&event);
event.wait();
On average, the time is around 75 milliseconds, as calculated by:
cl_ulong startTime = event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong endTime = event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
std::cout << (endTime - startTime) * SECONDS_PER_NANO / SECONDS_PER_MILLI << "\n";
I'm running Windows 7 with an Intel i5-3450 chip (Ivy Bridge). For comparison, the "direct" way of doing the copy takes less than 5 milliseconds. I don't think event.getProfilingInfo includes the communication time between the host and the device. Thoughts?
EDIT:
At ananthonline's suggestion, I changed the kernel to use float4s instead of chars, and that dropped the average run time to about 50 ms. Still not as fast as I had hoped, but an improvement. Thanks, ananthonline!
I think your main problem is the 2048*2048 single-item work groups you are using. The OpenCL driver on your system has to manage a lot more overhead when you have that many one-item work groups. This would be especially bad if you were to run the program on a GPU, because you would get a very low level of saturation of the hardware.
Optimization: call your kernel with larger work groups; you don't even have to change your existing kernel. The obvious question is what this size should be. I have used 64 below as an example; 64 happens to be a decent number on most hardware.
size_t myOptimalGroupSize = 64;
cl::Event event;
queue.enqueueNDRangeKernel(
kernel,
cl::NullRange,
cl::NDRange(length),
cl::NDRange(myOptimalGroupSize),
NULL,
&event);
event.wait();
You should also make your kernel do more than copy a single value per work item; I have given an answer to a similar question about global memory access.
CPUs are very different from GPUs. Running this on an x86 CPU, the best way to get decent performance would be to use double16 (the largest data type) instead of char or float4 (as suggested by someone else).
In my limited experience with OpenCL on CPUs, I have never reached the performance levels I could get with an OpenMP parallelization.
The best way to do a copy in parallel on a CPU is to divide the block to copy into a small number of large sub-blocks and let each thread copy one sub-block.
The GPU approach is orthogonal: every thread participates in the copy of the same block.
This is because on GPUs different threads can access contiguous memory regions efficiently (coalescing).
To do an efficient copy on the CPU with OpenCL, use a loop inside your kernel to copy contiguous data, and use a work-group size no larger than the number of available cores (see the sketch below).
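To illustrate that last point, here is a hedged sketch of a loop-based copy kernel (the chunk parameter and the kernel name are my invention, not from the question); each work item copies one contiguous chunk, so you would launch only about as many work items as there are cores:

__kernel void copy_chunked(
    const __global char* pSrc,
    __global __write_only char* pDst,
    int length,
    int chunk)                     // elements per work item, e.g. length / number_of_work_items
{
    const int start = (int)get_global_id(0) * chunk;
    const int end = min(start + chunk, length);
    // each work item walks its own contiguous range, which suits the CPU's caches
    for (int i = start; i < end; i++) {
        pDst[i] = pSrc[i];
    }
}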
I believe it is the cl::NDRange(1) that tells the runtime to use single-item work groups, which is not efficient. In the C API you can pass NULL for the local work size to leave the work-group size up to the runtime; in the C++ API the equivalent is cl::NullRange. This should be faster on the CPU, and it certainly will be on a GPU.
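For example, reusing the question's queue, kernel and length variables (just a sketch of the call, not a complete program):

// let the runtime pick the work-group size instead of forcing single-item groups
cl::Event event;
queue.enqueueNDRangeKernel(
    kernel,
    cl::NullRange,         // offset
    cl::NDRange(length),   // global size
    cl::NullRange,         // local size left to the runtime
    NULL,
    &event);
event.wait();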
I will first give some background about the problem I'm having so you know what I'm trying to do. I have been helping out with the development of a certain software tool and found that we could benefit greatly from using OpenMP to parallelize some of its biggest loops. We parallelized the loops successfully, and with just two cores they executed 30% faster, which was an OK improvement. On the other hand, we noticed a weird phenomenon in a function that traverses a tree structure using recursive calls. The program actually slowed down here with OpenMP enabled, and the execution time of this function more than doubled. We thought that maybe the tree structure was not balanced enough for parallelization and commented out the OpenMP pragmas in this function, but that appeared to have no effect on the execution time. We are currently using GCC 4.4.6 with the -fopenmp flag for OpenMP support. And here is the current problem:
If we don't use any omp pragmas in the code, everything runs fine. But if we add just the following to the beginning of the program's main function, the execution time of the tree traversal function more than doubles, from 35 seconds to 75 seconds:
//beginning of main function
...
#pragma omp parallel
{
#pragma omp single
{}
}
//main function continues
...
Does anyone have any clue why this happens? I don't understand why the program slows down so much just from having the OpenMP pragmas present. If we remove all the omp pragmas, the execution time of the tree traversal function drops back to 35 seconds. My only guess right now is some sort of compiler bug, as I have no other explanation in mind.
Not everything that can be parallelized should be parallelized. If you use a single, only one thread executes it and the rest have to wait until the region is done: they can either spin-wait or sleep. Most implementations start out with a spin-wait, hoping that the single region will not take too long and that the waiting threads will see the completion sooner than if they were sleeping. Spin-waits eat up a lot of processor cycles. You can try specifying that the wait should be passive (the OMP_WAIT_POLICY environment variable), but this was only added in OpenMP 3.0 and is only a hint to the implementation, so it might have no effect. Basically, unless you have a lot of work in the parallel region to compensate for the single, the single is going to increase the parallel overhead substantially and may well make the region too expensive to parallelize.
First, OpenMP often reduces performance on the first try; it can be tricky to use omp parallel if you don't understand it inside out. I may be able to help if you can tell me a little more about the program structure, specifically the questions annotated with ???? below.
//beginning of main function
...
#pragma omp parallel
{
???? What goes here — is this a loop? If so, a for loop or a while loop?
#pragma omp single
{
???? What goes here, how long does it run?
}
}
//main function continues
....
???? Does the performance drop in this code, or somewhere else?
Thanks.
Thank you everyone. We were able to fix the issue today by linking with TCMalloc, one of the solutions ejd offered. The execution time dropped immediately, and we got around a 40% improvement in execution time over the non-threaded version using 2 cores. It seems that when using OpenMP on Unix with GCC, you should also pick a replacement for the standard memory allocator; otherwise the program may just slow down.
I did some more testing and wrote a small test program to check whether the issue could be related to memory operations. I was unable to reproduce the slowdown caused by an empty parallel-single region in my small test program, but I was able to reproduce a slowdown by parallelizing some malloc calls.
When running the test program on Windows 7 64-bit with 2 CPU cores, no noticeable slowdown was caused by compiling with the -fopenmp flag in gcc (g++) compared to running the program without OpenMP support.
Doing the same on Kubuntu 11.04 64-bit on the same computer, however, raised the execution time to over 4 times that of the non-OpenMP version. This issue only seems to appear on Unix systems, not on Windows.
The source of my test program is below. I have also uploaded a zipped source for the Windows and Unix versions, as well as the assembly output for both, with and without OpenMP support. The zip can be downloaded here: http://www.2shared.com/file/0thqReHk/omp_speed_test_2011_05_11.html
#include <stdio.h>
#include <windows.h>
#include <list>
#include <sys/time.h>
//#include <cstdlib>
using namespace std;
int main(int argc, char* argv[])
{
// #pragma omp parallel
// #pragma omp single
// {}
int start = GetTickCount();
/*
struct timeval begin, end;
int usecs;
gettimeofday(&begin, NULL);
*/
list<void *> pointers;
#pragma omp parallel for default(shared)
for(int i=0; i< 10000; i++)
{
void *p = malloc(20000); // the allocation is the part being exercised in parallel
//void *p = calloc(20000, sizeof(void *));
#pragma omp critical // std::list::push_back is not thread-safe, so serialize the append
pointers.push_back(p);
}
for(list<void *>::iterator i = pointers.begin(); i!= pointers.end(); i++)
free(*i);
/*
gettimeofday(&end, NULL);
if (end.tv_usec < begin.tv_usec) {
end.tv_usec += 1000000;
begin.tv_sec += 1;
}
usecs = (end.tv_sec - begin.tv_sec) * 1000000;
usecs += (end.tv_usec - begin.tv_usec);
*/
printf("It took %d milliseconds to finish the memory operations", GetTickCount() - start);
//printf("It took %d milliseconds to finish the memory operations", usecs/1000);
return 0;
}
What remains unanswered now is: what can I do to avoid issues like this on the Unix platform?
I want to use MPI to parallelize a function that is called multiple times in my code. What I wanted to know is: if I call MPI_Init inside the function, will the processes be spawned every time the function is called, or will the spawning happen only once? Is there a known design pattern for doing this in a systematic way?
The MPI_Init() call just initialises the MPI environment; it doesn't do any parallelisation itself. The parallelism comes from how you write the program.
In a parallel "Hello, World", for example, the printf() does different things depending on which rank (process) it's running on. The number of processes is determined by how you execute the program (e.g. it is set via the -n parameter to mpiexec or mpirun):
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
char name[BUFSIZ];
int length=BUFSIZ;
int rank;
int numprocesses;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocesses);
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Get_processor_name(name, &length);
printf("%s, Rank %d of %d: hello world\n", name, rank, numprocesses);
MPI_Finalize();
return 0;
}
That's not really the way MPI (or distributed-memory programming) works; you can't just parallelize a function the way you can with something like OpenMP. In MPI, processes aren't spawned at the time of MPI_Init(), but at the time the executable is launched (e.g. with mpiexec); this is true even with MPI_Comm_spawn(). Part of the reason is that in distributed-memory computing, launching processes on a potentially large number of share-nothing nodes is a very expensive task.
You could cobble something together by having the function you're calling live in a separate executable, but I'm not sure that's what you want.
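If you do end up going the separate-executable route, MPI_Comm_spawn() is the relevant call. Here is a minimal sketch (the "./worker" executable name and the count of 4 processes are placeholders), keeping in mind the caveat above that dynamic process creation is expensive:

#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Comm workers;

    MPI_Init(&argc, &argv);

    /* launch 4 instances of a separate executable ("./worker" is a placeholder)
       and obtain an intercommunicator that connects this process to them */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &workers, MPI_ERRCODES_IGNORE);

    /* ... exchange data with the workers over the intercommunicator ... */

    MPI_Comm_disconnect(&workers);
    MPI_Finalize();
    return 0;
}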