why is mpi slower on my laptop - performance

I am running MPI on my laptop (intel i7 quad core 4700m 12Gb RAM) and the efficiency drops even for codes that involve no inter-process communication. Obviously I cannot just throw 100 processes at it since my machine is only quad-core, but I thought that it should scale well up to 8 process (intel quad core simulates as 8???). For example consider the simple toy Fortran code:
program test
implicit none
integer, parameter :: root=0
integer :: ierr,rank,nproc,tt,i
integer :: n=100000
real :: s=0.0,tstart,tend
complex, dimension(100000/nproc) :: u=2.0,v=0.0
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD,nproc,ierr)
call cpu_time(tstart)
do tt=1,200000
v=0.0
do i=1,100000/nproc
v(i) = v(i) + 0.1*u(i)
enddo
enddo
call cpu_time(tend)
if (rank==root) then
print *, 'total time was: ',tend-tstart
endif
call MPI_FINALIZE(ierr)
end subroutine test2
For 2 processes it takes half the time, but even trying 4 processes (should be quarter of the time?) the result begins to become less efficient and for 8 processes there is no improvement whatsoever. Basically I am wondering if this is just because I am running on a laptop and has something to do with shared memory, or if I am making some fundamental mistake in my code. Thanks
Note: In the above example I manually change the nproc in the array declaration and the inner loop to be equal to the number of processors I am using.

A quad core processor, thanks to hyperthreading shows itself as having 8 threads, but physically they are just 4 cores. The other 4 are scheduled by the hardware itself using the free slots in the execution pipelines.
It happens that especially with compute intensive loads this approach does not pay at all, being often counter-productive too on extreme loads because of overheads and not always optimized cache usage.
You can try to disable hyperthreading in the BIOS and compare it: you will have just 4 threads, 4 cores.
Even going from 1 to 4 there are resources that are being in competition. In particular each core has its own L1 cache, but each pair of cores shares the L2 cache (2x256KB) and the 4 cores share the L3 cache.
And all the cores obviously share the memory channels.
So you cannot expect to have linear scaling occupying more and more cores, since they will have to balance the usage of the resources, that are dedicated to one core/one thread in the sequential case.
All of this without involving communications at all.
The same behavior happens on desktops/servers, in particular for memory-intensive loads, as the one in your test case.
For example it's less evident with matrix-matrix multiplies, that is compute-intensive: for a NxN matrix, you have O(N^2) memory accesses but O(N^3) floating point operations.

Related

Does the using of SIMD load main CPU registers?

Let's imagine we have software developer that's goal is achieve absolute maximum of CPU's performance.
In today's CPUs we have many cores, we can load data in cache for faster processing and we also have SIMD instructions (AVX for example) that allow us to sum\multiply\do other ops with array of items (multiply 8 integers per one CPU clock). The disadvantage of this instruction is the cost of sending data & instructions to SIMD module + overhead of converting vector type to primitive types (sorry I familiar only with C#'s Vector) (We not looling on code complexety for now).
As far as I understand, while we using SIMD, main registers of CPU used only for sending and recieving data to this registers and main ALU blocks used for general purpose calculations are idle at this time.
And here is my question - will using of SIMD instructions load main CPU blocks? For example if we have huge amount of different calculations (let's imagine 40% of them are best to run on SIMD and 60% of them are better to run as a usual), will SIMD allow us to gain performance boost in this way: 100% of all cores performace + n% of SIMD's boost performance?
I'm asking this question because of for example with GPGPU we can use GPU for parallel calculations and CPU used in this case only for sending and recieving data, so it's idle all the time and we can utilize it's performance for sensitive for latency tasks.
Looks like this is a question about Out-Of-Order-Execution? Modern x64 have a number of execution ports on the CPU, and each can dispatch a new instruction per clock cycle (so about 8 CPU ops can run in parallel on an Intel SkyLake). Some of those ports handle memory loads/stores, some handle integer arithmetic, and some handle the SIMD instructions.
So for example, you may be able to displatch 2 AVX float mults, an AVX bitwise op, 2 AVX loads, a single AVX store, and a couple of bits of pointer arithmetic on the general purpose registers in a single cycle [you will have to wait for the operation to complete - the latency]. So in theory, as long as there aren't horrific dependency chains in the code, with some care you should able to keep each of those ports busy (or at least, that's the basic aim!).
Simple Rule 1: The busier you can keep the execution ports, the faster your code goes. This should be self evident. If you can keep 8 ports busy, you're doing 8 times more than if you can only keep 1 busy. In general though, it's mostly not worth worrying about (yes, there are always exceptions to the rule)
Simple Rule 2: When the SIMD execution ports are in use, the ALU doesn't suddenly become idle [A slight terminology error on your part here: The ALU is simply the bit of the CPU that does arithmetic. The computation for general purpose ops is done on an ALU, but it's also correct to call a SIMD unit an ALU. What you meant to ask is: do the general purpose parts of the CPU power down when SIMD units are in use? To which the answer is no... ]. Consider this AVX2 optimised method (which does nothing interesting!)
#include <immintrin.h>
typedef __m256 float8;
#define mul8f _mm256_mul_ps
void computeThing(float8 a[], float8 b[], float8 c[], int count)
{
for(int i = 0; i < count; ++i)
{
a[i] = mul8f(a[i], b[i]);
b[i] = mul8f(b[i], c[i]);
}
}
Since there are no dependencies between a, b, and c (which I should really be explicit about by specifying __restrict), then the two SIMD multiply instructions can both be dispatched in a single clock cycle (since there are two execution ports that can handle floating point multiply).
The General Purpose ALU doesn't suddenly power down here - The general purpose registers & instructions are still being used!
1. to compute memory addresses (for: a[i], b[i], c[i], d[i])
2. to load/store into those memory locations
3. to increment the loop counter
4. to test if the count has been reached?
It just so happens that we are also making use of the SIMD units to do a couple of multiplications...
Simple Rule 3: For floating point operations, using 'float' or '__m256' makes next to no difference. The same CPU hardware used to compute either float or float8 types is exactly the same. There are simply a couple of bits in the machine code encoding that specifies the choice between float/__m128/__m256.
i.e. https://godbolt.org/z/xTcLrf

How can CPU's have FLOPS much higher than their clock speeds?

For example, a modern i7-8700k can supposedly do ~60 GFLOPS (single-precision, source) while its maximum frequency is 4.7GHz. As far as I am aware, an instruction has to take at least one cycle to complete, so how is this possible?
There are multiple factors that are all multiplied together for this large effect:
SIMD, Intel 8700k and similar processors support AVX and AVX2, which includes many instructions that operate on registers that can hold 8 floats at the same time.
multiple cores, 8700k has 6 cores.
fused multiply-add, part of AVX2, has both a multiplication and addition in the same instruction.
high throughput execution. The latency (time an individual instruction takes) is not directly important to how much computation a processor can do in a unit of time. A modern CPU such as 8700k can start executing two (independent) FMAs in the same cycle (and keep in mind these are still SIMD instructions so that represents a lot of floating point operations) even through the latency of the operation is actually 4 cycles.
Multiplying all those factors together we get: 8 * 6 * 2 * 2 * 4.3 = 825 GFLOPS (matching the stats reported here). This calculation certainly does not mean that it can actually be attained. For example the processor may downclock significantly under such a workload in order to stay within its power budget, which is what Intel has been doing at least since Haswell (though the specifics have changed and it applied to server parts). Also, most real code has significant trouble feeding that many FMAs with data. Large matrix multiplications can get close though, and for example according to these stats the 8700k reached 496.7 Gflops in their SGEMM benchmark. Possibly the 8700k's max AVX2 turbo speed on 6 cores is 2.6GHz but as far as I can find it does not have an AVX offset by default (only needed when overclocked), or that GEMM is just not that close to hitting peak FLOPS.

How does OpenCL distribute work items?

I'm testing and comparing GPU speed up with different numbers of work-items (no work-groups). The kernel I'm using is a very simple but long operation. When I test with multiple work-items, I use a barrier function and split the work in smaller chunks to get the same result as with just one work-item. I measure the kernel execution time using cl_event and the results are the following:
1 work-item: 35735 ms
2 work-items: 11822 ms (3 times faster than with 1 work-item)
10 work-items: 2380 ms (5 times faster than with 2 work-items)
100 work-items: 239 ms (10 times faster than with 10 work-items)
200 work-items: 122 ms (2 times faster than with 100 work-items)
CPU takes about 580 ms on average to do the same operation.
The only result I don't understand and can't explain is the one with 2 work items. I would expect the speed up to be about 2 times faster compared to the result with just one work item, so why is it 3?
I'm trying to make sense of these numbers by looking at how these work-items were distributed on processing elements. I'm assuming if I have just one kernel, only one compute unit (or multiprocessor) will be activated and the work items distributed on all processing elements (or CUDA cores) of that compute unit. What I'm also not sure about is whether a processing element can process multiple work-items at the same time, or is it just one work-item per processing element?
CL_DEVICE_MAX_WORK_ITEM_SIZES are 1024 / 1024 / 64 and CL_DEVICE_MAX_WORK_GROUP_SIZE 1024. Since I'm using just one dimension, does that mean I can have 1024 work-items running at the same time per processing element or per compute unit? When I tried with 1000 work-items, the result was a smaller number so I figured not all of them got executed, but why would that be?
My GPU info: Nvidia GeForce GT 525M, 96 CUDA cores (2 compute units, 48 CUDA cores per unit)
The only result I don't understand and can't explain is the one with 2
work items. I would expect the speed up to be about 2 times faster
compared to the result with just one work item, so why is it 3?
The exact reasons will probably be hard to pin down, but here are a few suggestions:
GPUs aren't optimised at all for small numbers of work items. Benchmarking that end of the scale isn't especially useful.
35 seconds is a very long time for a GPU. Your GPU probably has other things to do, so your work-item is probably being interrupted many times, with its context saved and resumed every time.
It will depend very much on your algorithm. For example, if your kernel uses local memory, or a work-size dependent amount of private memory, it might "spill" to global memory, which will slow things down.
Depending on your kernel's memory access patterns, you might be running into the effects of read/write coalescing. More work items means fewer memory accesses.
What I'm also not sure about is whether a processing element can process multiple work-items at the same time, or is it just one work-item per processing element?
Most GPU hardware supports a form of SMT to hide memory access latency. So a compute core will have up to some fixed number of work items in-flight at a time, and if one of them is blocked waiting for a memory access or barrier, the core will continue executing commands on another work item. Note that the maximum number of simultaneous threads can be further limited if your kernel uses a lot of local memory or private registers, because those are a finite resource shared by all cores in a compute unit.
Work-groups will normally run on only one compute unit at a time, because local memory and barriers don't work across units. So you don't want to make your groups too large.
One final note: compute hardware tends to be grouped in powers of 2, so it's usually a good idea to make your work group sizes a multiple of e.g. 16 or 64. 1000 is neither, which usually means some cores will be doing nothing.
When I tried with 1000 work-items, the result was a smaller number so I figured not all of them got executed, but why would that be?
Please be more precise in this question, it's not clear what you're asking.

Calculate CPU cycles and its handling bits per second

I've wondered about the amount of bits that a given CPU can handles and how should I manage to calculate this specific value.
I wanted to make sure that my thought and also my calculation are right.
Given that I have 64 bit CPU which feature 2.3 Ghz. The amount of bits that being handled per second should be the following :
(2.3 * 10^9) * 64
Is it that simple or I need consider any other variables?
Calculating throughput of a CPU is not as simple. Modern CPUs are pipelined and can execute multiple instructions concurrently if there are no data dependencies, yielding instructions per cycle (IPC) > 1. Some instructions may also have a higher latency than one cycle, meaning IPC can be < 1.
For some operations they might be constrained by memory bandwidth.
And then there are SIMD instructions which can process more data per instruction than the width of a single general purpose register.
http://www.agner.org/optimize/instruction_tables.pdf
https://en.wikipedia.org/wiki/Instruction_pipelining

Are not all processors created equal?

My laptop has 4 logical processors (two physical); logical CPUs 1 and 2 map to core 1, and logical CPUs 3 and 4 map to core 2 (verified with GetLogicalProcessorInformation()).
I ran a multithreaded matrix multiplication program on my computer with two threads. The first time, I used SetProcessAffinityMask(hProcess, 0x5) (which means logical processors 1 and 3) while the second time I used SetProcessAffinityMask(hProcess, 0xA) (logical processors 2 and 4).
It turned out that the first version was about twice as fast as the second version, as though I'd never multithreaded the second version anyway.
Does anyone have any guesses as to why this might be happening?
Measurements:
Plugged in (full CPU):
Affinity mask: 0x3 (0011b), 9 gflop/s
Affinity mask: 0x5 (0101b), 17 gflop/s
Affinity mask: 0x6 (0110b), 17 gflop/s
Affinity mask: 0x9 (1001b), 9 gflop/s
Affinity mask: 0xA (1010b), 9 gflop/s
Affinity mask: 0xC (1100b), 9 gflop/s
On battery (clocked down):
Affinity mask: 0x3 (0011b), 5 gflop/s
Affinity mask: 0x5 (0101b), 10 gflop/s
Affinity mask: 0x6 (0110b), 10 gflop/s
Affinity mask: 0x9 (1001b), 5 gflop/s
Affinity mask: 0xA (1010b), 2 gflop/s
(--> Very interesting, why half speed when on battery but normal speed on AC?! this one varies a lot between 1.5-2.5 gflop/s, unlike the others.)
Affinity mask: 0xC (1100b), 5 gflop/s
Does this imply that the fourth logical CPU is not doing anything (!)? (Everything with the mask for the fourth CPU set is slow.)
Update:
I just ran the same thing on the High Performance profile on batteries. The results are inconsistent: This time, I got 2x speedup for the masks 5, 6, and 10, but there was no speedup for mask 12. I'll try to run the tests again on AC power, but ultimately it seems like this result is a combination of power management, Turbo Boost, scheduling inconsistencies, etc., and it's more difficult to measure than I previously thought. :(
SetProcessAffinityMask() does not guarantee you will have one thread per core; only that the threads you have will run on the cores you have allowed.
Perhaps the OS is scheduling differently.
Also, I'm surprised 1 and 2 are on core 1. Usually, logical processor numbers interleave over physical cores, to provide an inherent load balancing. I would expect 1 and 3 to be on core 1, 2 and 4 to be on core 2.
No, not all cores are equal. Only one is the boot core. Furthermore, in many cases all IRQs (or at least IRQs from a majority of the devices) are directed to a single core.
More important to your observed behavior, not all sets of cores are equal. In a NUMA memory architecture (which have been relatively mainstream in x86 since Intel Hyperthreading and AMD Opteron), there's an ideal group of processors which can efficiently access a particular region of memory, and all other processors will pay a significant penalty to access that range.
With Hyperthreading, it's not main system memory that's connected non-uniformly, but L1 and L2 cache. If your process migrates between the two virtual processors associated to the same physical core, the cache remains valid. But if it migrates to the other physical core, cached data has to be copied and ownership transferred to the other cache. For some workloads, this could make a big difference.
It would be good to know what physical CPU this is, but I'm assuming from your phrasing about logical processors that there is 1 physical socket, 2 CPU cores, and hyperthreading is enabled giving you 4 logical processors.
The short answer is, for this complicated definition of "processor", no, not all processors are created equal. Hyperthreaded logical cores share execution resources, and if there's contention for those resources they won't be fast as separate physical cores. This sharing can take place at different levels for both hyperthreading and multicore processors (ALU, execution resources, cache at different levels, etc) but in broad terms, physical cores in the same socket won't be affected much by what the other core(s) is/are doing, and logical cores implemented by hyperthreading will be hugely affected by what their hypertwin is doing.
Another difference between different CPUs: As Ben said, your OS may process most hardware interrupts on a single CPU, which means that CPU will seem slower for other purposes, but I'd be surprised if the interrupt load is enough to impact performance anywhere near this much.
The results you got -- on processors A and B (being intentionally ambiguous about which 2 processors those are) you get double the performance of A alone, but on processors A and C you get approximately the same performance as A alone -- sure sound like hyperthreading is the difference, where A and C are hypertwins in the same physical core, and B is in the other physical core. You said that GetLogicalProcessorInformation() claims otherwise, but it's not unheard of for the BIOS tables on which that depends to have errors.
I would run Task Manager, keep an eye on loads on each CPU before you run your test to get an idea of how much else is going on and where Windows schedules it, then run your test again a few times, for different combinations of CPU affinity, and see if you can confirm or deny this theory.
Have you checked the return code from SetProcessAffinityMask to see if there was an error? If the call fails, you might get stuck on one logical processor. According to the documentation, you can only use the bits that are set in the result of GetProcessAffinityMask.
You say you've tried masks of 0x5, 0xA, and 0x9. I'd be curious to see the results with 0x3.

Resources