I am developing a MPI based parallel numerical solver for a Laplace equation subjected to Dirichlet Boundary condition using Finite Difference method.I have taken a square computational domain with a grid size of 1024 x 1024. My objective is to evaluate the Scalability of the developed solver.
I have tested the code using 4,8,16,32,64,128 and 256 processors on a HPC facility having 8 nodes with each node consisting of 48 cores. Until 128 processors I could find a linear scalability in the computational time.(i.e between two consecutive cores say 4 and 8, 8 and 16, 16 and 32, 32 and 64 the solution converges and attains the set L2 and L-infinity true error norms values in the order of 10^-7 and 10^-8 without any issues and also the computational time reduces by half. whereas when i run the code using 128 and 256 cores the error norm values are not following a similar trend as that for 64 cores. The convergence rate drastically reduces say for 128 processors the error norm value is not reducing beyond 10^-6 and it takes more than half the computational time for 64 cores.Similarly with 256 cores the convergence rate is very poor and the computational time is very high.So, I request the experts in this field to help me to resolve this issue.
When I try to optimize my code, for a very long time I've just been using a rule of thumb that addition and subtraction are worth 1, multiplication and division are worth 3, squaring is worth 3 (I rarely use the more general pow function so I have no rule of thumb for it), and square roots are worth 10. (And I assume squaring a number is just a multiplication, so worth 3.)
Here's an example from a 2D orbital simulation. To calculate and apply acceleration from gravity, first I get distance from the ship to the center of earth, then calculate the acceleration.
D = sqrt( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) ); // this is worth 19
A = G*Earth.mass/sqr(D); // this is worth 9, total is 28
However, notice that in calculating D, you take a square root, but when using it in the next calculation, you square it. Therefore you can just do this:
A = G*Earth.mass/( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) ); // this is worth 15
So if my rule of thumb is true, I almost cut in half the cycle time.
However, I cannot even remember where I heard that rule before. I'd like to ask what is the actual cycle times for those basic arithmetic operations?
Assumptions:
everything is a 64-bit floating number in x64 architecture.
everything is already loaded into registers, so no worrying about hits and misses from caches or memory.
no interrupts to the CPU
no if/branching logic such as look ahead prediction
Edit: I suppose what I'm really trying to do is look inside the ALU and only count the cycle time of its logic for the 6 operations. If there is still variance within that, please explain what and why.
Note: I did not see any tags for machine code, so I chose the next closest thing, assembly. To be clear, I am talking about actual machine code operations in x64 architecture. Thus it doesn't matter whether those lines of code I wrote are in C#, C, Javascript, whatever. I'm sure each high-level language will have its own varying times so I don't wanna get into an argument over that. I think it's a shame that there's no machine code tag because when talking about performance and/or operation, you really need to get down into it.
At a minimum, one must understand that an operation has at least two interesting timings: the latency and the throughput.
Latency
The latency is how long any particular operation takes, from its inputs to its output. If you had a long series of operations where the output of one operation is fed into the input of the next, the latency would determine the total time. For example, an integer multiplication on most recent x86 hardware has a latency of 3 cycles: it takes 3 cycles to complete a single multiplication operation. Integer addition has a latency of 1 cycle: the result is available the cycle after the addition executes. Latencies are generally positive integers.
Throughput
The throughput is the number of independent operations that can be performed per unit time. Since CPUs are pipelined and superscalar, this is often more than the inverse of the latency. For example, on most recent x86 chips, 4 integer addition operations can execute per cycle, even though the latency is 1 cycle. Similarly, 1 integer multiplication can execute, on average per cycle, even though any particular multiplication takes 3 cycles to complete (meaning that you must have multiple independent multiplications in progress at once to achieve this).
Inverse Throughput
When discussing instruction performance, it is common to give throughput numbers as "inverse throughput", which is simply 1 / throughput. This makes it easy to directly compare with latency figures without doing a division in your head. For example, the inverse throughput of addition is 0.25 cycles, versus a latency of 1 cycle, so you can immediately see that you if you have sufficient independent additions, they use only something like 0.25 cycles each.
Below I'll use inverse throughput.
Variable Timings
Most simple instructions have fixed timings, at least in their reg-reg form. Some more complex mathematical operations, however, may have input-dependent timings. For example, addition, subtraction and multiplication usually have fixed timings in their integer and floating point forms, but on many platforms division has variable timings in integer, floating point or both. Agner's numbers often show a range to indicate this, but you shouldn't assume the operand space has been tested extensively, especially for floating point.
The Skylake numbers below, for example, show a small range, but it isn't clear if that's due to operand dependency (which would likely be larger) or something else.
Passing denormal inputs, or results that themselves are denormal may incur significant additional cost depending on the denormal mode. The numbers you'll see in the guides generally assume no denormals, but you might be able to find a discussion of denormal costs per operation elsewhere.
More Details
The above is necessary but often not sufficient information to fully qualify performance, since you have other factors to consider such as execution port contention, front-end bottlenecks, and so on. It's enough to start though and you are only asking for "rule of thumb" numbers if I understand it correctly.
Agner Fog
My recommended source for measured latency and inverse throughput numbers are Agner's Fogs guides. You want the files under 4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs, which lists fairly exhaustive timings on a huge variety of AMD and Intel CPUs. You can also get the numbers for some CPUs directly from Intel's guides, but I find them less complete and more difficult to use than Agner's.
Below I'll pull out the numbers for a couple of modern CPUs, for the basic operations you are interested in.
Intel Skylake
Lat Inv Tpt
add/sub (addsd, subsd) 4 0.5
multiply (mulsd) 4 0.5
divide (divsd) 13-14 4
sqrt (sqrtpd) 15-16 4-6
So a "rule of thumb" for latency would be add/sub/mul all cost 1, and division and sqrt are about 3 and 4, respectively. For throughput, the rule would be 1, 8, 8-12 respectively. Note also that the latency is much larger than the inverse throughput, especially for add, sub and mul: you'd need 8 parallel chains of operations if you wanted to hit the max throughput.
AMD Ryzen
Lat Inv Tpt
add/sub (addsd, subsd) 3 0.5
multiply (mulsd) 4 0.5
divide (divsd) 8-13 4-5
sqrt (sqrtpd) 14-15 4-8
The Ryzen numbers are broadly similar to recent Intel. Addition and subtraction are slightly lower latency, multiplication is the same. Latency-wise, the rule of thumb could still generally be summarized as 1/3/4 for add,sub,mul/div/sqrt, with some loss of precision.
Here, the latency range for divide is fairly large, so I expect it is data dependent.
a program runs 100s. multiyply instructions are 80% of the program.There is need to make the program 2 times faster. how much should the multiply instruction be speedup achieve to reach the overall speedup level ?
pls help me to solve this question by amdal's low....
Hint:
The running time is 80 s for multiplies and 20 s for the rest.
You need to bring this down to 30 / 20 s.
I have a device providing the peak GFLOPS specs and I want to measure how far my program is away from it. Since all the data I used was double precision, should I multiply the number of ops by 2 to get the GLOPS value and do the comparison?
No. 1 double-precision floating-point operation is still one floating-point operation.
Most GPUs process double-precision data slower than single-precision, so there should be two specifications of peak GFLOPS. One peak single-precision GFLOPS spec, and one peak double-precision GFLOPS spec. Sometimes it is broken done further, so that (for example) peak division performance is listed separately from peak addition performance.
" ... , should I multiply the number of ops by 2 to get the GLOPS value and do the comparison?"
No, not for any (but one) of these Cards: http://www.geeks3d.com/20140305/amd-radeon-and-nvidia-geforce-fp32-fp64-gflops-table-computing/ .
Note that the ratio varies from 1/24th to as good as 1/3 in most cases, also note that the 'Workstation Graphics Card' has a ratio 1/2 - it is specifically designed that way to improve DP performance.
You need to read the Specs for the Hardware in your Card and determine what performance hit you should expect from switching to DP from SP. There will be a small additional amount of overhead to load the additional precision into the Registers (Memory where the Hardware will perform the Operation on) and to retrieve the additional precision after each Operation.
From a 16MB DRAM, I have to calculate the maximum time it can take to read 8300 consecutive values. Here are the specifications that I have:
-the DRAM is structured as a table of 4096 x 4096 cell.
-it has a time cycle (Tc) of 65 ns.
-in page mode it has a time cycle (Tpm) of only 45 ns.
I thought it was simply done by calculating the number of cells in the DRAM and then calculating the percentage that 8300 represents from the total (4096 x 4096) and then taking that same percentage and multiplying it to the time access. Unfortunately it did not give me the right answer... Any help would be greatly appreciated! Thanks guys
There are many variables into account (e.g., Using open- or close-page mode, number of ranks), thus is memory dependent. In order for you to have a better understanding and re-state your question, please read this paper, which helped me to understand RAM better.
Power and Performance Trade-Offs in Contemporary DRAM System Designs for Multicore Processors
You can search for it in Google Scholar.
Thanks.