Consider a case of p processors hanging off a shared bus. The peak bus
bandwidth is 40GB/s. Each of the processors performs a computation that requires
access to 4 bytes for each floating point operation (FLOP). What is the
peak FLOPs rating of such a system? What is the number of processors, p, for
which this peak rate is achieved?
Found this problem on a Homework Set online.. I believe we need to assume a value for the Processor clock. Am I right? Or is the given information enough?
Related
This question is related Computer Organization and Architecture. i would truly appreciate some assistance
Processor A has a clock speed of 1 GHz, and takes 1 cycle for integer operations, 2 cycles for memory operations, and 4 cycles for floating point operations. Empirical data shows that programs run on Processor A typically are composed of 35% floating point operations, 30% memory operations; and 35% integer operations.
However you wish to design a new processor, Processor B, which is an improvement on Processor A. Processor B will run the same programs as Processor A. To complete your design, you are faced with two options for improving performance:
Increase the clock speed to 1.2 GHz, but memory operations take 3 cycles
Decrease the clock speed to 900 MHz, but floating point operations only take 3 cycles
Compute the speedup for both options and decide the option Processor B should take.
i have tried finding the Execution time for Processors A and B by using ICxCPIxCT. I used MIPS (Clock rate/CPIx10^6) as the IC in all cases. When i was finished with everything, the Processor B(1) was deemed the most efficient since it had the lowest CPU time(940.3x10^-9) but i'm not sure if my method was correct.
I'm trying to optimize my code with SIMD ( on ARM CPUs ), and want to know its arithmetic intensity (flops/byte, AI) and FLOPS.
In order to calculate AI and FLOPS, I have to count the number of floating point operations(FLOPs).
However, I can't find any precise definition of FLOPs.
Of course, mul, add, sub, div are clearly FLOPs, but how about move operations, shuffle operations (e.g. _mm_shuffle_ps), set operations (e.g. _mm_set1_ps), conversion operations (e.g. _mm_cvtps_pi32), etc. ?
They're operations that deal with floating point values. Should I count them as FLOPs ? If not, why ?
Which operations do profilers like Intel VTune and Nvidia's nvprof, or PMUs usually count ?
EDIT:
What all operations does FLOPS include?
This question is mainly about mathematically complex operations.
I also want to know the standard way to deal with "not mathematical" operations which take floating point values or vectors as inputs.
Shuffle / blend on FP values are not considered FLOPs. They are just overhead of using SIMD on not purely "vertical" problems, or for problems with branching that you do branchlessly with a blend.
Neither are FP AND/OR/XOR. You could try to justify counting FP absolute value using andps (_mm_and_ps), but normally it's not counted. FP abs doesn't require looking at the exponent / significand, or normalizing the result, or any of the things that make FP execution units expensive. abs (AND) / sign-flip (XOR) or make negative (OR) are trivial bitwise ops.
FMA is normally counted as two floating point ops (the mul and add), even though it's a single instruction with the same (or similar) performance to SIMD FP add or mul. The most important problem that bottlenecks on raw FLOP/s is matmul, which does need an equal mix of mul and add, and can take advantage of FMA perfectly.
So the FLOP/s of a Haswell core is
its SIMD vector width (8 float elements per vector)
times SIMD FMA per clock (2)
times FLOPs per FMA (2)
times clock speed (max single core turbo it can sustain while maxing out both FMA units; long-term depends on cooling, short term just depends on power limits).
For a whole CPU, not just a single core: multiply by number of cores and use the max sustained clock speed with all cores busy, usually lower than single-core turbo on CPUs that have turbo at all.)
Intel and other CPU vendors don't count the fact that their CPUs can also sustain a vandps in parallel with 2 vfma132ps instructions per clock, because FP abs is not a difficult operation.
See also How do I achieve the theoretical maximum of 4 FLOPs per cycle?. (It's actually more than 4 on modern CPUs :P)
Peak FLOPS (FP ops per second, or FLOP/s) isn't achievable if you have much other overhead taking up front-end bandwidth or creating other bottlenecks. The metric is just the raw amount of math you can do when running in a straight line, not on any specific practical problem.
Although people would think it's silly if theoretical peak flops is much higher than a carefully hand-tuned matmul or Mandelbrot could ever achieve, even for compile-time-constant problem sizes. e.g. if the front-end couldn't keep up with doing any stores as well as the FMAs. e.g. if Haswell had four FMA execution units, so it could only sustain max FLOPs if literally every instruction was an FMA. Memory source operands could micro-fuse for loads, but there'd be no room to store without hurting throughput.
The reason Intel doesn't have even 3 FMA units is that most real code has trouble saturating 2 FMA units, especially with only 2 load ports and 1 store port. They'd be wasted almost all of the time, and 256-bit FMA unit takes a lot of transistors.
(Ice Lake widens issue/rename stage of the pipeline to 5 uops/clock, but also widens SIMD execution units to 512-bit with AVX-512 instead of adding a 3rd 256-bit FMA unit. It has 2/clock load and 2/clock store, although that store throughput is only sustainable to L1d cache for 32-byte or narrower stores, not 64-byte.)
When it comes to optimisation, it is common practise to only measure FLOPs on the hotspots of your code, for example, the number of Floating Point Multiply & Accumulate operations in Convolution. This is mainly because other operations might be insignificant or irreplaceable and therefore can't be exploited for any kind of optimization.
For example, all instructions under Vector Floating Point Instructions in A4.13 in ARMv7 Reference Manual fall under a Floating Point Operation as a FLOPs/Cycle for an FPU instruction is typically constant in a processor.
Not just ARM, but many micro-processors have a dedicated Floating Point Unit, so when you are measuring FLOPs, you're measuring the speed of this unit. With this and FLOPs/cycle you can more or less calculate the theoretical peak performance.
But, FLOPs are to be taken with a grain of salt, as they can only be used to approximately estimate the speed of your code because they fail to take into account other conditions your processor operates under. This is why counting FLOPs only for your hotspots (usually arithmetic ops) is more or less enough in most cases.
Having said that, FLOPs can act as a comparative metric for two strenuous piece of code but doesn't say much about your code per se.
I've wondered about the amount of bits that a given CPU can handles and how should I manage to calculate this specific value.
I wanted to make sure that my thought and also my calculation are right.
Given that I have 64 bit CPU which feature 2.3 Ghz. The amount of bits that being handled per second should be the following :
(2.3 * 10^9) * 64
Is it that simple or I need consider any other variables?
Calculating throughput of a CPU is not as simple. Modern CPUs are pipelined and can execute multiple instructions concurrently if there are no data dependencies, yielding instructions per cycle (IPC) > 1. Some instructions may also have a higher latency than one cycle, meaning IPC can be < 1.
For some operations they might be constrained by memory bandwidth.
And then there are SIMD instructions which can process more data per instruction than the width of a single general purpose register.
http://www.agner.org/optimize/instruction_tables.pdf
https://en.wikipedia.org/wiki/Instruction_pipelining
I can calculate penalty when I have a single cache. But I'm unsure what to do when I am presented with two L1 caches (one for data and one for instruction) that are accessed in parallel. I'm also unsure what to do when I'm presented with clock cycles instead of actual time such as ns.
How do I calculate the average miss penalty using these new parameters?
Do I just use the formula two times and then average the miss penalty or is there more to this?
AMAT = hit time + miss rate * miss penalty
For example I have the following values:
AMAT = 4 clock cycles
L1 data access = 2 clock cycle (also hit time)
L1 instruction access = 2 clock cycle (also hit time)
60% of instructions are loads and stores
L1 instruction miss rate = 1%
L1 data miss rate = 3%
How would these values fit into AMAT?
Short answer
The average memory access time (AMAT) is typically calculated by taking the total number of instructions and dividing it by the total number of cycles spent servicing the memory request.
Details
On page B-17 of Computer Architecture a Quantiative Approach, 5th edition AMAT is defined as:
Average memory access time = % instructions x (Hit time + instruction miss rate x miss penalty) + % data x (Hit time + Data miss rate x miss penalty)`.
As you can see in this formula each instruction counts for a single memory access and the instructions that operate on data (load/store) constitute an additional memory access.
Note that there are many simplifying instructions that are made when using AMAT, and depending on the performance analysis that you want to perform. The same textbook I quotes earlier notes that:
In summary, although the state of the art in defining and measuring
memory stalls for out-of-order processors is complex, be aware of the
issues because they significantly affect performance. The complexity
arises because out-of-order processors tolerate some latency due to
cache misses without hurting performance. Consequently, designers
normally use simulators of the out-of-order processor and memory when
evaluating trade-offs in the memory hierarchy to be sure that an
improvement that helps the average memory latency actually helps
program performance.
My point of including this quote is that in practice AMAT is used for getting an approximate comparison between various different option. And as a result there are always simplifying assumptions used. But generally the memory accesses for instructions and data are added together to get a total number of accesses when calculating AMAT, rather than being calculated separately.
The way I see it, since the L1 Instruction Cache and the L1 Data Cache are accessed in parallel, you should compute AMAT for Instructions and AMAT for data, and then take the largest value as the final AMAT.
In your example since the Data Miss Rate is higher than Instruction Miss Rate you can consider that during the time the CPU waits for data, it solves all the misses on the instruction cache.
If the measure unit is cycles you do the same as if it were nanoseconds. If you know the frequency of your processor, you can convert back the AMAT in nanoseconds.
I have a device providing the peak GFLOPS specs and I want to measure how far my program is away from it. Since all the data I used was double precision, should I multiply the number of ops by 2 to get the GLOPS value and do the comparison?
No. 1 double-precision floating-point operation is still one floating-point operation.
Most GPUs process double-precision data slower than single-precision, so there should be two specifications of peak GFLOPS. One peak single-precision GFLOPS spec, and one peak double-precision GFLOPS spec. Sometimes it is broken done further, so that (for example) peak division performance is listed separately from peak addition performance.
" ... , should I multiply the number of ops by 2 to get the GLOPS value and do the comparison?"
No, not for any (but one) of these Cards: http://www.geeks3d.com/20140305/amd-radeon-and-nvidia-geforce-fp32-fp64-gflops-table-computing/ .
Note that the ratio varies from 1/24th to as good as 1/3 in most cases, also note that the 'Workstation Graphics Card' has a ratio 1/2 - it is specifically designed that way to improve DP performance.
You need to read the Specs for the Hardware in your Card and determine what performance hit you should expect from switching to DP from SP. There will be a small additional amount of overhead to load the additional precision into the Registers (Memory where the Hardware will perform the Operation on) and to retrieve the additional precision after each Operation.