There is a law of parallel programming, but I can't find a link to it. Roughly, it says the following: if 1/3 of your program's execution time is sequential (runs in one thread), you can increase your application's performance by at most a factor of 3.
Amdahl's law dates back to the old single-core CPUs. You may want to read carefully about what happens on current multi-core CPUs. See
Amdahl's Law in the Multicore Era
and the related page where you will find additional resources.
That would be Amdahl's Law.
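For a concrete feel for the numbers in the question, here is a minimal sketch (not from the original posts) that evaluates Amdahl's formula for a serial fraction of 1/3:

```cpp
// Minimal sketch: Amdahl's law for a program whose serial fraction is 1/3,
// evaluated for several processor counts.
#include <cstdio>

// Speedup with n processors when a fraction `serial` of the single-threaded
// run time cannot be parallelized.
double amdahl_speedup(double serial, double n) {
    return 1.0 / (serial + (1.0 - serial) / n);
}

int main() {
    const double serial = 1.0 / 3.0;
    for (double n : {2.0, 4.0, 16.0, 1024.0}) {
        std::printf("n = %6.0f  speedup = %.3f\n", n, amdahl_speedup(serial, n));
    }
    // As n grows, the speedup approaches 1 / (1/3) = 3, the limit mentioned
    // in the question.
    return 0;
}
```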
My question primarily applies to firestorm/icestorm (because that's the hardware I have), but I am curious about what other representative arm cores do too. Arm has strange pre- and post-incremented addressing modes. If I have (for instance) two post-incremented loads from the same register, will the second depend on the first, or is the CPU smart enough to perform them in parallel?
AFAIK the exact behaviour of the M1 execution units is mostly undocumented. Still, there is certainly a dependency chain in this case. In fact, it would be very hard to break it, and the design of modern processors makes this even harder: the decoders, execution units and schedulers are distinct units, and it would be insane to dynamically adapt the scheduling based on the instructions executed in parallel by the execution units just to break the chain in this particular case. Not to mention that instructions are pipelined and it generally takes a few cycles before they are committed. Furthermore, the latency of the instructions varies with the fetched memory location. Finally, even if this were the case, the Firestorm documentation does not mention such a feedback loop (see below for the links). Another possible way for a processor to optimize such a pattern is to fuse the micro-instructions so as to combine the increment and expose more parallelism, but this is pretty complex to do for a relatively small improvement, and there is no evidence so far that Firestorm does that (see here for more information about Firestorm instruction fusion/elimination).
The M1 big cores (Apple's Firestorm) are designed to be massively parallel. They have 6 ALUs per core, so they can execute a lot of instructions in parallel on each core (possibly at the expense of a higher latency). However, this design tends to require a lot more transistors than the current mainstream x86 Intel/AMD alternatives (the Alder Lake/XX-Cove architectures put aside). Thus, the cores operate at a significantly lower frequency so as to keep energy consumption low. This means dependency chains are significantly more expensive on such an architecture compared to others, unless there are enough independent instructions to execute in parallel alongside the critical path. For more information about how CPUs work, please read Modern Microprocessors - A 90-Minute Guide!. For more information about the M1 processors and especially the Firestorm architecture, please read this deep analysis.
Note that the Icestorm cores are designed to be energy efficient, so they are far less parallel, and a dependency chain should thus be less critical on such a core. Still, having fewer dependencies is often a good idea.
As for other ARM processors, recent core architectures are not as parallel as Firestorm. For example, the Cortex-A77 and Neoverse V1 have "only" 4 ALUs (which is already quite good). One also needs to consider the latency of each instruction actually used in a given code. This information is available on the ARM website and, AFAIK, has not yet been published for Apple processors (one needs to benchmark the instructions).
As for pre- vs. post-increment, I expect them to take the same time (same latency and throughput), especially on big cores like Firestorm (which try to reduce the latency of the most frequent instructions at the expense of more transistors). However, the actual scheduling of the instructions for a given code can cause one to be slower than the other if the latency is not hidden by other instructions.
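To illustrate the pattern at the source level, here is a minimal, hypothetical C++ sketch (not from the original answer); whether the compiler actually emits post-indexed loads for the first loop is entirely up to the code generator, so treat it as an illustration of where the address dependency lives, not as a statement about Firestorm:

```cpp
// Minimal sketch: two ways of walking the same array. In the first loop every
// load address comes from the previous pointer update (the post-increment
// pattern); in the second, each address is computed from the loop index and
// the fixed base, so the loads do not depend on one another.
#include <cstddef>
#include <cstdint>

int64_t sum_post_increment(const int64_t* p, std::size_t n) {
    int64_t s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += *p++;        // address chained through the pointer update
    return s;
}

int64_t sum_indexed(const int64_t* p, std::size_t n) {
    int64_t s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += p[i];        // address = base + i * sizeof(int64_t), independent per iteration
    return s;
}
```

In practice an optimizing compiler will often turn both loops into the same machine code, and the one-cycle pointer update is usually hidden by out-of-order execution, which is in line with the IRC feedback reported below.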
I received an answer to this on IRC: such usage will be fairly fast (makes sense when you consider it corresponds to typical looping patterns; good if the loop-carried dependency doesn't hurt too much), but it is still better to avoid it if possible, as it takes up rename bandwidth.
Gustafson's law is a counter to Amdahl's law, claiming that most parallel-processing applications actually increase the workload when they are given more parallel processing capacity. Thus, Amdahl's law, which assumes a constant workload when measuring speedup, is a poor method for determining the benefits of parallel processing.
However, I'm confused as to what exactly Gustafson is trying to argue.
For example, say I take Gustafson's law and apply it to a sequential processor and two parallel processors:
I generate a workload to run on parallel processor #1. It takes 1 second to execute on the parallel processor and takes 10 seconds to execute on the sequential processor.
I generate a bigger workload and run it on parallel processor #2. It takes 1 second to execute on the parallel processor and takes 50 seconds to execute on the sequential processor.
For each workload, there is a speedup relative to the sequential processor. However, this doesn't seem to violate Amdahl's law since there is still some upper speedup limit for each workload. All this is doing is varying the workload such that the "serial only" portion of the code is likely reduced. As per Amdahl's law, decreasing the serial only portion will increase the speed-up limit.
So I'm confused: Gustafson's law, which advocates increasing the workload while maintaining a constant execution time, doesn't seem to add any new information.
What exactly is Gustafson arguing? And specifically, what does "scaled-speedup" even mean?
Gustafson's law is a counter to Amdahl's law, claiming that most parallel-processing applications actually increase the workload when they are given more parallel processing capacity. Thus, Amdahl's law, which assumes a constant workload when measuring speedup, is a poor method for determining the benefits of parallel processing.
No.
First of all, the two laws are not claiming anything. They show the theoretical (maximum) speed-up an application can get based on a specific configuration. They are basic mathematical formulas modelling the behaviour of a parallel algorithm. Their goal is to understand some limitations of parallel computing (Amdahl's law) and what developers can do to overcome them (Gustafson's law).
Amdahl's law is a mathematical formula describing the theoretical speed-up of an application with a variable-time workload (but a fixed amount of computation) executed by several processing units. The workload contains a serial part and a parallel part. It shows that the speed-up is bounded by the serial portion of the program. This is not great for developers of parallel applications, since it means that the scalability of their application will be quite bad for rather sequential workloads, assuming the workload is fixed.
Gustafson's law breaks this assumption and adds a new one so as to look at the problem differently. Indeed, this mathematical formula describes the theoretical speed-up of an application with a fixed-time workload (but a variable amount of computation) executed by several processing units. It shows that the speed-up can be good as long as application developers can add more parallel work to a given workload.
I generate a workload to run on parallel processor #1. It takes 1 second to execute on the parallel processor and takes 10 seconds to execute on the sequential processor.
I generate a bigger workload and run it on parallel processor #2. It takes 1 second to execute on the parallel processor and takes 50 seconds to execute on the sequential processor.
A parallel program cannot be more than 2 times faster with 2 processing units than with one processing unit under these two models. It can happen in practice due to some confounding factors (typically cache effects), but such an effect does not come purely from the processing units. What do you mean by "two parallel processors"? The number of processing units is missing from your example (unless you want it to be derived from the provided information).
So I'm confused: Gustafson's law, which advocates increasing the workload while maintaining a constant execution time, doesn't seem to add any new information.
The two laws are like two points of view on the same scalability problem. However, they make different assumptions: a fixed amount of work with a variable execution time vs. a variable amount of work with a fixed execution time.
If you are familiar with the concepts of strong/weak scaling, please note that Amdahl's law is to strong scaling what Gustafson's law is to weak scaling.
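As a rough illustration of the two assumptions (a sketch, not part of either law's original statement), the following evaluates both formulas for the same parallel fraction f and number of processing units N:

```cpp
// Minimal sketch: Amdahl assumes the total amount of work is fixed, Gustafson
// assumes the parallel part of the work is scaled with N so that the
// wall-clock time stays fixed.
#include <cstdio>

double amdahl(double f, double n)    { return 1.0 / ((1.0 - f) + f / n); }  // fixed work
double gustafson(double f, double n) { return (1.0 - f) + f * n; }          // fixed time, scaled work

int main() {
    const double f = 0.9;   // 90% of the work is parallelizable
    for (double n : {2.0, 8.0, 64.0, 1024.0}) {
        std::printf("N = %6.0f  Amdahl = %7.3f  Gustafson = %8.1f\n",
                    n, amdahl(f, n), gustafson(f, n));
    }
    // Amdahl's speedup saturates at 1 / (1 - f) = 10x, while the scaled
    // speedup keeps growing because the workload grows with N.
    return 0;
}
```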
I know about Amdahl's law and maximum speedup of a parallel program. But I couldn't research Gustafson's law properly. What is Gustafson's law and what is the difference between Amdahl's and Gustafson's laws?
Amdahl's law
Suppose you have a sequential code and that a fraction f of its computation is parallelized and run on N processing units working in parallel, while the remaining fraction 1-f cannot be improved, i.e., it cannot be parallelized. Amdahl's law states that the speedup achieved by parallelization is

S = 1 / ((1 - f) + f / N)
Gustafson's law
Amdahl's point of view is focused on a fixed computation problem size, as it deals with a code taking a fixed amount of sequential calculation time. Gustafson's objection is that massively parallel machines allow previously unfeasible computations, since they enable computations on very large data sets in a fixed amount of time. In other words, a parallel platform does more than speed up the execution of a code: it enables dealing with larger problems.
Suppose you have an application taking a time ts to be executed on N processing units. Of that computing time, a fraction (1-f) must be run sequentially. Accordingly, this application would run on a fully sequential machine in a time t equal to

t = (1 - f) · ts + N · f · ts
If we increase the problem size, we can increase the number of processing units so as to keep the time during which the code is executed in parallel equal to f·ts. In this case, the sequential execution time increases with N, which now becomes a measure of the problem size. The speedup then becomes

S = t / ts = (1 - f) + N · f
The efficiency would then be

E = S / N = (1 - f) / N + f
so that the efficiency tends to f for increasing N.
The pitfall of these rather optimistic speedup and efficiency evaluations is related to the fact that, as the problem size increases, communication costs will increase, but increases in communication costs are not accounted for by Gustafson’s law.
References
G. Barlas, Multicore and GPU Programming: An Integrated Approach, Morgan Kaufmann
M.D. Hill, M.R. Marty, Amdahl’s law in the multicore era, Computer, vol. 41, n. 7, pp. 33-38, Jul. 2008.
GPGPU
There are interesting discussions on Amdahl's law as applied to General Purpose Graphics Processing Units, see
Amdahl's law and GPU
Amdahl's Law for GPU
Is Amdahl's law accepted for GPUs too?
We are looking at the same problem from different perspectives. Amdahl's law asks: if you have, say, 100 more CPUs, how much faster can you solve the same problem?
Gustafson's law asks: if a parallel computer with 100 CPUs can solve this problem in 30 minutes, how long would it take a computer with just ONE such CPU to solve the same problem?
Gustafson's law reflects real situations better. For example, we cannot use a 20-year-old PC to play most of today's video games, because it is just too slow.
Amdahl's law
Generally, if a fraction f of a job is impossible to divide into parallel parts, the whole thing can only get 1/f times faster in parallel. That is Amdahl's law.
Gustafson's law
Generally, for p participants and an un-parallelizable fraction f, the whole thing can get p + (1−p)·f times faster. That is Gustafson's law.
I am tuning my GEMM code and comparing it with Eigen and MKL. I have a system with four physical cores. Until now I have used the default number of threads from OpenMP (eight on my system). I assumed this would be at least as good as four threads. However, I discovered today that if I run Eigen and my own GEMM code on a large dense matrix (1000x1000), I get better performance using four threads instead of eight. The efficiency jumped from 45% to 65%. I think this can also be seen in this plot:
https://plafrim.bordeaux.inria.fr/doku.php?id=people:guenneba
The difference is quite substantial. However, the performance is much less stable: it jumps around quite a bit from iteration to iteration, both with Eigen and with my own GEMM code. I'm surprised that hyper-threading makes the performance so much worse. I guess this is not really a question; it's an unexpected observation which I'm hoping to get feedback on.
I see that not using hyper-threading is also suggested here:
How to speed up Eigen library's matrix product?
I do have a question regarding measuring maximum performance. What I do now is run CPU-Z, look at the frequency while my GEMM code is running, and then use that number in my code (4.3 GHz on one overclocked system I use). Can I trust this number for all threads? How do I know the frequency per thread to determine the maximum? How do I properly account for turbo boost?
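For reference, efficiency figures like the 45%/65% mentioned above are usually computed against a theoretical peak. Here is a minimal sketch with made-up example numbers; the FLOPs-per-cycle value, the sustained frequency and the timing are assumptions you would replace with values for your own machine:

```cpp
// Minimal sketch: efficiency = measured GFLOP/s divided by the theoretical peak.
#include <cstdio>

int main() {
    const double freq_ghz        = 4.3;   // sustained frequency observed under load (e.g. in CPU-Z)
    const int    physical_cores  = 4;
    const double flops_per_cycle = 8.0;   // architecture-dependent assumption (e.g. AVX without FMA)

    const double peak_gflops = freq_ghz * physical_cores * flops_per_cycle;

    // A dense n x n GEMM performs about 2 * n^3 floating-point operations.
    const double n = 1000.0;
    const double measured_seconds = 0.020;            // example timing, not a real measurement
    const double measured_gflops  = 2.0 * n * n * n / measured_seconds / 1e9;

    std::printf("peak = %.1f GFLOP/s  measured = %.1f GFLOP/s  efficiency = %.0f%%\n",
                peak_gflops, measured_gflops, 100.0 * measured_gflops / peak_gflops);
    return 0;
}
```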
The purpose of hyperthreading is to improve CPU usage for code exhibiting high latency. Hyperthreading masks this latency by processing two threads at once, thus providing more instruction-level parallelism.
However, a well-written matrix-product kernel exhibits excellent instruction-level parallelism and thus exploits nearly 100% of the CPU resources. Therefore there is no room for a second "hyper" thread, and the overhead of its management can only decrease the overall performance.
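As a minimal sketch of the practical takeaway (assuming Eigen is built with OpenMP support), limiting the product to one thread per physical core looks like this; setting the OMP_NUM_THREADS environment variable to 4 achieves the same for plain OpenMP code:

```cpp
// Minimal sketch: run Eigen's matrix product on one thread per physical core
// instead of the OpenMP default of one thread per hardware thread.
#include <Eigen/Dense>
#include <omp.h>

int main() {
    const int physical_cores = 4;          // known for this machine
    omp_set_num_threads(physical_cores);   // OpenMP default would be 8 with hyperthreading
    Eigen::setNbThreads(physical_cores);   // Eigen's internal parallelization

    Eigen::MatrixXd a = Eigen::MatrixXd::Random(1000, 1000);
    Eigen::MatrixXd b = Eigen::MatrixXd::Random(1000, 1000);
    Eigen::MatrixXd c = a * b;             // the product now uses 4 threads
    return c(0, 0) > 0.0 ? 0 : 1;          // keep the product from being optimized away
}
```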
Unless I've missed something (always possible), your CPU has one clock shared by all its components, so if you measure its rate at 4.3 GHz (or whatever) then that's the rate of all the components for which it makes sense to figure out a rate. Imagine the chaos if this were not so, with some cores running at one rate and others at another; the shared components (e.g. memory access) would become unmanageable.
As to hyperthreading actually worsening the performance of your matrix multiplication, I'm not surprised. After all, hyperthreading is a poor person's parallelisation technique, duplicating instruction pipelines but not functional units. Once you've got your code screaming along, pushing your n*10^6 contiguous memory locations through the FPUs, a context switch in response to a pipeline stall isn't going to help much. At best the other pipeline will scream along for a while before another context switch robs you of useful clock cycles; at worst all the careful arrangement of data in the memory hierarchy will be horribly mangled at each switch.
Hyperthreading is designed not for parallel numeric computational speed but for improving the performance of a much more general workload; we use general-purpose CPUs in high-performance computing not because we want hyperthreading but because all the specialist parallel numeric CPUs have gone the way of all flesh.
As a provider of multithreaded concurrency services, I have explored how hyperthreading affects performance under a variety of conditions. I have found that with software that limits its own high-utilization threads to no more than the actual physical processors available, the presence or absence of HT makes very little difference. Software that attempts to use more threads than that for heavy computational work is likely unaware that it is doing so, relying merely on the total processor count (which doubles under HT), and predictably runs more slowly. Perhaps the largest benefit that enabling HT may provide is that you can max out all physical processors without bringing the rest of the system to a crawl. Without HT, software often has to leave one CPU free to keep the host system running normally. Hyperthreads are just more switchable threads; they are not additional processors.
We are checking how fast an algorithm runs on an FPGA vs. a normal quad-core x86 computer.
Now, on the x86 we run the algorithm lots of times and take the median in order to eliminate OS overhead; this also "cleans" the curve of errors.
That's not the problem.
On the FPGA, the measurement is in cycles, which we then convert to time; with the FSMD it is trivial to count cycles anyway...
We think that counting cycles is too "pure" a measure: it could be done theoretically, without making a real measurement or running the algorithm on the real FPGA.
I want to know whether there is a paper, or some idea, on how to do a real time measurement.
If you are trying to establish that your FPGA implementation is competitive or superior, and therefore might be useful in the real world, then I encourage you to compare wall-clock times on the multiprocessor vs. the FPGA implementation. That will also help ensure you do not overlook performance effects beyond the FSM + datapath (such as I/O delays).
I agree that reporting cycle counts only is not representative because the FPGA cycle time can be 10X that of off the shelf commodity microprocessors.
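A minimal sketch of such a wall-clock comparison, with made-up cycle counts and clock frequency standing in for the real design and a placeholder for the x86 kernel, might look like this:

```cpp
// Minimal sketch: convert the FPGA cycle count into seconds using the design's
// clock, and time the CPU implementation with a wall clock, so both sides are
// compared in the same unit.
#include <chrono>
#include <cstdio>

// Stand-in for the real x86 implementation under test (hypothetical).
volatile double sink = 0.0;
void run_algorithm_on_cpu() {
    for (int i = 0; i < 1000000; ++i) sink += i * 0.5;
}

int main() {
    // FPGA side: cycles counted by the FSMD, converted with the design's clock.
    const double fpga_cycles   = 1.2e6;    // example value
    const double fpga_clock_hz = 100e6;    // example: a 100 MHz design
    const double fpga_seconds  = fpga_cycles / fpga_clock_hz;

    // CPU side: one wall-clock run; in practice take the median of many runs,
    // as the question already does.
    auto t0 = std::chrono::steady_clock::now();
    run_algorithm_on_cpu();
    auto t1 = std::chrono::steady_clock::now();
    const double cpu_seconds = std::chrono::duration<double>(t1 - t0).count();

    std::printf("FPGA: %.6f s   CPU: %.6f s\n", fpga_seconds, cpu_seconds);
    return 0;
}
```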
Now for some additional unsolicited advice. I have been to numerous FCCM conferences, and similar, and I have heard many dozens of FPGA implementation vs. CPU implementation performance comparison papers. All too often, a paper compares a custom FPGA implementation that took months, vs. a CPU+software implementation wherein the engineer just took the benchmark source code off the shelf, compiled it, and ran it in one afternoon. I do not find such presentations particularly compelling.
A fair comparison would evaluate a software implementation that uses best practices and the best available libraries (e.g. Intel MKL or IPP), that uses multithreading across multiple cores, that uses vector SIMD (e.g. SSE, AVX, ...) instead of scalar computation, and that uses tools like profilers to eliminate easily fixed waste and VTune to understand and tune the cache+memory hierarchy. Also, please be sure to report the actual amount of engineering time spent on the FPGA vs. the software implementations.
More free advice: In these energy focused times where results/joule may trump results/second, consider also reporting the energy efficiency of your implementations.
More free advice: to get the most repeatable times on the "quad x86", be sure to quiesce the machine, shut down background processes, daemons, services, etc., and disconnect the network.
Happy hacking!