Can Parallel Processing Efficiency become > 1?

I have read about efficiency in parallel computing, but never got a clear idea of it. I also read about achieving efficiency > 1 and concluded that it is possible when the speedup is super-linear.
Is that correct and possible?
If yes, then can anybody tell me how and provide an example for that?
Or, if it is not, then why?

Let's agree on a few terms first:
A set of processes may get scheduled for execution under several different strategies --
[SERIAL] - i.e. execute one after another has finished, till all are done, or
[PARALLEL] - i.e. all-start at once, all-execute at once, all-terminate at once
or
in a "just"-[CONCURRENT] fashion - i.e. some start at once, as resources permit, others are scheduled for [CONCURRENT] execution whenever free or new resources permit. The processing gets finished progressively, but without any coordination, just as resources-mapping and priorities permit.
Next, let's define a measure by which to compare processing efficiency, right?
Given that efficiency may relate to power consumption or to processing time, let's focus on processing time, ok?
Gene Amdahl elaborated the domain of generic processing speedups, from which we will borrow here. A common issue in HPC / computer-science education is that lecturers do not emphasise the real-world costs of organising the parallel processing. For this reason, the overhead-naive (original) formulation of Amdahl's Law ought always to be replaced by an overhead-strict re-formulation, because otherwise any naive-form figures in parallel computing are just comparing apples to oranges.
On the other hand, once both the process-setup add-on overhead costs and the process-termination add-on overhead costs are recorded in the scheme, the overhead-strict speedup comparison makes it meaningful to speak about processing-time efficiency.
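To make the difference concrete, here is a minimal Python sketch (it is not the GUI tool cited further below, and the overhead figures are invented placeholders) that compares the overhead-naive Amdahl speedup with an overhead-strict variant, which also charges the setup and termination add-on costs against the parallel run:

def speedup_naive(p, n):
    # Classic (overhead-naive) Amdahl's Law:
    # p = parallelisable fraction of the work, n = number of processors.
    return 1.0 / ((1.0 - p) + p / n)

def speedup_overhead_strict(p, n, t_serial, o_setup, o_terminate):
    # Overhead-strict variant: the parallel run additionally pays one-off
    # setup and termination costs (same time units as t_serial).
    t_parallel = ((1.0 - p) * t_serial
                  + o_setup
                  + (p * t_serial) / n
                  + o_terminate)
    return t_serial / t_parallel

if __name__ == "__main__":
    p, t_serial = 0.95, 100.0      # 95 % parallelisable, 100 time units in total
    for n in (2, 8, 64, 1024):
        print(n,
              round(speedup_naive(p, n), 2),
              round(speedup_overhead_strict(p, n, t_serial, 5.0, 1.0), 2))

The naive column keeps climbing towards 1 / (1 - p), while the overhead-strict column saturates much earlier, which is exactly the apples-to-oranges gap mentioned above.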
Having said this, there are cases when processing-time efficiency can become > 1. It is fair to say that professional due care has to be taken, and that not all processing types permit any remarkable speedup on however large a pool of code-execution resources, precisely because of the obligation to pay and cover the add-on costs of the NUMA / distributed-processing overheads.
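As a toy illustration of the bookkeeping only (all numbers are invented): efficiency is speedup divided by the number of processors, so any super-linear speedup, typically caused by each processor's share of the data suddenly fitting into its cache or local memory, yields an efficiency above 1:

def efficiency(t_serial, t_parallel, n_processors):
    # Efficiency E = S / N, where speedup S = T_serial / T_parallel.
    speedup = t_serial / t_parallel
    return speedup / n_processors

# Invented example: on 4 processors each quarter of the working set fits
# in cache, so the measured parallel time drops by more than a factor of 4.
print(efficiency(t_serial=100.0, t_parallel=20.0, n_processors=4))   # -> 1.25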
You may like to read further and experiment with the overhead-strict Amdahl's Law re-formulation in the [PARALLEL]-processing speedups GUI-interactive tool cited here.
Being in Sweden, a must-read for you is Andreas Olofsson's personal story about his remarkable effort and experience piloting parallel hardware, with many firsts on the way from Kickstarter to a DARPA-acquired [PARALLEL]-hardware know-how.


How do you reason about fluctuations in benchmarking data?

Suppose you're trying to optimize a function and using some benchmarking framework (like Google Benchmark) for measurement. You run the benchmarks on the original function 3 times and see average wall clock time/CPU times of 100 ms, 110 ms, 90 ms. Then you run the benchmarks on the "optimized" function 3 times and see 80 ms, 95 ms, 105 ms. (I made these numbers up). Do you conclude that your optimizations were successful?
Another problem I often run into is that I'll go do something else and run the benchmarks later in the day and get numbers that are further away than the delta between the original and optimized earlier in the day (say, 80 ms, 85 ms, 75 ms for the original function).
I know there are statistical methods to determine whether the improvement is "significant". Do software engineers actually use these formal calculations in practice?
I'm looking for some kind of process to follow when optimizing code.
Rule of Thumb
Minimum(!) of each series => 90ms vs 80ms
Estimate noise => ~ 10ms
Pessimism => It probably didn't get any slower.
Not happy yet?
Take more measurements. (~13 runs each)
Interleave the runs. (Don't measure 13x A followed by 13x B.)
Ideally you always randomize whether you run A or B next (scientific: randomized trial), but it's probably overkill. Any source of error should affect each variant with the same probability. (Like the CPU building up heat over time, or a background task starting after run 11.)
Go back to step 1.
Still not happy? Time to admit that you've been nerd-sniped. The difference, if it exists, is so small that you can't even measure it. Pick the more readable variant and move on. (Or alternatively, lock your CPU frequency, isolate a core just for the test, quiet down your system...)
Explanation
Minimum: Many people (and tools, even) take the average, but the minimum is statistically more stable. There is a lower limit to how fast your benchmark can run on given hardware, but no upper limit to how much it can be slowed down by other programs. Also, taking the minimum will automatically drop the initial "warm-up" run.
Noise: Apply common sense and just glance over the numbers. If you look at the standard deviation, give it a very skeptical look! A single outlier will influence it so much that it becomes nearly useless. (It's usually not a normal distribution.)
Pessimism: You were really clever to find this optimization, you really want the optimized version to be faster! If it looks better just by chance, you will believe it. (You knew it!) So if you care about being correct, you must counter this tendency.
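A minimal sketch of the recipe above (the two workloads and the run count are placeholders; any real benchmark harness works just as well): run A and B interleaved in random order, take the minimum of each series, and compare those minima.

import random
import time

def bench(func):
    # One timed run; time.perf_counter() is a monotonic, high-resolution clock.
    start = time.perf_counter()
    func()
    return time.perf_counter() - start

def compare(variant_a, variant_b, runs=13):
    times = {"A": [], "B": []}
    schedule = ["A", "B"] * runs
    random.shuffle(schedule)              # randomized, interleaved trial
    for label in schedule:
        func = variant_a if label == "A" else variant_b
        times[label].append(bench(func))
    # The minimum is more stable than the mean: noise only ever adds time.
    return min(times["A"]), min(times["B"])

# Placeholder workloads standing in for "original" and "optimized".
original  = lambda: sum(i * i for i in range(100_000))
optimized = lambda: sum(i * i for i in range(100_000))

print(compare(original, optimized))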
Disclaimer
Those are just basic guidelines. Worst-case latency is relevant in some applications (smooth animations or motor control), but it will be harder to measure. It's easy (and fun!) to optimize something that doesn't matter in practice. Instead of wondering if your 1% gain is statistically significant, try something else. Measure the full program including OS overhead. Comment out code, or run work twice, only to check if optimizing it might be worth it.
Do you conclude that your optimizations were successful?
No. Three runs are not enough, especially given the huge variation and the fact that some timings of the two groups are mixed once merged and sorted.
For small timings like this, the first run should be removed and at least dozens of runs should be performed. I would personally use at least hundreds of runs.
Do software engineers actually use these formal calculations in practice?
Only very few developers do advanced statistical analysis. It is often not necessary to do something very formal when the gap before/after the target optimization is huge and the variation within groups is small.
For example, if your program is twice as fast as before with a min-max variation of <5%, then you can quite safely say that the optimization is successful. That being said, it is sometimes not the case due to unexpected external factors (though this is very rare when the gap is so big).
If the result is not obvious, then you need to do some basic statistics. You need to compute the standard deviation, the mean and the median time, remove the first run, interleave runs and use many runs (at least dozens). The distribution of the timings almost always follows a normal distribution due to the central limit theorem. It is sometimes a mixture distribution due to threshold effects (e.g. caching). You can plot the values to see this easily if you notice some outliers in the timings.
If there are threshold effects, then you need to apply a more advanced statistical analysis, but this is complex to do and generally not an expected behaviour. It is generally a sign that the benchmark is biased, that there is a bug, or that there is a complex effect you have to consider during the analysis of the result anyway. Thus, I strongly advise you to fix/mitigate the problem before analysing the results in that case.
Assuming the timings follow a normal distribution, you can just check whether the median is close to the mean and whether the standard deviation is small compared to the gap between the means.
A more formal way to do that is to compute Student's t-test and its associated p-value and check the significance of the p-value (e.g. <5%). If there are more groups, an ANOVA can be used. If you are unsure about the distribution, you can apply non-parametric statistical tests like the Wilcoxon and Kruskal-Wallis tests (note that the statistical power of these tests is not the same). In practice, doing such a formal analysis is time-consuming and generally not so useful compared to a naive basic check (using the mean and standard deviation), unless your modification impacts a lot of users or you plan to write research papers.
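If you do want the formal check, a short sketch along these lines is usually enough (SciPy is assumed to be available, and the timing arrays below are invented placeholders for your measured runs):

import numpy as np
from scipy import stats

before = np.array([100.0, 110.0,  90.0, 98.0, 105.0, 95.0])   # ms, placeholder runs
after  = np.array([ 80.0,  95.0, 105.0, 85.0,  88.0, 82.0])   # ms, placeholder runs

# Parametric check: Welch's t-test (does not assume equal variances).
t_stat, p_t = stats.ttest_ind(before, after, equal_var=False)

# Non-parametric alternative (no normality assumption): the Mann-Whitney U
# test, i.e. the Wilcoxon rank-sum test for independent samples.
u_stat, p_u = stats.mannwhitneyu(before, after, alternative="two-sided")

print(f"t-test p-value: {p_t:.3f}, Mann-Whitney p-value: {p_u:.3f}")

A p-value below the chosen threshold (e.g. 5%) suggests the difference is unlikely to be pure noise; it does not protect against a biased benchmark, which is the next point.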
Keep in mind that using a good statistical analysis does not prevent biased benchmarks. You need to minimize the external factors that can cause biased results. One frequent bias is frequency scaling: the first benchmark can be faster than the second because of turbo-boost, or it can be slower because the processor takes some time to reach a high frequency. Caches also play a huge role in benchmark biases. There are many other factors that can cause biases in practice, such as compiler/runtime versions, environment variables, configuration files, OS/driver updates, memory alignment, OS paging (especially on NUMA systems), the hardware (e.g. thermal throttling), software bugs (it is not rare to find bugs by analysing strange performance behaviours), etc.
As a result, it is critical to make benchmarks as reproducible as possible, by fixing versions and reporting the environment parameters (as well as possibly running the benchmarks in a sandbox if you are paranoid and if it does not affect the timings too much). Software like Nix/Spack helps with packaging, and containers like LXD or Docker can help provide a more reproducible environment.
Many big software teams use automated benchmarking to check for performance regressions. Tools can do the runs properly and the statistical analysis for you regularly. A good example is the NumPy team, which uses a package called Airspeed Velocity (see the results). The PyPy team also designed their own benchmarking tool. The Linux kernel also has benchmarking suites to check for regressions (e.g. PTS), and many companies focused on performance have such automated benchmarking tools (often home-made). There are many existing tools for that.
For more information about this topic, please have a look at the great Performance Matters presentation by Emery Berger.

Single threaded workloads

There is a feeling that the CPUs manufactured today no longer keep up with Moore's Law, at least for single-threaded performance.
I wonder what kinds of workloads we have to worry about if single-threaded performance does not scale.
Breaking text into lines and pages is quite serial work, but on the other hand any human-readable book or page is of quite finite length and is handled well by current text-processing algorithms.
"Bureaucracy code" (the code that makes new versions of Word feel sluggish over decades despite exponential CPU performance increase) do stagnate either in my opinion (the programmers can't handle abitrary complexity and the large software companies finally joined some battle for performance in mobile computing).
So which kinds of algorithms would hurt us if the increase in single-thread performance came to an end?
What you call "bureaucracy code" is not slow because the developers cannot handle the complexity anymore (today's WinWord isn't that much more complex than a decade ago). The reason Word (used as an example here) feels as stagnant as always is that developers do not develop in a vacuum; there is money and time involved. I.e., real-world developers always take shortcuts wherever they can. Word will always be just as sluggish as before, because it will always just skim the line between usability and annoyance when running on an average consumer device. Using more CPU makes it more "sexy" (more sales). Making it faster would cost more while not really increasing sales. The project management has the top-level goal of making as much money as possible.
A contrary example would be NASA using stone-age CPUs for their spacecraft: they are not primarily concerned with money, they have other goals (like stability and fault tolerance), and those goals are not helped by a huge surplus of CPU power.
"Worry" is never useful, in no aspect of life, so I would not worry. Can you rephrase what you actually mean? If you mean to ask if there will be some kind of global problem if/when Moore's Law comes to an end, then no, unless you happen to work in a part of industry that turns obsolete, you have nothing to worry about.

Do I/O operations within a loop decrease performance?

I am confused about how to take input from a file. On many hacking-challenge sites the input will be in the format
-no of test cases
case1
case2
..so on
My question is: what is the best way to read the inputs? Should I read all the cases in one loop, store them in an array and then perform operations on the inputs in a separate loop, OR just use one loop for both, to read and to perform operations on the input? Is there any performance difference between these two methods?
I'd say not significantly. Either way, the same number of operations will be performed; the question is about clumping together "the same kind" of operations. Like AndreyT said, organizing it in stages that act on the same general area of memory might increase performance. It really depends on the kind of input and output you're doing, as well as some operating-system and programming-language specific variables. The question basically comes down to whether "input output input output" is slower than "input input output output", and I think that depends highly on the programming language and the data structures you're using. Look up how to set a timer or stopwatch on some code, and you can test it out for yourself. My hunch is that it will make very little to no difference, but you could be using different data structures than I have in mind, so you'll just have to test it out.
So it could help, but in my experience, unless you're doing some serious number crunching or need highly optimized code, there's a certain point where gaining a fraction of a second isn't necessary. Modern computers run so blindingly fast that you usually have a lot of computational power to spare. Of course, in a case where you're doing something really computationally intensive, every little bit helps. It's all about the trade-off between programming time and running time.
There's no definite answer to your question. Basically, "it depends".
A general performance-affecting principle that should be observed in many cases on modern hardware platforms is as follows: when you attempt to do many unrelated (or loosely related) things at once, involving access to several unrelated regions of memory, the memory locality of your code worsens and the cache behavior of your program worsens as well. Bad memory locality and the consequent poor cache behavior can have a notable negative impact on the performance of your code. For this reason, it is usually a good idea to organize your code in consecutive stages, each stage working within a more-or-less well-defined, localized memory region.
This is a very general principle that is not directly related to input/output. It is just that input/output might prove to be one of those things that make your code do "too many things at once". Whether it will have an impact on the performance of your code really depends on the specifics of your code.
If your code is dominated by slow input/output operations, then its cache performance will not be a significant factor at all. If, on the other hand, your code spends most of its time doing memory-intensive computations, then it might be a good idea to experiment with such things as elimination of I/O operations from the main computation cycles.
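As a rough sketch of the two layouts being discussed (the file path and the per-case work are placeholders; measure both on your own input before choosing):

def process(line):
    # Placeholder for whatever work each test case actually needs.
    return len(line)

# Variant 1: read everything first, then process in a separate loop.
def staged(path):
    with open(path) as f:
        cases = f.readlines()                     # one I/O-heavy stage ...
    return [process(line) for line in cases]      # ... then one compute stage

# Variant 2: read and process inside the same loop.
def interleaved(path):
    results = []
    with open(path) as f:
        for line in f:                            # I/O and computation alternate
            results.append(process(line))
    return results

For the small inputs typical of challenge sites, the difference is usually lost in the noise; the staged version mainly pays off when the processing stage is memory-intensive enough to benefit from running uninterrupted.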
It depends on several factors:
Type of operation: input or output?
Parallelism: synchronous or asynchronous calls?
Degree of parallelism: is there a dependency between iterations?

Optimization! - What is it? How is it done?

It's common to hear about "highly optimized code" or some developer needing to optimize theirs and whatnot. However, as a self-taught, new programmer I've never really understood what exactly people mean when talking about such things.
Care to explain the general idea of it? Also, recommend some reading materials and really whatever you feel like saying on the matter. Feel free to rant and preach.
Optimize is a term we use lazily to mean "make something better in a certain way". We rarely "optimize" something - more, we just improve it until it meets our expectations.
Optimizations are changes we make in the hopes to optimize some part of the program. A fully optimized program usually means that the developer threw readability out the window and has recoded the algorithm in non-obvious ways to minimize "wall time". (It's not a requirement that "optimized code" be hard to read, it's just a trend.)
One can optimize for:
Memory consumption - Make a program or algorithm's runtime size smaller.
CPU consumption - Make the algorithm computationally less intensive.
Wall time - Do whatever it takes to make something faster
Readability - Instead of making your app better for the computer, you can make it easier for humans to read it.
Some common (and overly generalized) techniques to optimize code include:
Change the algorithm to improve performance characteristics. If you have an algorithm that takes O(n^2) time or space, try to replace that algorithm with one that takes O(n * log n).
To relieve memory consumption, go through the code and look for wasted memory. For example, if you have a string intensive app you can switch to using Substring References (where a reference contains a pointer to the string, plus indices to define its bounds) instead of allocating and copying memory from the original string.
To relieve CPU consumption, cache as many intermediate results as you can. For example, if you need to calculate the standard deviation of a set of data, save that single numerical result instead of looping through the set each time you need to know the std dev (a small sketch follows below).
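As a tiny, hypothetical sketch of the caching idea from the last bullet (the class and the data are made up for illustration; it assumes the underlying values are not mutated after construction):

import statistics

class Samples:
    def __init__(self, values):
        self.values = list(values)
        self._stdev = None                 # cached intermediate result

    def stdev(self):
        # Compute the standard deviation once and reuse it on later calls,
        # instead of re-scanning the whole data set every time.
        if self._stdev is None:
            self._stdev = statistics.stdev(self.values)
        return self._stdev

data = Samples([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(data.stdev())    # computed on first use
print(data.stdev())    # served from the cache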
I'll mostly rant with no practical advice.
Measure First. Optimization should be done in places where it matters. Highly optimized code is often difficult to maintain and a source of problems. In places where the code does not slow down execution anyway, I always prefer maintainability to optimization. Familiarize yourself with profiling, both intrusive (instrumented) and non-intrusive (low-overhead, statistical). Learn to read a profiled stack, understand where the inclusive/exclusive time is spent, why certain patterns show up, and how to identify the trouble spots.
You can't fix what you cannot measure. Have your program report, through some performance infrastructure, the things it does and the times it takes. I come from a Win32 background, so I'm used to Performance Counters, and I'm extremely generous at sprinkling them all over my code. I even automated the code that generates them.
And finally some words about optimizations. Most discussion about optimization I see focuses on stuff any compiler will optimize for you for free. In my experience, the greatest source of gains for 'highly optimized code' lies completely elsewhere: memory access. On modern architectures the CPU is idling most of the time, waiting for memory to be served into its pipelines. Between L1 and L2 cache misses, TLB misses, NUMA cross-node access, and even hard page faults that must fetch the page from disk, the memory-access pattern of a modern application is the single most important optimization one can make. I'm exaggerating slightly; of course there will be counter-example workloads that will not benefit from these memory-access-locality techniques. But most applications will. To be specific, what these techniques mean is simple: cluster your data in memory so that a single CPU can work on a tight memory range containing all it needs, with no expensive referencing of memory outside your cache lines or your current page. In practice this can mean something as simple as accessing an array by rows rather than by columns.
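To make that last point concrete, here is a small sketch (NumPy assumed; NumPy arrays are row-major by default, so the row-wise walk touches memory sequentially while the column-wise walk jumps a whole row length between elements):

import numpy as np

m = np.zeros((2048, 2048))

def sum_by_rows(matrix):
    total = 0.0
    for i in range(matrix.shape[0]):
        total += matrix[i, :].sum()     # contiguous memory, cache-friendly
    return total

def sum_by_cols(matrix):
    total = 0.0
    for j in range(matrix.shape[1]):
        total += matrix[:, j].sum()     # strided memory, far more cache misses
    return total

Both functions return the same number; only the access pattern, and therefore the cache behaviour, differs.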
I would recommend you read the Alpha-Sort paper presented at the VLDB conference in 1995. This paper presented how cache-sensitive algorithms designed specifically for modern CPU architectures can blow the previous benchmarks out of the water:
We argue that modern architectures require algorithm designers to re-examine their use of the memory hierarchy. AlphaSort uses clustered data structures to get good cache locality...
The general idea is that when you create your source tree during compilation, after parsing but before generating the code, you do an additional step (optimization) where, based on certain heuristics, you collapse branches together, delete branches that aren't used, or add extra nodes for temporary variables that are used multiple times.
Think of stuff like this piece of code:
a=(b+c)*3-(b+c)
which gets translated into
          -
        /   \
       *     +
      / \   / \
     +   3 b   c
    / \
   b   c
To a parser it would be obvious that the two + nodes with their descendants are identical, so they would be merged into a temporary variable, t, and the tree would be rewritten:
       -
      / \
     *   t
    / \
   t   3
Now an even better parser would see that since t is an integer, the tree could be further simplified to:
     *
    / \
   t   2
and the intermediary code that you'd run your code generation step on would finally be
int t=b+c;
a=t*2;
with t marked as a register variable, which is exactly what would be written for assembly.
One final note: you can optimize for more than just run time speed. You can also optimize for memory consumption, which is the opposite. Where unrolling loops and creating temporary copies would help speed up your code, they would also use more memory, so it's a trade off on what your goal is.
Here is an example of some optimization (fixing a poorly made decision) that I did recently. It's very basic, but I hope it illustrates that good gains can be made even from simple changes, and that 'optimization' isn't magic; it's just about making the best decisions to accomplish the task at hand.
In an application I was working on there were several LinkedList data structures that were being used to hold various instances of foo.
When the application was in use, it was very frequently checking whether the LinkedList contained object X. As the number of X's started to grow, I noticed that the application was performing more slowly than it should have been.
I ran a profiler and realized that each 'myList.Contains(x)' call was O(N), because the list has to iterate through each item it contains until it reaches the end or finds a match. This was definitely not efficient.
So what did I do to optimize this code? I switched most of the LinkedList data structures to HashSets, which can do a '.Contains(X)' call in O(1) - much better.
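The same idea expressed in Python terms, since the principle is language-independent (the container sizes are arbitrary and only there to make the gap visible): a list scan is O(N) per lookup, a hash-based set is O(1).

import timeit

items   = list(range(100_000))
as_list = list(items)
as_set  = set(items)

needle = 99_999      # worst case for the list: the element sits at the very end

print(timeit.timeit(lambda: needle in as_list, number=1_000))   # O(N) scan per call
print(timeit.timeit(lambda: needle in as_set,  number=1_000))   # O(1) hash lookup per call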
This is a good question.
Usually the best practice is 1) just write the code to do what you need it to do, 2) then deal with performance, but only if it's an issue. If the program is "fast enough" it's not an issue.
If the program is not fast enough (like it makes you wait) then try some performance tuning. Performance tuning is not like programming. In programming, you think first and then do something. In performance tuning, thinking first is a mistake, because that is guessing.
Don't guess what to fix; diagnose what the program is doing.
Everybody knows that, but most people still guess anyway.
It is natural to say "Could be the problem is X, Y, or Z" but only the novice acts on guesses. The pro says "but I'm probably wrong".
There are different ways to diagnose performance problems.
The simplest is just to single-step through the program at the assembly-language level, and don't take any shortcuts. That way, if the program is doing unnecessary things, then you are doing the same things, and it will become painfully obvious.
Another is to get a profiling tool, and as others say, measure, measure, measure.
Personally I don't care for measuring. I think it's a fuzzy microscope for the purpose of pinpointing performance problems. I prefer this method, and this is an example of its use.
Good luck.
ADDED: I think you will find, if you go through this exercise a few times, you will learn what coding practices tend to result in performance problems, and you will instinctively avoid them. (This is subtly different from "premature optimization", which is assuming at the beginning that you must be concerned about performance. In fact, you will probably learn, if you don't already know, that premature concern about performance can well cause the very problem it seeks to avoid.)
Optimizing a program means: make it run faster
The only way of making the program faster is making it do less:
find an algorithm that uses fewer operations (e.g. N log N instead of N^2; a small sketch follows after this list)
avoid slow components of your machine (keep objects in cache instead of in main memory, or in main memory instead of on disk); reducing memory consumption nearly always helps!
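A tiny illustration of the first rule above (duplicate detection on made-up data): the nested-loop version performs O(N^2) comparisons, while sorting first brings the cost down to O(N log N).

def has_duplicates_quadratic(items):
    # Compare every pair: O(N^2) operations in the worst case.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicates_sorted(items):
    # Sort once (O(N log N)); after that, duplicates must be neighbours.
    ordered = sorted(items)
    return any(a == b for a, b in zip(ordered, ordered[1:]))

data = list(range(5000)) + [42]          # made-up input containing one duplicate
assert has_duplicates_quadratic(data) == has_duplicates_sorted(data) == True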
Further rules:
In looking for optimization opportunities, adhere to the 80-20-rule: 20% of typical program code accounts for 80% of execution time.
Measure the time before and after every attempted optimization; often enough, attempted optimizations don't actually help.
Only optimize after the program runs correctly!
Also, there are ways to make a program appear to be faster:
separate GUI event processing from back-end tasks; prioritize user-visible changes over back-end calculation to keep the front-end "snappy"
give the user something to read while performing long operations (ever noticed the slideshows displayed by installers?)
However, as a self-taught, new programmer I've never really understood what exactly people mean when talking about such things.
Let me share a secret with you: nobody does. There are certain areas where we know mathematically what is and isn't slow. But for the most part, performance is too complicated to be able to understand. If you speed up one part of your code, there's a good possibility you're slowing down another.
Therefore, when anyone tells you that one method is faster than another, there's a good possibility they're just guessing, unless one of three things is true:
They have data
They're choosing an algorithm that they know is faster mathematically.
They're choosing a data structure that they know is faster mathematically.
Optimization means trying to improve computer programs for such things as speed. The question is very broad, because optimization can involve compilers improving programs for speed, or human beings doing the same.
I suggest you read a bit of theory first (from books, or Google for lecture slides):
Data structures and algorithms - what the O() notation is, how to calculate it,
what datastructures and algorithms can be used to lower the O-complexity
Book: Introduction to Algorithms by Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest
Compilers and assembly - how code is translated to machine instructions
Computer architecture - how the CPU, RAM, Cache, Branch predictions, out of order execution ... work
Operating systems - kernel mode, user mode, scheduling processes/threads, mutexes, semaphores, message queues
After reading a bit of each, you should have a basic grasp of all the different aspects of optimization.
Note: I wiki-ed this so people can add book recommendations.
I am going with the idea that optimizing code means getting the same results in less time. And "fully optimized" only means they ran out of ideas to make it faster. I throw large buckets of scorn on claims of "fully optimized" code! There's no such thing.
So you want to make your application/program/module run faster? The first thing to do (as mentioned earlier) is to measure, also known as profiling. Do not guess where to optimize. You are not that smart and you will be wrong. My guesses are wrong all the time, and large portions of my year are spent profiling and optimizing. So get the computer to do it for you. For the PC, VTune is a great profiler. I think VS2008 has a built-in profiler, but I haven't looked into it. Otherwise, measure functions and large pieces of code with performance counters. You'll find sample code for using performance counters on MSDN.
So where are your cycles going? You are probably waiting for data coming from main memory. Go read up on L1 & L2 caches. Understanding how the cache works is half the battle. Hint: Use tight, compact structures that will fit more into a cache-line.
Optimization is lots of fun. And it's never ending too :)
A great book on optimization is Write Great Code: Understanding the Machine by Randall Hyde.
Make sure your application produces correct results before you start optimizing it.

What are some hints that an algorithm should be parallelized?

My experience thus far has shown me that even with multi-core processors, parallelizing an algorithm won't always speed it up noticeably. In fact, sometimes it can slow things down. What are some good hints that an algorithm can be sped up significantly by being parallelized?
(Of course given the caveats with premature optimization and their correlation to evil)
To gain the most benefit from parallelisation, a task should be able to be broken into similar-sized, coarse-grain chunks that are independent (or mostly so) and require little communication of data or synchronisation between the chunks.
Fine-grain parallelisation almost always suffers from increased overheads and will have a finite speed-up regardless of the number of physical cores available.
[The caveat to this is those architectures that have a very large number of 'cores' (such as the Connection Machine's 64K cores). These are well suited to calculations that can be broken into relatively simple actions assigned to a particular topology (like a rectangular mesh).]
If you can divide the work into independent parts then it may be parallelized well.
Remember also Amdahl's Law, which is a sobering reminder of how little we can expect in terms of performance gains by adding more cores to most programs.
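A minimal sketch of the 'independent parts' case (the work function and chunk size are placeholders; the real gain depends on the chunks being coarse enough to amortise the process start-up overhead that Amdahl's Law accounts for):

from multiprocessing import Pool

def work(chunk):
    # Placeholder for a coarse-grained, independent piece of work.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]

    with Pool() as pool:
        partials = pool.map(work, chunks)    # each chunk is processed independently

    print(sum(partials))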
First, check out this paper by the late Jim Gray:
Distributed Computing Economics
Actually, this will clear up some misunderstanding based on what you wrote in the question. Obviously, the less amenable your problem set is to being discretized, the more difficult it will be.
Any time you have computations that depend on previous computations, it is not a parallel problem. Things like linear image processing, brute force methods, and genetic algorithms are all easily parallelized.
A good analogy is what could you work on that you could get a bunch of friends to do different parts at once? For example, putting ikea furniture together might parallelize well if different people can work on different sections, but rolling wallpaper might not because you need to do walls in sequence.
If you're doing large matrix computations, like simulations involving finite-element models, these can often be broken down into smaller pieces in straightforward ways. Matrix-vector multiplies can benefit well from parallelization, assuming you are dealing with very large matrices. Unless there is a real performance bottleneck that is causing code to run slowly, it's probably not necessary to hassle with parallel processing.
Well, if you need lots of locks for it to work, then it's probably one of those difficult algorithms that doesn't parallelise well. Is there any part of the algorithm that can be broken up into separate parts that don't need to touch each other?
