Recall Amdahl’s law on estimating the best possible speedup. Answer the following questions.
You have a program that has 40% of its code parallelized on three processors, and just for this fraction of code, a speedup of 2.3 is achieved. What is the overall speedup?
I'm having trouble understanding the difference between speedup and overall speedup in this question. I know there must be a difference by the way this question is worded.
Q : What is the overall speedup?
A good starting point is not the original, trivial Amdahl's law formula but a more contemporary view that extends it: one that discusses add-on overhead costs and also explains the atomicity of split work.
Two sections, one accelerated by a "local" speedup, one overall result
Your problem formulation seems to bypass the real-world process-orchestration overheads explained there by simply postulating a net, local speedup. The add-on overhead costs of implementing the <PAR>-able section under review become "hidden": they show up only as an inefficiency, in that three times the code-execution resources yield only a 2.3x speedup, not 3.0x. The parallel section therefore takes more than the theoretical 1/3 of its original time, because it also pays for the initial setup (an add-on overhead not present in pure-[SERIAL] code execution), then the parallel processing itself (doing the useful work, now on triple the code-execution capacity), and finally the termination and collection of results back into the "main" code (again add-on overheads not present in pure-[SERIAL] code execution).
"Hiding" these natural costs of going into and out of [PARALLEL] code-execution sections simplifies the homework. A proper understanding of the real-life costs is nevertheless crucial, so as not to spend far more (on setup and all the other add-on overheads that are unavoidable in the real world) than one will ever get back from the hoped-for speedup of splitting the work across many processors.
|-------> time
|START:
|                                                                                        |DONE: 100% of the code
|                                                     |                                  |
|______________________________________<SEQ>______60%_|_40%__________________<PAR>-able__|
o--------------------------------------<SEQ>----------o----------------------<PAR>-able--o CPU_x runs both <SEQ> and <PAR>-able sections of code, in a pure [SERIAL] process-flow orchestration, one after another
|                                                     |
|                                                     |
|-------> time
|START:                                               |
|                                                     |         |DONE: 100% of the code  :
o--------------------------------------<SEQ>----------o         |                        :
|                                                     o---------o .. .. .. .. .. .. .. .. CPU_1 runs <PAR>'d code
|                                                     o---------o .. .. .. .. .. .. .. .. CPU_2 runs <PAR>'d code
|                                                     o---------o .. .. .. .. .. .. .. .. CPU_3 runs <PAR>'d code
|                                                     |         |
|                                                     |         |
|                                                     <_not_1/3_> just ~ 2.3x faster (not 3x) perhaps reflects real costs (penalisations) of new, add-on, process-organisation related setup + termination overheads
|______________________________________<SEQ>______60%_|_________| ~ 40% / 2.3x ~ 17.39%, i.e. the <PAR>-section has gained a local ("net"-section) speedup of 2.3x instead of the 3.0x achievable on 3 CPU code-execution streams
|                                                     |         |
Net overall speedup (if no other process-organisation related add-on overhead costs were accrued) is:
    ( 60% + ( 40% / 1.0 ) )
    ----------------------- ~ 1.2921 x
    ( 60% + ( 40% / 2.3 ) )
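The same arithmetic can be checked with a few lines of Python. This is only a minimal sketch of the formula above; the function name overall_speedup and its parameter names are made up for illustration and assume, as the question does, that the 2.3x figure already includes all overheads of the parallel section.

```python
def overall_speedup(parallel_fraction, local_speedup):
    """Overall speedup when only `parallel_fraction` of the original
    run time is accelerated by a factor of `local_speedup`."""
    serial_fraction = 1.0 - parallel_fraction
    # New run time, relative to an original run time of 1.0.
    new_time = serial_fraction + parallel_fraction / local_speedup
    return 1.0 / new_time


if __name__ == "__main__":
    # 40% of the code sped up 2.3x (the question's numbers).
    print(round(overall_speedup(0.40, 2.3), 4))  # 1.2921
    # Same split with an ideal 3x on three processors, for comparison.
    print(round(overall_speedup(0.40, 3.0), 4))  # 1.3636
```

So "speedup" in the question is the local figure for the 40% section only, while "overall speedup" is what the whole program gains once the untouched 60% is counted too.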
I've been reading about an interesting machine learning algorithm, MARS (multivariate adaptive regression splines).
As far as I understand the algorithm, from Wikipedia and Friedman's papers, it works in two stages: a forward pass and a backward pass. I'll ignore the backward pass for now, since the forward pass is the part I'm interested in. The steps of the forward pass, as far as I can tell, are:
1. Start with just the mean of the data.
2. Generate a new term pair, through exhaustive search.
3. Repeat 2 while improvements are being made.
And to generate a term pair MARS appears to do the following:
1. Select an existing term (e)
2. Select a variable (x)
3. Select a value of that variable (v)
4. Return two terms, one of the form e*max(0, x-v) and the other of the form e*max(0, v-x)
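The term-pair step as described in this question can be sketched as follows. This is only an illustration of the asker's description, not a MARS implementation: the names constant_term and make_term_pair are hypothetical, the exhaustive search over (e, x, v) is left out, and the terms carry no coefficients.

```python
def constant_term(row):
    """The starting term: a constant basis function."""
    return 1.0


def make_term_pair(e, x, v):
    """Return the two mirrored hinge terms built from an existing
    term `e`, a variable name `x` and a knot value `v`."""
    def right(row):
        return e(row) * max(0.0, row[x] - v)

    def left(row):
        return e(row) * max(0.0, v - row[x])

    return right, left


if __name__ == "__main__":
    # The small data table from the question (columns A, B, Z).
    table = [
        {"A": 5, "B": 6, "Z": 1},
        {"A": 7, "B": 2, "Z": 2},
        {"A": 3, "B": 1, "Z": 3},
    ]
    right, left = make_term_pair(constant_term, "B", 1.0)
    print([right(row) for row in table])  # [5.0, 1.0, 0.0]
    print([left(row) for row in table])   # [0.0, 0.0, 0.0]
```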
And this makes sense to me. I could see how, for example, a data table like this:
+---+---+---+
| A | B | Z |
+---+---+---+
| 5 | 6 | 1 |
| 7 | 2 | 2 |
| 3 | 1 | 3 |
+---+---+---+
could produce terms like 2*max(0, B-1) or even 8*max(0, B-1)*max(0, 3-A). However, the Wikipedia page has an example that I don't understand. It has an ozone example where the first term is 25. However, it also has a term in the final regression whose coefficient is negative and fractional. I don't see how this is possible: since the initial term is positive, you can only multiply by previous terms, and no previous term can have a negative coefficient, I don't see how you could ever end up with one.
What am I missing?
As I see it, either I misunderstand term generation, or I misunderstand the simplification process. However, simplification as described seems to only delete terms, not modify them. Can you see what I am missing here?
I'm working on a CUDA app that makes use of all available RAM on the card, and am trying to figure out different ways to reduce cache misses.
The problem domain consists of a large 2- or 3-D grid, depending on the type of problem being solved. (For those interested, it's an FDTD simulator). Each element depends on either two or four elements in "parallel" arrays (that is, another array of nearly identical dimensions), so the kernels must access either three or six different arrays.
The Problem
*Hopefully this isn't "too localized". Feel free to edit the question
The relationship between the three arrays can be visualized as follows (apologies for the mediocre ASCII art):
A[0,0] -C[0,0]- A ---- C ---- A ---- C ---- A
|               |             |             |
|               |             |             |
B[0,0]          B             B             B
|               |             |             |
|               |             |             |
A ----- C ----- A ---- C ---- A ---- C ---- A
|               |             |             |
|               |             |             |
B               B             B             B
|               |             |             |
|               |             |             |
A ----- C ----- A ---- C ---- A ---- C ---- A
|               |             |             |
|               |             |             |
B               B             B             B[3,2]
|               |             |             |
|               |             |             |
A ----- C ----- A ---- C ---- A ---- C ---- A[3,3]
                                   [2,3]
Items connected by lines are coupled. As can be seen above, A[] depends on both B[] and C[], while B[] depends only on A[], as does C[]. All of A[] is updated in the first kernel, and all of B[] and C[] are updated in a second pass.
If I declare these arrays as simple 2D arrays, I wind up with strided memory access. For a very large domain size (3x3 +- 1 in the grid above), this causes occupancy and performance deficiencies.
So, I thought about rearranging the array layout in a Z-order curve.
Also, it would be fairly trivial to interleave these into one array, which should improve fetch performance since (depending on the interleave order) at least half of the elements required for a given cell update would be close to one another. However, it's not clear to me whether the GPU uses multiple data pointers when accessing multiple arrays. If so, this imagined benefit could actually be a hindrance.
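To make the layout idea concrete, here is a small host-side sketch in Python (not CUDA) of a 2D Z-order (Morton) index and of one possible interleaving of the three arrays. The function names, the 16-bit coordinate limit and the "A, B, C stored back to back per cell" interleave order are all assumptions for illustration, not the layout actually used in the simulator.

```python
def part1by1(n):
    """Spread the low 16 bits of n so that a zero bit sits between
    every pair of original bits (e.g. 0b11 becomes 0b0101)."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton2d(x, y):
    """Z-order index of cell (x, y): bits of x and y interleaved."""
    return part1by1(x) | (part1by1(y) << 1)


def interleaved_offset(x, y, field, fields=3):
    """Offset into one flat array holding `fields` values per cell
    (field 0 = A, 1 = B, 2 = C in this hypothetical layout)."""
    return morton2d(x, y) * fields + field


if __name__ == "__main__":
    # The first 2x2 quadrant is visited in a "Z" shape: 0, 1, 2, 3.
    print([morton2d(x, y) for y in range(2) for x in range(2)])
    # A, B and C of cell (1, 1) land next to each other: 9, 10, 11.
    print([interleaved_offset(1, 1, f) for f in range(3)])
```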
The Questions
I've read that NVIDIA does this automatically behind the scenes when using texture memory, or a cudaArray. If this is not the case, should I expect the increased latency when crossing large spans (when the Z curve goes from upper right to bottom left at a high subdivision level) to eliminate the benefit of the locality in smaller grids?
Dividing the grid into smaller blocks that can fit in shared memory should certainly help, and the Z order makes this fairly trivial. Should I have a separate kernel pass that updates boundaries between blocks? Will the overhead of launching another kernel be significant compared to the savings I expect?
Is there any real benefit to using a 2D vs 1D array? I expect memory to be linear, but am unsure if there is any real meaning to the 2D memory layout metaphor that's often used in CUDA literature.
Wow - long question. Thanks for reading and answering any/all of this.
Just to get this off of the unanswered list:
After a lot of benchmarking and playing with different arrangements, the fastest approach I found was to keep the arrays interleaved in Z-order so that most of the values required by a thread were located near each other in RAM. This improved cache behavior (and thus performance). Obviously there are many cases where Z-order fails to keep required values close together. I wonder whether rotating quadrants to reduce the "distance" between the end of a Z and the next quadrant would help, but I haven't tried that.
Thanks to everyone for the advice.
I have a mechanism in place to find the execution time and memory utilization of a program.
I have a list of programs (source code) and I need to find the best performing among them.
prog | memory(kb) | time(sec)
1    | 1200       | 0.05
2    | 2200       | 0.10
3    | 1970       | 0.55
Is there a formula?
I will not answer your question directly, since this smells like homework ;P
But I will give you a hint on what to read in order to solve this:
http://en.wikipedia.org/wiki/Big_O_notation
Good luck
If it is a Java program, you can use JProfiler.
I have a script in a game with a function that gets called every second. Distances between player objects and other game objects are calculated every second there. The problem is that there can theoretically be 800 function calls in 1 second (max 40 players * 2 main objects, each with 1 to 10 sub-objects). I have to optimize this function to reduce the processing. This is my current function:
local square = math.sqrt;
local getDistance = function(a, b)
    local x, y, z = a.x-b.x, a.y-b.y, a.z-b.z;
    return square(x*x+y*y+z*z);
end;
-- for example followed by: for i = 1, 800 do getDistance(posA, posB); end
I found out that localizing the math.sqrt function through
local square = math.sqrt;
is a big optimization with regard to speed, and that the code
x*x+y*y+z*z
is faster than this code:
x^2+y^2+z^2
I don't know whether localizing x, y and z is better than using the field access "." twice, so maybe computing square((a.x-b.x)*(a.x-b.x)+(a.y-b.y)*(a.y-b.y)+(a.z-b.z)*(a.z-b.z)) directly is better than the code
local x, y, z = a.x-b.x, a.y-b.y, a.z-b.z;
square(x*x+y*y+z*z);
Is there a better way in maths to calculate the vector length or are there more performance tips in Lua?
You should read Roberto Ierusalimschy's Lua Performance Tips (Roberto is the chief architect of Lua). It touches on some of the small optimizations you're asking about (such as localizing library functions and replacing exponents with their multiplicative equivalents). Most importantly, it conveys one of the most important and overlooked ideas in engineering: sometimes the best solution involves changing your problem. You're not going to fix a 30-million-calculation leak by reducing the number of CPU cycles the calculation takes.
In your specific case of distance calculation, you'll find it's best to make your primitive calculation return the intermediate sum representing squared distance and allow the use case to call the final Pythagorean step only if they need it, which they often don't (for instance, you don't need to perform the square root to compare which of two squared lengths is longer).
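The idea is language-independent, so here is a minimal sketch in Python rather than Lua. The function names and the use of plain (x, y, z) tuples are assumptions for illustration; the point is only that comparisons and range checks never need the square root.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two (x, y, z) tuples."""
    dx, dy, dz = a[0] - b[0], a[1] - b[1], a[2] - b[2]
    return dx * dx + dy * dy + dz * dz


def is_closer(origin, p, q):
    """True if p is strictly closer to origin than q is.
    Squared lengths order the same way as the lengths themselves."""
    return squared_distance(origin, p) < squared_distance(origin, q)


def within_range(a, b, radius):
    """Range check without sqrt: compare against radius squared."""
    return squared_distance(a, b) <= radius * radius


if __name__ == "__main__":
    player = (0.0, 0.0, 0.0)
    print(is_closer(player, (1.0, 2.0, 2.0), (3.0, 4.0, 0.0)))  # True
    print(within_range(player, (3.0, 4.0, 0.0), 5.0))           # True
```

A caller that really needs the distance itself can still take the square root of the returned value once, at the point of use.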
This really should come before any discussion of optimization, though: don't worry about problems that aren't the problem. Rather than scouring your code for any possible issues, jump directly to fixing the biggest one. And even if performance does rank above missing functionality, bugs and UX shortcomings as your most glaring issue, it's nigh-impossible for micro-inefficiencies to have piled up to the point of outweighing a single bottleneck statement.
Or, as the opening of the cited article states:
In Lua, as in any other programming language, we should always follow the two
maxims of program optimization:
Rule #1: Don’t do it.
Rule #2: Don’t do it yet. (for experts only)
I honestly doubt these kinds of micro-optimizations really help much.
You should be focusing on your algorithms instead: for example, get rid of some distance calculations through pruning, stop calculating the square roots of values that are only compared (tip: if a^2<b^2 and a>0 and b>0, then a<b), etc.
Your "brute force" approach doesn't scale well.
What I mean by that is that every new object/player included in the system increases the number of operations significantly:
+---------+--------------+
| objects | calculations |
+---------+--------------+
|      40 |         1600 |
|      45 |         2025 |
|      50 |         2500 |
|      55 |         3025 |
|      60 |         3600 |
|     ... |          ... |
|     100 |        10000 |
+---------+--------------+
If you keep comparing "everything with everything", your algorithm will take more and more CPU cycles, growing quadratically with the number of objects.
The best option you have for optimizing your code is not in "fine tuning" the math operations or using local variables instead of references.
What will really boost your algorithm will be eliminating calculations that you don't need.
The most obvious example would be not calculating the distance between Player1 and Player2 if you already have calculated the distance between Player2 and Player1. This simple optimization should reduce your time by a half.
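A minimal Python sketch of that halving, assuming the objects are simple (x, y, z) tuples and that a distance is symmetric. The helper names are made up for illustration; itertools.combinations yields each unordered pair exactly once and never pairs an object with itself.

```python
import math
from itertools import combinations


def distance(a, b):
    """Euclidean distance between two (x, y, z) tuples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def pairwise_distances(objects):
    """Map each unordered index pair (i, j), i < j, to its distance."""
    return {
        (i, j): distance(objects[i], objects[j])
        for i, j in combinations(range(len(objects)), 2)
    }


if __name__ == "__main__":
    objects = [(float(i), 0.0, 0.0) for i in range(40)]
    result = pairwise_distances(objects)
    # 40 * 39 / 2 = 780 distances instead of 40 * 40 = 1600.
    print(len(result))     # 780
    print(result[(0, 3)])  # 3.0
```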
Another very common implementation consists of dividing the space into "zones". When two objects are in the same zone, you calculate the distance between them normally. When they are in different zones, you use an approximation. The ideal way of dividing the space will depend on your context; an example would be dividing the space into a grid and, for players on different squares, using the distance between the centers of their squares (which you have computed in advance).
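The grid variant of this zoning idea might look like the following Python sketch. It is 2D for brevity, the cell size of 10 units is an arbitrary assumption, and the cache stands in for the "computed in advance" table of center-to-center distances; none of these choices come from the original game.

```python
import math
from functools import lru_cache

CELL_SIZE = 10.0  # hypothetical zone size; tune for the actual game world


def cell_of(p):
    """Grid cell (column, row) containing the 2D point p."""
    return (int(p[0] // CELL_SIZE), int(p[1] // CELL_SIZE))


@lru_cache(maxsize=None)
def center_distance(c1, c2):
    """Distance between the centers of two cells, cached so that each
    pair of cells is computed only once."""
    return math.hypot((c1[0] - c2[0]) * CELL_SIZE, (c1[1] - c2[1]) * CELL_SIZE)


def zoned_distance(p, q):
    """Exact distance inside one cell, approximate across cells."""
    cp, cq = cell_of(p), cell_of(q)
    if cp == cq:
        return math.hypot(p[0] - q[0], p[1] - q[1])
    return center_distance(cp, cq)


if __name__ == "__main__":
    print(zoned_distance((1.0, 1.0), (4.0, 5.0)))   # same cell, exact: 5.0
    print(zoned_distance((1.0, 1.0), (25.0, 1.0)))  # two cells apart: 20.0
```

The approximation error across cells is bounded by the cell size, so this only suits uses that tolerate that much imprecision.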
There's a whole branch in programming dealing with this issue; It's called Space Partitioning. Give this a look:
http://en.wikipedia.org/wiki/Space_partitioning
Seriously?
Running 800 of those calculations should not take more than 0.001 second - even in Lua on a phone.
Did you do some profiling to see if it's really slowing you down? Did you replace that function with "return (0)" to verify that performance improves (yes, the functionality will be lost)?
Are you sure it's run every second and not every millisecond?
I haven't seen an issue running 800 of anything simple in 1 second since about 1987.
If you want to calculate sqrt for a positive number a, take the recursive sequence
x_0 = a
x_(n+1) = 1/2 * (x_n + a / x_n)
x_n converges to sqrt(a) as n -> infinity. The first several iterations should be fast enough.
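The recurrence above can be written out directly. This is a sketch in Python for illustration, with a hypothetical function name and an arbitrary default of 6 iterations; starting from x_0 = a, large inputs need more iterations to reach the same accuracy.

```python
def approx_sqrt(a, iterations=6):
    """Approximate sqrt(a) with x_(n+1) = (x_n + a / x_n) / 2, x_0 = a."""
    if a < 0:
        raise ValueError("a must be non-negative")
    if a == 0:
        return 0.0  # avoid dividing by zero in the recurrence
    x = float(a)
    for _ in range(iterations):
        x = 0.5 * (x + a / x)
    return x


if __name__ == "__main__":
    for steps in (1, 2, 3, 4):
        print(steps, approx_sqrt(2.0, steps))  # approaches 1.41421356...
```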
By the way, you might try the following formula for the length of the vector instead of the standard one.
local getDistance = function(a, b)
    local x, y, z = a.x-b.x, a.y-b.y, a.z-b.z;
    -- sum of absolute differences (Manhattan distance), so the result is never negative
    return math.abs(x) + math.abs(y) + math.abs(z);
end;
It's much easier to compute, and in some cases (e.g. if the distance is only needed to know whether two objects are close) it may be adequate.
At work we are looking into common problems that lead to high cyclomatic complexity. For example, having a large if-else statement can lead to high cyclomatic complexity, but can be resolved by replacing conditionals with polymorphism. What other examples have you found?
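As a concrete picture of that refactoring, here is a small Python sketch. The shape example and all names in it are invented for illustration; the first function grows by one branch for every new kind, while the polymorphic version keeps each method free of branching.

```python
import math
from abc import ABC, abstractmethod


def area_conditional(kind, size):
    """Before: one decision point per supported kind."""
    if kind == "circle":
        return math.pi * size * size
    elif kind == "square":
        return size * size
    elif kind == "triangle":
        return math.sqrt(3) / 4 * size * size  # equilateral triangle
    raise ValueError("unknown kind: " + kind)


class Shape(ABC):
    """After: each kind is a class and carries its own formula."""

    def __init__(self, size):
        self.size = size

    @abstractmethod
    def area(self):
        ...


class Circle(Shape):
    def area(self):
        return math.pi * self.size * self.size


class Square(Shape):
    def area(self):
        return self.size * self.size


class Triangle(Shape):
    def area(self):
        return math.sqrt(3) / 4 * self.size * self.size


def total_area(shapes):
    """No branching on the kind: dispatch picks the right formula."""
    return sum(shape.area() for shape in shapes)


if __name__ == "__main__":
    print(total_area([Square(2), Square(3)]))  # 13
```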
See NDepend's definition of Cyclomatic Complexity.
Nesting Depth is also a great code metric.
Cyclomatic complexity is a popular procedural software metric equal to the number of decisions that can be taken in a procedure. Concretely, in C# the CC of a method is 1 + {the number of following expressions found in the body of the method}:
if | while | for | foreach | case | default | continue | goto | && | || | catch | ternary operator ?: | ??
Following expressions are not counted for CC computation:
else | do | switch | try | using | throw | finally | return | object creation | method call | field access
Adapted to the OO world, this metric is defined both on methods and classes/structures (as the sum of its methods CC). Notice that the CC of an anonymous method is not counted when computing the CC of its outer method.
Recommendations: Methods where CC is higher than 15 are hard to understand and maintain. Methods where CC is higher than 30 are extremely complex and should be split in smaller methods (except if they are automatically generated by a tool).
Another way to avoid using so many ifs is to implement a Finite State Machine. Because events fire transitions, the conditionals become implicit, and clearer, in the transitions that change the state of the system. The control flow is easier to follow.
Here is a link that mentions some of its benefits:
http://www.skorks.com/2011/09/why-developers-never-use-state-machines/
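A table-driven state machine along these lines could be sketched in Python as below. The states, events and function names are invented for illustration; the point is that the transition table replaces a nest of if/else checks on (state, event) combinations with a single lookup.

```python
# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    ("idle", "start"): "running",
    ("running", "pause"): "paused",
    ("paused", "resume"): "running",
    ("running", "stop"): "idle",
    ("paused", "stop"): "idle",
}


def next_state(state, event):
    """Look up the transition; reject events that are not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError("event %r not allowed in state %r" % (event, state))


def run(events, state="idle"):
    """Feed a sequence of events through the machine."""
    for event in events:
        state = next_state(state, event)
    return state


if __name__ == "__main__":
    print(run(["start", "pause", "resume"]))  # running
    print(run(["start", "pause", "stop"]))    # idle
```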