I have a mechanism in place to find the execution time and memory utilization of a program.
I have a list of programs (source code) and I need to find the best-performing one among them.
prog | memory (KB) | time (sec)
1    | 1200        | 0.05
2    | 2200        | 0.10
3    | 1970        | 0.55
Is there a formula?
I will not answer your question directly, since this smells like homework ;P
But I will give you a hint on what to read in order to solve this:
http://en.wikipedia.org/wiki/Big_O_notation
Good luck
If it is a Java program, you can use JProfiler.
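One way to read the table in the question without inventing a formula is a dominance check: a program is "best" outright if no other program beats it on either metric. The sketch below is only an illustration in Python; the function names are hypothetical, and the data is the three rows from the question.

```python
# Measurements from the question: (program, memory in KB, time in seconds).
runs = [(1, 1200, 0.05), (2, 2200, 0.10), (3, 1970, 0.55)]


def dominates(a, b):
    """True if run a is no worse than b on both metrics and better on at least one."""
    no_worse = a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Return the runs that no other run dominates."""
    return [
        run
        for run in candidates
        if not any(dominates(other, run) for other in candidates if other is not run)
    ]


best = pareto_front(runs)
print(best)  # [(1, 1200, 0.05)]
```

Here program 1 uses the least memory and the least time, so it is the only one left. When several programs survive this check, you need to decide how much memory is worth how much time, and that weighting depends on your use case.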
I have a 10 TB file with words from multiple books, and I'm trying to grep for some uncommon strings (no regex). For example:
grep "cappucino" filename
I'm trying to estimate how long this will take. I'm not really looking for whether it's the right approach or not. I'd like to learn more about what really happens under the hood when I call grep.
Please correct me if I'm wrong:
I use a mechanical hard drive with roughly 200 MB/s read speed, so it will take roughly 10 million MB / 200 MB/s = 50,000 seconds ≈ 14 hours to finish. Is this an accurate estimate?
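The arithmetic behind that estimate, written out as a small Python sketch. It assumes decimal units (1 TB = 1,000,000 MB) and a sustained sequential read at the quoted speed; the function name is made up for illustration.

```python
def read_time_seconds(size_tb, throughput_mb_per_s):
    """Time to stream a file once, assuming 1 TB = 1,000,000 MB."""
    size_mb = size_tb * 1_000_000
    return size_mb / throughput_mb_per_s


seconds = read_time_seconds(10, 200)
hours = seconds / 3600
print(f"{seconds:.0f} s = {hours:.1f} h")  # 50000 s = 13.9 h
```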
The short answer is: no.
The longer answer is: it depends.
The even longer answer is: grep's performance depends on a lot of things:
Are you running a fixed-string search (-F, fgrep) or not? grep uses the Boyer-Moore algorithm, which by itself cannot match regular expressions. So what grep does (or at least used to do) is first pick a fixed string out of your regexp, search for it in the text using BM, and only then do a regexp match. (I am not sure whether the current implementation uses an NFA or a DFA; probably a hybrid.)
How long is your pattern? BM works faster for longer patterns.
How many matches will you have? The fewer the matches, the faster it will be.
What are your CPU and memory? A fast hard drive helps only with reading, not with computation time.
What other options are you using with your grep?
14 hours might not even be your lower bound because Boyer-Moore is smart enough to compute an offset at which next possible match might occur so it doesn't need to read-in the whole file. This does depend on the implementation though and is just my speculation. After re-running the below test with a much longer pattern I was able to go down to 0.23sec and I don't think my disk is that fast. But there might be some caching involved instead.
For instance I'm running on a 500MB/s SSD (at least that's what the manufacturer says) and grepping a 200MB file with a very short pattern (few chars) gives me:
With 808320 hits
real 0m1.734s
user 0m1.334s
sys 0m0.120s
With 0 hits:
real 0m0.059s
user 0m0.046s
sys 0m0.016s
#Edit: in short, read about Boyer-Moore :-)
#Edit2: well, to check how grep really works you should check the source code; what I described above is only a very general workflow.
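To make the "longer patterns let the search skip ahead" point concrete, here is a minimal Python sketch of the Horspool simplification of Boyer-Moore. It is not GNU grep's implementation, only an illustration; the function name and the sample text are assumptions, and the comparison counter is there just to show how few characters get inspected.

```python
def horspool_search(text, pattern):
    """Return (match positions, number of character comparisons)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return [], 0
    # For each pattern character except the last: how far the window may
    # shift when that character is aligned with the window's last position.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    matches, comparisons = [], 0
    pos = 0
    while pos <= n - m:
        j = m - 1
        # Compare right to left inside the current window.
        while j >= 0:
            comparisons += 1
            if text[pos + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            matches.append(pos)
        # Characters that do not occur in the pattern allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return matches, comparisons


text = "the quick brown fox " * 50 + "cappucino"
matches, comparisons = horspool_search(text, "cappucino")
print(matches, comparisons, len(text))
```

On this sample the search finds the single match at offset 1000 while comparing far fewer characters than the text contains. The bytes still have to be read from disk, so this saves CPU work, not I/O.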
Here is an excerpt from the Mastering Dyalog APL book, from the chapter on Inner Products:
HMS is a variable which contains duration in Hours, Minutes, and Seconds: HMS ← 3 44 29
We would like to convert it into seconds. We shall see 3 methods just now, and a 4th method will be given in another chapter.
A horrible solution                        (3600×HMS[1]) + (60×HMS[2]) + HMS[3]
A good APL solution                        +/ 3600 60 1 × HMS
An excellent solution with Inner Product   3600 60 1 +.× HMS
It then says that the second and third solutions are equivalent in terms of number of characters typed and performance.
As I understand it, APL programmers should generally use Inner Product, as well as Outer Product, as much as possible. Is that correct?
Can you give an example when using Inner Product would lead to performance gains? What exactly happens when I use Inner Product (on a lower level)? Is the first solution presented above horrible just because it doesn't use APL syntax in a proper way, or does it actually have worse performance?
I know there are a few questions here, but what I am asking about in general is how the Inner/Outer Products work and when exactly an APL programmer should use them.
We’ve done work to optimize both the +/ and the +.×.
MBaas is right in that it happens that the +/ in this instance is slightly better than the +.×
Our general advice is: use the constructs in the language best suited for the job, and eventually the implementation will catch up.
The "horrible" solution is considered bad as it does not use array thinking.
Regards,
Vince, Dyalog Support
APL programmers should generally use Inner Product, as well as Outer Product, as much as possible. Is that correct?
It is really up to the APL programmer and the task at hand, but if something makes APL code more concise and efficient, I don't see why a programmer wouldn't opt for it.
In this particular case 60⊥HMS is even more concise and efficient than the inner product.
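For readers who don't know APL's decode (⊥): with a scalar left argument it evaluates the right argument as digits in that base. A rough Python equivalent using Horner's rule is sketched below; the function name is hypothetical, and it covers only the scalar-base case used here.

```python
def decode(base, digits):
    """Evaluate digits (most significant first) in the given base, Horner style."""
    total = 0
    for digit in digits:
        total = total * base + digit
    return total


hms = [3, 44, 29]
print(decode(60, hms))                                # 13469
print(sum(w * d for w, d in zip([3600, 60, 1], hms)))  # 13469
```

Both lines give the same number of seconds; decode simply avoids spelling out the weights 3600 60 1.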
Can you give an example when using Inner Product would lead to performance gains?
As typical in array-oriented programming, performance gains are achieved by doing things in one go.
Most APL functions are implicit loops---their implementation uses a counter, a limit for it, and an increment step.
The shorter your code is, the better, because not only is it easier to hold in one's head, it's also more efficient, as the interpreter has to do fewer passes over the data.
Some implementations do loop fusion in an attempt to reduce this overhead.
Some have idiom recognition---certain combinations of squiggles are special-cased in the interpreter. Doing things in one go also allows the interpreter to do clever optimisations like using the SSE instruction set or GPUs.
Coming back to inner product, let's take the example of A f.g B where A and B are vectors and see how f and g are applied (in Dyalog):
f←{⎕←(⍕⍺),' f ',⍕⍵ ⋄ ⍺+⍵}
g←{⎕←(⍕⍺),' g ',⍕⍵ ⋄ ⍺×⍵}
0 1 2 3 4 f.g 5 6 7 8 9
4 g 9
3 g 8
24 f 36
2 g 7
14 f 60
1 g 6
6 f 74
0 g 5
0 f 80
80
You can see from the above that calls to f and g are interleaved. The interpreter applies g and reduces with f simultaneously, in one pass, avoiding the creation of a temporary array, which f/ A g B would need.
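The difference between the fused and the two-step evaluation can be sketched in Python. This is only a model of the call order shown in the trace (right to left, g then f), not a description of how Dyalog's interpreter is written; both function names are made up.

```python
from operator import add, mul


def inner_product(f, g, a, b):
    """Fused A f.g B for vectors: apply g pairwise and fold with f, right to left."""
    acc = g(a[-1], b[-1])
    for x, y in zip(reversed(a[:-1]), reversed(b[:-1])):
        acc = f(g(x, y), acc)  # no intermediate list is ever built
    return acc


def reduce_after_map(f, g, a, b):
    """Two-step f/ A g B: materialise every g result first, then fold with f."""
    temp = [g(x, y) for x, y in zip(a, b)]  # the temporary array
    acc = temp[-1]
    for value in reversed(temp[:-1]):
        acc = f(value, acc)
    return acc


a, b = [0, 1, 2, 3, 4], [5, 6, 7, 8, 9]
print(inner_product(add, mul, a, b))     # 80
print(reduce_after_map(add, mul, a, b))  # 80
```

The results agree; the first version just never allocates the list of products.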
Another example: http://archive.vector.org.uk/art10500200
You can test the performance of different solutions for yourself and see which one works best:
)copy dfns.dws cmpx
⍝ or: ")copy dfns cmpx" if you are using Windows
HMS ← 3 44 29
cmpx '(3600×HMS[1]) + (60×HMS[2]) + HMS[3]' '+/ 3600 60 1 × HMS' '3600 60 1 +.× HMS' '60⊥HMS'
(3600×HMS[1]) + (60×HMS[2]) + HMS[3] → 2.7E¯6 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
+/ 3600 60 1 × HMS → 9.3E¯7 | -66% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
3600 60 1 +.× HMS → 8.9E¯7 | -68% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
60⊥HMS → 4.8E¯7 | -83% ⎕⎕⎕⎕⎕⎕⎕
The problem with generalizations is that they might be incorrect, but as a rule of thumb I'd say using the inner & outer products will benefit readability as well as performance ;-)
Now, looking at the thing in practice:
]performance.RunTime (3600×HMS[1])+(60×HMS[2])+HMS[3] -repeat=100000
Benchmarking "(3600×HMS[1])+(60×HMS[2])+HMS[3]", repeat=100000
Exp
CPU (avg): 0.001558503836
Elapsed: 0.001618446292
]performance.RunTime '+/ 3600 60 1 × HMS' -repeat=100000
Benchmarking "+/ 3600 60 1 × HMS", repeat=100000
Exp
CPU (avg): 0.0004698496481
Elapsed: 0.0004698496481
That is quite a difference - if you repeat it enough times to be measurable ;-)
But of course, with a larger dataset the advantage gets more visible!
Let's also look at the third variant:
]performance.RunTime '3600 60 1 +.× HMS' -repeat=100000
Benchmarking "3600 60 1 +.× HMS", repeat=100000
Exp
CPU (avg): 0.0004698496481
Elapsed: 0.000439859245
No difference here, but again - with "real data" (a larger array) you should see a much clearer difference. I think a simple explanation is that inner product is like one 'statement' for the interpreter, whereas the first variant has 3 separate multiplications and indexing operations, needs to consider priorities (brackets), and then has to sum up the results, which sounds like a lot of sweat ;-)
The 2nd statement has only one multiplication (for a vector), so it eliminates several steps already, and the inner product enables the interpreter to possibly combine some of its internal workings to do its job even faster.
BUT now here's a surprise:
v1←(10000/3600 60 1) ⋄v2← 10000/HMS
]performance.RunTime '+/v1 × v2' 'v1 +.× v2' -repeat=100000 -compare
+/v1 × v2 → 6.0E¯5 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
v1 +.× v2 → 6.3E¯5 | +5% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
I expected that the bigger arguments would help to make the last expression's performance-advantage more visible - but actually #2 won. Maybe Dyalog optimized case #2 more than #3... ;-)
I am testing 4 different algorithms in Ruby and I'm having trouble interpreting the results that are returned from Ruby's Benchmark utility.
I ran each set of algorithms twice, once using Benchmark.bm and then again using Benchmark.bmbm. Here are the results:
Benchmark.bm:
     real     stime   total    utime
1    214.91   3.44    154.93   151.48
2    208.85   3.03    161.37   158.34
3    224.40   3.23    161.63   158.41
4    234.02   3.34    163.49   160.16
Benchmark.bmbm:
     real     stime   total    utime
1    252.61   3.50    163.89   160.39
2    278.56   3.65    164.61   160.96
3    241.89   3.37    162.73   159.36
4    256.12   3.56    163.91   160.35
Which algorithm performs the best (1, 2, 3 or 4) according to these results? What are the practical differences between using Benchmark.bm and Benchmark.bmbm?
Apologies if this is answered elsewhere, but I couldn't find a direct answer.
There don't seem to be significant differences between the algorithms. What's usually the most relevant number is 'total', which is the total time spent by the CPU on running your code. These are all pretty close to each other.
The difference between Benchmark.bm and Benchmark.bmbm is that the latter runs the benchmark, throws away the result, then runs it again and returns this second run's results. The reason to do this is to avoid unfair disadvantage to some of the subjects: it is possible that there exist common resources between the subjects that only one of them needs to initialize, and with bmbm you have a better chance that everything will be in a 'warm' state and thus obtain a fair result.
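The rehearsal idea described above can be sketched in a few lines of Python. This illustrates the principle only and is not Ruby's Benchmark API; the helper name and the two toy jobs are assumptions made for the example.

```python
import time

calls = []  # records the order in which jobs actually run


def build_list():
    calls.append("build")
    return [i * i for i in range(10_000)]


def sort_list():
    calls.append("sort")
    return sorted(range(10_000, 0, -1))


def rehearse_then_time(jobs):
    """Run every job once untimed (rehearsal), then time a second run of each."""
    for _label, job in jobs:
        job()  # warm-up pass: result and duration are thrown away
    timings = {}
    for label, job in jobs:
        start = time.perf_counter()
        job()
        timings[label] = time.perf_counter() - start
    return timings


timings = rehearse_then_time([("build", build_list), ("sort", sort_list)])
print(calls)  # ['build', 'sort', 'build', 'sort']
```

Whatever shared state the first job would otherwise have paid for (caches, lazily loaded code, allocator growth) is already warm by the time the timed pass begins.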
I have a script in a game with a function that gets called every second. Distances between player objects and other game objects are calculated every second there. The problem is that there can theoretically be 800 function calls in 1 second (max. 40 players * 2 main objects, each with 1 up to 10 sub-objects). I have to optimize this function for less processing. This is my current function:
local square = math.sqrt;
local getDistance = function(a, b)
    local x, y, z = a.x-b.x, a.y-b.y, a.z-b.z;
    return square(x*x+y*y+z*z);
end;
-- for example followed by: for i = 1, 800 do getDistance(posA, posB); end
I found out that localizing the math.sqrt function through
local square = math.sqrt;
is a big optimization in terms of speed, and the code
x*x+y*y+z*z
is faster than this code:
x^2+y^2+z^2
I don't know if the localization of x, y and z is better than using the class method "." twice, so maybe square(a.x*b.x+a.y*b.y+a.z*b.z) is better than the code local x, y, z = a.x-b.x, a.y-b.y, a.z-b.z;
square(x*x+y*y+z*z);
Is there a better way in maths to calculate the vector length or are there more performance tips in Lua?
You should read Roberto Ierusalimschy's Lua Performance Tips (Roberto is the chief architect of Lua). It touches on some of the small optimizations you're asking about (such as localizing library functions and replacing exponents with their multiplicative equivalents). Most importantly, it conveys one of the most important and overlooked ideas in engineering: sometimes the best solution involves changing your problem. You're not going to fix a 30-million-calculation leak by reducing the number of CPU cycles the calculation takes.
In your specific case of distance calculation, you'll find it's best to make your primitive calculation return the intermediate sum representing squared distance and allow the use case to call the final Pythagorean step only if they need it, which they often don't (for instance, you don't need to perform the square root to compare which of two squared lengths is longer).
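A minimal sketch of that idea, written in Python for illustration (the same structure carries over to Lua). The function names and the tuple representation of positions are assumptions.

```python
def squared_distance(a, b):
    """Sum of squared coordinate differences; no square root taken."""
    dx, dy, dz = a[0] - b[0], a[1] - b[1], a[2] - b[2]
    return dx * dx + dy * dy + dz * dz


def within_range(a, b, radius):
    """Compare squared values, which is valid because both sides are non-negative."""
    return squared_distance(a, b) <= radius * radius


player = (0.0, 0.0, 0.0)
target = (3.0, 4.0, 0.0)
print(squared_distance(player, target))   # 25.0
print(within_range(player, target, 5.0))  # True
print(within_range(player, target, 4.9))  # False
```

The square root is only needed when the actual length is displayed or fed into further math; range checks and "which is closer" comparisons work on the squared value directly.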
This really should come before any discussion of optimization, though: don't worry about problems that aren't the problem. Rather than scouring your code for any possible issues, jump directly to fixing the biggest one. And if performance, rather than missing functionality, bugs and/or UX shortcomings, really is your most glaring issue, it is far more likely to come from a single bottleneck statement than from a pile-up of micro-inefficiencies.
Or, as the opening of the cited article states:
In Lua, as in any other programming language, we should always follow the two
maxims of program optimization:
Rule #1: Don’t do it.
Rule #2: Don’t do it yet. (for experts only)
I honestly doubt these kinds of micro-optimizations really help any.
You should be focusing on your algorithms instead: for example, get rid of some distance calculations through pruning, stop calculating the square roots of values you only compare (tip: if a^2<b^2 and a>0 and b>0, then a<b), etc.
Your "brute force" approach doesn't scale well.
What I mean by that is that every new object/player included in the system increases the number of operations significantly:
+---------+--------------+
| objects | calculations |
+---------+--------------+
|      40 |         1600 |
|      45 |         2025 |
|      50 |         2500 |
|      55 |         3025 |
|      60 |         3600 |
|     ... |          ... |
|     100 |        10000 |
+---------+--------------+
If you keep comparing "everything with everything", your algorithm will start taking more and more CPU cycles, growing quadratically.
The best option you have for optimizing your code isn't in "fine tuning" the math operations or using local variables instead of references.
What will really boost your algorithm will be eliminating calculations that you don't need.
The most obvious example would be not calculating the distance between Player1 and Player2 if you already have calculated the distance between Player2 and Player1. This simple optimization should reduce your time by a half.
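Visiting each unordered pair exactly once is easy to express; the Python sketch below uses itertools.combinations, with made-up object names, to show the count for 40 objects.

```python
from itertools import combinations


def unique_pairs(objects):
    """Every unordered pair once: (A, B) is produced, (B, A) and (A, A) are not."""
    return list(combinations(objects, 2))


players = [f"player{i}" for i in range(1, 41)]
pairs = unique_pairs(players)
print(len(players) ** 2, len(pairs))  # 1600 780
```

In a language without such a helper, the same effect comes from two nested loops where the inner index starts one past the outer index.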
Another very common implementation consists of dividing the space into "zones". When two objects are in the same zone, you calculate the distance between them normally. When they are in different zones, you use an approximation. The ideal way of dividing the space will depend on your context; an example would be dividing the space into a grid and, for players on different squares, using the distance between the centers of their squares, which you have computed in advance.
There's a whole branch of programming dealing with this issue; it's called space partitioning. Give this a look:
http://en.wikipedia.org/wiki/Space_partitioning
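As a rough illustration of grid-based partitioning, the Python sketch below buckets points into cubic cells and compares only points in the same or neighbouring cells. Note that this is a variant of the approach described above: instead of approximating distances across zones, it skips far-apart zones entirely, which suits "is anything nearby?" queries. All names and the cell-size choice (cell edge equal to the search radius) are assumptions.

```python
from collections import defaultdict
from itertools import product


def build_grid(points, cell_size):
    """Map each cell coordinate to the indices of the points inside it."""
    grid = defaultdict(list)
    for index, (x, y, z) in enumerate(points):
        cell = (int(x // cell_size), int(y // cell_size), int(z // cell_size))
        grid[cell].append(index)
    return grid


def close_pairs(points, radius):
    """Return sorted index pairs (i, j), i < j, whose distance is below radius."""
    grid = build_grid(points, radius)
    limit = radius * radius
    found = set()
    for (cx, cy, cz), members in grid.items():
        # Only the 27 cells around (and including) this one can hold neighbours.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                for i in members:
                    if i < j:
                        ax, ay, az = points[i]
                        bx, by, bz = points[j]
                        d2 = (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
                        if d2 < limit:
                            found.add((i, j))
    return sorted(found)


points = [(0, 0, 0), (1, 0, 0), (10, 10, 10), (10.5, 10, 10)]
print(close_pairs(points, 2))  # [(0, 1), (2, 3)]
```

The two clusters never get compared with each other, because their cells are not adjacent.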
Seriously?
Running 800 of those calculations should not take more than 0.001 second - even in Lua on a phone.
Did you do some profiling to see if it's really slowing you down? Did you replace that function with "return (0)" to verify that performance improves (yes, the functionality will be lost)?
Are you sure it's run every second and not every millisecond?
I haven't seen an issue running 800 of anything simple in 1 second since like 1987.
If you want to calculate sqrt for a positive number a, take the recursive sequence
x_0 = a
x_(n+1) = 1/2 * (x_n + a / x_n)
x_n converges to sqrt(a) as n -> infinity. The first several iterations should be fast and accurate enough.
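The recurrence above, written out as a short Python sketch. The function name and the default iteration count are arbitrary choices for illustration; a built-in square root is normally faster than a hand-rolled loop, so treat this as a demonstration of the sequence only.

```python
def newton_sqrt(a, iterations=8):
    """Approximate sqrt(a) for a >= 0 with x_(n+1) = (x_n + a / x_n) / 2, x_0 = a."""
    if a == 0:
        return 0.0  # avoids dividing by zero in the first step
    x = float(a)
    for _ in range(iterations):
        x = 0.5 * (x + a / x)
    return x


print(newton_sqrt(25))
print(newton_sqrt(2))
```

Each step roughly doubles the number of correct digits once the estimate is close, which is why a handful of iterations suffices for moderate inputs.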
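BTW! Maybe you could try the following formula for the length of a vector instead of the standard one. With absolute values it is the Manhattan (taxicab) distance; without them, positive and negative differences would cancel out.
local abs = math.abs;
local getDistance = function(a, b)
    local x, y, z = a.x-b.x, a.y-b.y, a.z-b.z;
    return abs(x)+abs(y)+abs(z);
end;
It's much easier to compute, and in some cases (e.g. if the distance is only needed to know whether two objects are close) it may be adequate.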
I'm optimizing some frequently run Perl code (once per day per file).
Do comments slow Perl scripts down? My experiments lean towards no:
use Benchmark;
timethese(20000000, {
'comments' => '$b=1;
# comment ... (100 times)
', 'nocomments' => '$b=1;'});
Gives pretty much identical values (apart from noise).
Benchmark: timing 10000000 iterations of comments, nocomments...
comments: 1 wallclock secs ( 0.53 usr + 0.00 sys = 0.53 CPU) # 18832391.71/s (n=10000000)
nocomments: 0 wallclock secs ( 0.44 usr + 0.00 sys = 0.44 CPU) # 22935779.82/s (n=10000000)
Benchmark: timing 20000000 iterations of comments, nocomments...
comments: 0 wallclock secs ( 0.86 usr + -0.01 sys = 0.84 CPU) # 23696682.46/s (n=20000000)
nocomments: 1 wallclock secs ( 0.90 usr + 0.00 sys = 0.90 CPU) # 22099447.51/s (n=20000000)
I get similar results if I run the comments and no-comments versions as separate Perl scripts.
It seems counter-intuitive, though: if nothing else, the interpreter needs to read the comments into memory every time.
Runtime performance? No.
Parsing and lexing performance? Yes, of course.
Since Perl tends to parse and lex on the fly, comments will affect "start up" performance.
Will they affect it noticeably? Unlikely.
Perl is a just-in-time compiled language, so comments and POD have no effect on run-time performance.
Comments and POD have a minuscule effect on compile-time, but they're so easy and fast for Perl to parse it's almost impossible to measure the performance hit. You can see this for yourself by using the -c flag to just compile.
On my Macbook, a Perl program with 2 statements and 1000 lines of 70-character comments takes the same time to compile as one with 1000 lines of empty comments, and as one with just the 2 print statements. Be sure to run each benchmark twice to allow your OS to cache the file, otherwise what you're benchmarking is the time to read the file from the disk.
If startup time is a problem for you, it's not because of comments and POD.
Perl compiles a script and then executes it. Comments marginally slow the compile phase, but have zero effect on the run phase.
Perl is not a scripting language in the same sense that shell scripts are. The interpreter does not read the file line by line. The execution of a Perl program is done in two basic stages: compilation and runtime [1]. During the compilation stage the source code is parsed and converted into bytecode. During the runtime stage the bytecode is executed on a virtual machine.
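The same compile-then-run split can be observed in Python, which is used here purely as an analogy (it is not Perl, and Perl's internal representation differs). The sketch compiles one assignment with and without 100 comment lines and compares the resulting instruction bytes.

```python
# Source with 100 comment lines after the statement, and the bare statement.
with_comments = "b = 1\n" + "# comment ...\n" * 100
without_comments = "b = 1\n"

code_with = compile(with_comments, "<with comments>", "exec")
code_without = compile(without_comments, "<without comments>", "exec")

# The compiled instructions are identical: comments never reach the run phase.
same_instructions = code_with.co_code == code_without.co_code
print(same_instructions)  # True
```

Whatever cost the comments have is paid once, while the source text is scanned; the loop that later executes the compiled code never sees them.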
Comments will slow down the parsing stage but the difference is negligible compared to the time required to parse the script itself (which is already very small for most programs). About the only time you're really concerned with parsing time is in a webserver environment where the program could be called many times per second. mod_perl exists to solve this problem.
You're using Benchmark. That's good! You should be looking for ways to improve the algorithm -- not micro-optimizing. Devel::DProf might be helpful to find any hot spots. You absolutely should not strip comments in a misguided attempt to make your program faster. You'll just make it unmaintainable.
[1] This is commonly called "just in time" compilation. Perl actually has several more stages like INIT and END that don't matter here.
The point is: optimize bottlenecks. Reading in a file consists of:
opening the file,
reading in its contents,
closing the file,
parsing the contents.
Of these steps, reading is the fastest part by far (I am not sure about closing, it is a syscall, but you don't have to wait for it to finish). Even if it is 10% of the whole thing (which it is not, I think), then reducing it by half only gives 5% improved performance, at the cost of missing comments (which is a very bad thing). For the parser, throwing away a line that begins with # is not a tangible slowdown. And after that, the comments are gone, so there can be no slowdown.
Now, imagine that you could actually improve the "reading in the script" part by 5% through stripping all comments (which is a really optimistic estimate, see above). How big is the share of "reading in the script" in overall time consumption of the script? Depends on how much it does, of course, but since perl scripts usually read at least one more file, it is 50% at most, but since perl scripts usually do something more, an honest estimate will bring this down to something in the range of 1%. So, the expected efficiency improvement by stripping all comments is at most (very optimistic) 2.5%, but really closer to 0.05%. And then, those where it actually gives more than 1% are already fast since they do almost nothing, so you are again optimizing at the wrong point.
Concluding, optimize bottlenecks.
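The estimate above is just a product of two fractions: the share of total time a phase takes, and how much faster that phase becomes. A small Python sketch with the numbers from the text (the function name is made up):

```python
def overall_saving(share_of_total, local_improvement):
    """Fraction of total run time saved when one phase is sped up."""
    return share_of_total * local_improvement


optimistic = overall_saving(0.50, 0.05)  # script loading is 50% of run time
realistic = overall_saving(0.01, 0.05)   # script loading is about 1% of run time
print(f"{optimistic:.2%} {realistic:.2%}")  # 2.50% 0.05%
```

A 5% gain on a phase that is 1% of the total is lost in measurement noise, which is the whole argument for finding the bottleneck first.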
The Benchmark module is useless in this case. It's only measuring the time to run the code over and over again. Since your code doesn't actually do anything, most of it is optimized away. That's why you're seeing it run 22 million times a second.
I have almost an entire chapter about this in Mastering Perl. The error of measurement in the Benchmark technique is about 7%. Your benchmark numbers are well within that, so there's virtually no difference.
From Paul Tomblin's comment:
Doesn't perl do some sort of on-the-fly compilation? Maybe the comments get discarded early?
Yes, Perl does.
It is a programming language in between compiled and interpreted. The code gets compiled on the fly and then run. The comments usually don't make any difference. The most they would probably affect is the initial stage, when Perl parses the file line by line and precompiles it; you might see a nanosecond difference.
I would expect that the one comment would only get parsed once, not multiple times in the loop, so I doubt it is a valid test.
I would expect that comments would slightly slow compilation, but I expect it would be too minor to bother removing them.
Do Perl comments slow a script down? Well, parsing it, yes. Executing it after parsing it? No. How often is a script parsed? Only once. So if you have a comment within a for loop, the comment is discarded by the parser once, before the script even runs. Once it has started running, the comment is already gone (and the script is not stored as source text internally by Perl), so no matter how many times the for loop repeats, the comment won't have an influence. How fast can the parser skip over comments? The way Perl comments are done, very fast, so I doubt you will notice. You will notice a higher start-up time if you have 5 lines of code and 1 million lines of comments between each line... but how likely is that, and of what use would a comment that large be?