I was doing some benchmarking of Perl performance and ran into a case that I thought was somewhat odd. Suppose you have a function that uses a value from an array multiple times. In this case, you often see code like:
sub foo {
    my $value = $array[17];
    do_something_with($value);
    do_something_else_with($value);
}
The alternative is not to create a local variable at all:
sub foo {
    do_something_with($array[17]);
    do_something_else_with($array[17]);
}
For readability, the first is clearer. I assumed that performance would be at least equal (or better) for the first case too - array lookup requires a multiply-and-add, after all.
Imagine my surprise when this test program showed the opposite. On my machine, re-doing the array lookup is actually faster than storing the result, until I increase ITERATIONS to 7; in other words, for me, creating a local variable is only worthwhile if it's used at least 7 times!
use Benchmark qw(:all);
use constant { ITERATIONS => 4, TIME => -5 };

# sample array
my @array = (1 .. 100);

cmpthese(TIME, {
    # local variable version
    'local_variable' => sub {
        my $index = int(rand(scalar @array));
        my $val = $array[$index];
        my $ret = '';
        for (my $i = 0; $i < ITERATIONS; $i++) {
            $ret .= $val;
        }
        return $ret;
    },
    # multiple array access version
    'multi_access' => sub {
        my $index = int(rand(scalar @array));
        my $ret = '';
        for (my $i = 0; $i < ITERATIONS; $i++) {
            $ret .= $array[$index];
        }
        return $ret;
    }
});
Result:
                   Rate local_variable multi_access
local_variable 245647/s             --          -5%
multi_access   257907/s             5%           --
It's not a HUGE difference, but it brings up my question: why is it slower to create a local variable and cache the array lookup, than to do the lookup again? Reading other S.O. posts, I've seen that other languages / compilers do have the expected outcome, and sometimes even transform these into the same code. What is Perl doing?
I've done more poking around at this today, and what I've determined is that scalar assignment of any sort is an expensive operation, relative to the overhead of one-deep array lookup.
This seems like it's just restating the initial question, but I feel I have found more clarity. If, for example, I modify my local_variable subroutine to do another assignment like so:
my $index = int(rand(scalar @array));
my $val = 0; # <- this is new
$val = $array[$index];
my $ret = '';
...the code suffers an additional 5% speed penalty beyond the single-assignment version - even though it does nothing but a dummy assignment to the variable.
I also tested whether scope caused setup/teardown of the variable to impede performance, by switching it to a global instead of a lexically scoped one. The difference is negligible (see comments to @zdim above), pointing away from construct/destruct as the performance bottleneck.
In the end, my confusion was based on faulty assumptions that scalar assignment should be fast. I am used to working in C, where copying a value to a local variable is an extremely quick operation (1-2 asm instructions).
As it turns out, this is not the case in Perl (though I don't know exactly why, it's ok). Scalar assignment is a relatively "slow" operation... Whatever Perl internals are doing to get at the nth element of an Array object is actually quite fast by comparison. The "multiply and add" I mentioned in the initial post is still far less work than the code for scalar assignment.
That is why it takes so many lookups to match the performance of caching the result: simply assigning to the "cache" variable is ~7 times slower (for my setup).
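A minimal sketch that isolates just that difference (illustrative; not the benchmark from above). The only change between the two subs is the extra lexical assignment:

use strict;
use warnings;
use Benchmark qw(cmpthese);

my @array = (1 .. 100);
my $index = 42;

cmpthese(-3, {
    # pay for a scalar assignment, then use the lexical once
    assign_then_use => sub { my $val = $array[$index]; my $s = '' . $val },
    # same work, but using the array lookup directly
    use_directly    => sub { my $s = '' . $array[$index] },
});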
Let's first turn the statement around: caching the lookup is expected to be faster since it avoids the repeated lookups, even though the caching itself costs something, and it starts being faster once more than 7 lookups are done. Now that's not so shocking, I think.
As to why it's slower for fewer than seven iterations ... I'll guess that the cost of the scalar creation is still greater than those few lookups. It is surely greater than one lookup, yes? How about two, then? I'd say that "a few" may well be a good measure.
Related
I've searched many other Stack Overflow questions on map, but this requirement is particular and, try as I might, I cannot quite get the solution I am looking for (or the one I think must exist).
This question is simply about performance.
As limited background, this code segment is used in decoding incoming tokens, so it runs on every web request and performance is therefore critical. I know "map" can be used, so I want to use it.
Here is a trimmed down but nevertheless fully working code segment which I am currently using and works perfectly well:
use strict;
use Data::Dumper qw(Dumper);

my $api_token = { array => [ 'user_id', 'session_id', 'expiry' ], max => 3, name => 'session' };
my $token_got = [ 9923232345812112323, 1111323232000000465, 1002323001752323232 ];

my $rt;
for (my $i = 0; $i < scalar @{$api_token->{array}}; $i++) {
    $rt->{$api_token->{array}->[$i]} = $token_got->[$i];
}
$rt->{type} = $api_token->{name};

print Dumper($rt) . "\n";
The question is:
What is the absolute BEST POSSIBLE PERL CODE to replicate the foreach statement above in terms of performance?
Looks like you only need a hash slice
my %rt;
@rt{ @{ $api_token->{array} } } = @$token_got;
Or, if the hash reference is needed
my $rt;
@{ $rt }{ @{ $api_token->{array} } } = @$token_got;
or with the newer postfix dereferencing, on both array and hash slices, perhaps a bit nicer
my $rt;
$rt->@{ $api_token->{array}->@* } = @$token_got;
One can also do it using
List::MoreUtils::mesh, and in one statement
my $rt = { mesh @{ $api_token->{array} }, @$token_got };
or with pairwise from the same library
my $rt = { pairwise { $a, $b } @{ $api_token->{array} }, @$token_got };
These go via C code if the library gets installed with List::MoreUtils::XS.
I benchmarked all of the above with the tiny datasets from the question (realistic, though?). Whatever implementation mesh/pairwise have, they are multiple times as slow as the others.
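The harness was along these lines (a reconstruction, since the exact script wasn't posted; data copied from the question):

use strict;
use warnings;
use Benchmark qw(cmpthese);
use List::MoreUtils qw(mesh pairwise);

my $api_token = { array => [ 'user_id', 'session_id', 'expiry' ], max => 3, name => 'session' };
my $token_got = [ 9923232345812112323, 1111323232000000465, 1002323001752323232 ];

cmpthese(-3, {
    use_hash => sub { my %rt; @rt{ @{ $api_token->{array} } } = @$token_got },
    use_href => sub { my $rt; @{ $rt }{ @{ $api_token->{array} } } = @$token_got },
    use_post => sub { my $rt; $rt->@{ $api_token->{array}->@* } = @$token_got },
    use_mesh => sub { my $rt = { mesh @{ $api_token->{array} }, @$token_got } },
    use_pair => sub { my $rt = { pairwise { $a, $b } @{ $api_token->{array} }, @$token_got } },
});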
On an old laptop with v5.26
             Rate use_pair use_mesh use_href use_post use_hash
use_pair  373639/s       --     -36%     -67%     -67%     -68%
use_mesh  580214/s      55%       --     -49%     -49%     -51%
use_href 1129422/s     202%      95%       --      -1%      -5%
use_post 1140634/s     205%      97%       1%       --      -4%
use_hash 1184835/s     217%     104%       5%       4%       --
On a server with v5.36 the numbers are around 160%--170% against pairwise (with mesh being a bit faster than it, similarly to above)
Of the others, on the laptop the hash-based one is always a few percent quicker, while on a server with v5.36 they are all very close. Easy to call it a tie.
The following is an edit by the OP, who timed a 61% speedup (see comments)
CHANGED CODE:
@rt{ @{ $api_token->{array} } } = @$token_got; ### much faster one-liner replaced the loop. @zdim credit
I am extending Chess::Play, a little chess framework. I want to implement a method that undoes a move. In do_move() I save relevant state information for using it later to restore the state:
sub do_move {
    my ($self, $move) = @_;

    # Do some things and call the super method ...

    # Return everything needed for restoring the current state.
    return (
        BOARD => [@{$self->{BOARD}}],
        CASTLE_OK => {%{$self->{CASTLE_OK}}},
        COLOR_TO_MOVE => $self->{COLOR_TO_MOVE},
        # ... Other properties omitted for brevity.
    );
}
It is, of course, crucial to make deep copies of those properties that are not scalars. The original author is doing more or less the same internally in other places.
In undo_move() I restore the state. A shallow copy is enough because the state information is no longer needed.
sub undo_move {
    my ($self, $state) = @_;

    $self->{BOARD} = $state->{BOARD};
    $self->{CASTLE_OK} = $state->{CASTLE_OK};
    $self->{COLOR_TO_MOVE} = $state->{COLOR_TO_MOVE};
}
Alternatively, I could assign to a hash slice:
my @keys = keys %$state;
# Using just %$state on the right-hand side is a bug because
# it returns the values in arbitrary order. Therefore a hash
# slice must be used although the entire state has to be copied
# back.
@{$self}{@keys} = @{$state}{@keys};
But is that really more efficient? Or will Perl internally make a shallow copy of $self for that, copying also the hash slots that I don't want/need to touch?
If using the hash slice is more efficient, would it then be better to save the state in do_move() with something like this:
return Storable::dclone(@{$self}{qw(BOARD CASTLE_OK COLOR_TO_MOVE)});
It is key for this application that the code is efficient, not elegant or idiomatic, because it will be executed millions of times.
In reality my state information has 8 properties, one array of 144 integers, one array of 16 integers, one hash with 4 keys, one with two keys, and four scalars.
Yes, I should try out both versions and compare the performance. I will do that. But I am interested in what Perl is doing internally here and how to solve such problems in general.
Edit: Normally, chess engines do not save state but rather just undo the modifications to the board. But that is more complicated than it looks at first glance because of castling and en passant captures. My assumption at the moment is that the shorter the code, the faster it will be, so that most of it runs inside Perl's C code and is not interpreted as Perl bytecode.
Update: Added benchmarks. In short: the fastest way to copy (this) data is by copying items one by one. It is clearly faster than using a slice, and for this data much faster than using Storable.
I see two problems with copying data by hand, piece by piece
return (
    BOARD => [@{$self->{BOARD}}],
    CASTLE_OK => {%{$self->{CASTLE_OK}}},
    COLOR_TO_MOVE => $self->{COLOR_TO_MOVE},
    # ... Other properties omitted for brevity.
);
As for efficiency, this code manually dereferences and then constructs back arrays and hashes; no work is skipped, and all data gets copied. I don't see how it could be more efficient than the same job done by the fine-tuned C code in Storable, which is regarded as fast.†
More importantly, the code copies only "one-level-deep," so to say -- if the arrayref $self->{BOARD} has any references for values then there is a problem, and worse yet it'll be a quiet problem. I assume that that is not the case here but it still leaves me itchy allowing for a potential bug, should that change (the proverbial 6 months later, naturally).
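A tiny illustration of that quiet problem (my own example, with hypothetical nested data; in the question BOARD holds plain integers):

use strict;
use warnings;

my $self  = { BOARD => [ [1, 2], [3, 4] ] };       # suppose BOARD held arrayrefs
my $saved = { BOARD => [ @{ $self->{BOARD} } ] };  # one-level copy: outer array only

$self->{BOARD}[0][0] = 99;          # a later move changes the board in place ...
print $saved->{BOARD}[0][0], "\n";  # ... and prints 99: the "saved" state changed too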
So in principle I'd readily go with Storable.†
However, there's a special case, as clarified in comments: there is absolutely no reason to worry about the depth of the copy as there can only be plain arrays and hashes; and, only a subset of the data structure is needed.
First, I don't see why a hash slice would be (measurably) faster than copying item by item; each key still has to be dereferenced, and its value copied over. I don't know the implementation but I'd expect a "slice" to be a syntax feature, for which the work is done the way we do it manually.
While there may be some optimization with a slice, given that all used keys are known in that one statement, when you go key by key the "slicing" has effectively been done already, since the keys are given one by one.
On a million repetitions? Measure. My gut feeling is that, if anything, copying by hand could be faster, but only by a slim margin. Or try
$self->{$_} = $state->{$_} for keys %$state;
to avoid constructing an array (@keys). The postfix for loop (as a statement modifier) builds no scope, so that's another teeny-tiny benefit.
That leaves the question of how to return from do_move, and I would still have gone for Storable: even though there is some extra work in generating a slice, the whole copying should be faster in that old C code. But no -- see the benchmark. One can't extract a part of a hashref; it has to be rebuilt, and that voids the possible advantage of the library, for this simple data at least. (Worse, it turns out that the manual copy is faster than Storable even for copying the whole hashref, by a factor of 2. The data here is so simple that the "by hand" method has very little work to do.)
I hadn't measured any of this at first, and measuring of course answers the question -- see the benchmarks below.
Conclusion
The data here is so simple that a manual copy is much faster than using Storable, so do as shown in the question. (And since one has to rebuild only a part of the hashref using Storable is even less effective.) But measure on real data.
Then, rebuild the data to return (in undo_move) "by hand"; see the first benchmark below.
Edit: This discussion really needs measurements. Here are basic benchmarks.
Change parts of a hashref, using "slice" vs writing each key-value "by hand"
use warnings;
use strict;
use feature 'say';
use Storable qw(dclone);
use Benchmark qw(cmpthese);

my $runfor = shift // 3;  # seconds to run for; see cmpthese(...)

my $data = {
    one => [10..12], two => { a => 1, b => 2 },
    more => 'some',
};
my $slice = {  # use this to change $data
    one => [2..6], two => { A => 10, B => 20 },
};

sub by_slice {
    my ($data, $slice) = @_;
    my @keys = keys %$slice;
    @{$data}{@keys} = @{$slice}{@keys};
    return $data;
}

sub by_hand {
    my ($data, $slice) = @_;
    for (qw(one two)) {
        $data->{$_} = $slice->{$_}
    }
    # This can only be faster, but barely
    #$data->{$_} = $slice->{$_} for qw(one two);
    return $data;
}

my $d1 = dclone $data;
my $d2 = dclone $data;

cmpthese(-$runfor, {
    by_slice => sub { my $res_data = by_slice($data, $slice) },
    by_hand  => sub { my $res_data = by_hand ($data, $slice) },
});
Running this for 10 seconds (progname 10), on a very old laptop with perl 5.16 (CentOS 7)
             Rate by_slice  by_hand
by_slice 709738/s       --     -25%
by_hand  940751/s      33%       --
So doing it "by hand" is indeed faster, as guessed, and by somewhat more than expected. At least in this simple test case, but I'd expect no less (and perhaps more?) with more complex data. A part of this is due to constructing that extra array in the "slice" case.
Copying a part of a complex data structure, by hand vs using Storable
Well, this is a no-contest since there is no efficient way to extract a "part" of a hashref, but one has to rebuild it by pulling out wanted key => value pairs and constructing a hashref with them. Then, while using Storable we have to do all that and call the function to copy data. Given that by-hand copy in this case is very simple it beats the library use hands down (by a factor of 3)
Comparing
sub by_lib {
    my ($data) = @_;
    return dclone( { one => $data->{one}, two => $data->{two} } );
    #return dclone( { map { $_ => $data->{$_} } qw(one two) } );  # a little slower
}

sub by_hand {
    my ($data) = @_;
    return {
        one => [ @{ $data->{one} } ],
        two => { %{ $data->{two} } },
    }
}
by using them in the benchmark program above yields
            Rate  by_lib by_hand
by_lib  157838/s      --    -75%
by_hand 627359/s    297%      --
Note added: Just so, I also compared the by-hand copy with Storable for copying the whole data structure -- and the lib is slower by a factor of 2.
Two morals come out of that, for me: with simple data the overheads of a general-purpose library (which has to do a lot of extra work for its generality) are beaten by the manual copy; and needing the slice indeed hurts the library use further, going from x2 to x3.
Speaking of libraries and speed, also see footnote†.
† There is also JSON::XS, which serializes data and comes out about twice as fast as Storable in my test. Thus still half the speed of the manual copying, for simple data.
With it you'd need to JSON-encode data for return-ing from a sub, and to decode on receiving the return
use JSON::XS qw(encode_json decode_json);

sub do_move {
    ...
    return encode_json ...;
}
and
my $ret = decode_json do_move(...);
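A self-contained round-trip sketch (my own illustration; the keys follow the question, their contents are mocked):

use strict;
use warnings;
use JSON::XS qw(encode_json decode_json);

my $state = {
    BOARD         => [ (0) x 144 ],       # mocked: 144 integers
    CASTLE_OK     => { w => 1, b => 1 },  # mocked: a small hash
    COLOR_TO_MOVE => 0,
};

my $json = encode_json($state);  # serialization is where the deep copy happens
my $copy = decode_json($json);   # an independent deep copy of the state

$copy->{BOARD}[0] = 7;
print $state->{BOARD}[0], "\n";  # still 0 -- the structures are independent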
Many common operations aren't built in to Raku because they can be concisely expressed with a combination of (meta) operators and/or functions. It feels like binary search of a sorted array ought to be expressible in that way (maybe with .rotor? or …?) but I haven't found a particularly good way to do so.
For example, the best I've come up with for searching a sorted array of Pairs is:
sub binary-search(@a, $target) {
    when +@a ≤ 1 { @a[0].key == $target ?? @a[0] !! Empty }
    &?BLOCK(@a[0..^*/2, */2..*][@a[*/2].key ≤ $target], $target)
}
That's not awful, but I can't shake the feeling that it could be an awful lot better (both in terms of concision and readability). Can anyone see what elegant combo of operations I might be missing?
Here's one approach that technically meets my requirements (in that the function body fits on a single normal-length line). [But see the edit below for an improved version.]
sub binary-search(@a, \i is copy = my $=0, :target($t)) {
    for +@a/2, */2 … *≤1 { @a[i] cmp $t ?? |() !! return @a[i] with i -= $_ × (@a[i] cmp $t) }
}

# example usage (now slightly different, because it returns the index)
my @a = ((^20 .pick(*)) Z=> 'a'..*).sort;
say @a[binary-search(@a».key, :target(17))];
say @a[binary-search(@a».key, :target(1))];
I'm still not super happy with this code, because it loses a bit of readability – I still feel like there could/should be a concise way to do a binary search that also clearly expresses the underlying logic. Using a 3-way comparison feels like it's on that track, but still isn't quite there.
[edit: After a bit more thought, I came up with a more readable version of the above using reduce.
sub binary-search(@a, :target(:$t)) {
    (@a/2, */2 … *≤.5).reduce({ $^i - $^pt×(@a[$^i] cmp $t || return @a[$^i]) }) && Nil
}
In English, that reads as: for a sequence starting at the midpoint of the array and dropping by 1/2, move your index $^i by the value of the next item in the sequence – with the direction of the move determined by whether the item at that index is greater or lesser than the target. Continue until you find the target (in which case, return it) or you finish the sequence (which means the target wasn't present; return Nil)]
I have implemented the following statistical computation in Perl: http://en.wikipedia.org/wiki/Fisher_information.
The results are correct. I know this because I have hundreds of test cases that match input and output. The problem is that I need to compute this many times every single time I run the script; the average number of calls to this function is around 530. I used Devel::NYTProf to find this out, as well as where the slow parts are. I have optimized the algorithm to only traverse the top half of the matrix and reflect it onto the bottom, as the two halves are the same. I'm not a Perl expert, but I need to know if there is anything I can try to speed up the Perl. This script is distributed to clients, so compiling a C file is not an option. Is there another Perl library I can try? This needs to be sub-second in speed if possible.
For more information: $MatrixRef is a matrix of floating point numbers that is $rows by $variables. Here is the NYTProf dump for the function.
#-----------------------------------------------
#
#-----------------------------------------------
sub ComputeXpX
# spent 4.27s within ComputeXpX which was called 526 times, avg 8.13ms/call:
# 526 times (4.27s+0s) by ComputeEfficiency at line 7121, avg 8.13ms/call
{
526 0s my ($MatrixRef, $rows, $variables) = @_;
526 0s my $r = 0;
526 0s my $c = 0;
526 0s my $k = 0;
526 0s my $sum = 0;
526 0s my @xpx = ();
526 11.0ms for ($r = 0; $r < $variables; $r++)
{
14202 19.0ms my @temp = (0) x $variables;
14202 6.01ms push(@xpx, \@temp);
526 0s }
526 7.01ms for ($r = 0; $r < $variables; $r++)
{
14202 144ms for ($c = $r; $c < $variables; $c++)
{
198828 43.0ms $sum = 0;
#for ($k = 0; $k < $rows; $k++)
198828 101ms foreach my $RowRef (@{$MatrixRef})
{
#$sum += $MatrixRef->[$k]->[$r]*$MatrixRef->[$k]->[$c];
6362496 3.77s $sum += $RowRef->[$r]*$RowRef->[$c];
}
198828 80.1ms $xpx[$r]->[$c] = $sum;
#reflect on other side of matrix
198828 82.1ms $xpx[$c]->[$r] = $sum if ($r != $c);
14202 1.00ms }
526 2.00ms }
526 2.00ms return \@xpx;
}
Since each element of the result matrix can be calculated independently, it should be possible to calculate some/all of them in parallel. In other words, none of the instances of the innermost loop depend on the results of any other, so they could run simultaneously on their own threads.
There really isn't much you can do here without rewriting parts in C, or moving to a better framework for mathematical operations than bare-bones Perl (→ PDL!).
Some minor optimization ideas:
You initialize @xpx with arrayrefs containing zeros. This is unnecessary, as you assign a value to every position either way. If you want to pre-allocate array space, assign to the $#array value:
my @array;
$#array = 100; # preallocate space for 101 scalars
This isn't generally useful, but you can benchmark with and without.
Iterate over ranges; don't use C-style for loops:
for my $c ($r .. $variables - 1) { ... }
Perl scalars aren't very fast for math operations, so offloading the range iteration to lower levels will gain a speedup.
Experiment with changing the order of the loops, and toy around with caching a level of array accesses. Keeping my $xpx_r = $xpx[$r] around in a scalar will reduce the number of array accesses. If your input is large enough, this translates into a speed gain. Note that this only works when the cached value is a reference; a combined sketch follows below.
Remember that perl does very few “big” optimizations, and that the opcode tree produced by compilation closely resembles your source code.
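Putting those ideas together, an untested sketch of what ComputeXpX might look like (same interface as in the question; measure before trusting it):

sub ComputeXpX {
    my ($MatrixRef, $rows, $variables) = @_;
    my @xpx;
    for my $r (0 .. $variables - 1) {
        # //= because the reflection below may already have created this row
        my $xpx_r = $xpx[$r] //= [];
        for my $c ($r .. $variables - 1) {
            my $sum = 0;
            $sum += $_->[$r] * $_->[$c] for @$MatrixRef;
            $xpx_r->[$c] = $sum;
            $xpx[$c][$r] = $sum if $r != $c;  # reflect onto the lower half
        }
    }
    return \@xpx;
}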
Edit: On threading
Perl threads are heavyweight beasts that literally clone the current interpreter. It is very much like forking.
Sharing data structures across thread boundaries is possible (use threads::shared; my $variable :shared = "foo") but there are various pitfalls. It is cleaner to pass data around in a Thread::Queue.
Splitting the calculation of one product over multiple threads could end up with your threads doing more communication than calculation. You could benchmark a solution that divides responsibility for certain rows between the threads. But I think recombining the solutions efficiently would be difficult here.
More likely to be useful is to have a bunch of worker threads running from the beginning. All threads listen to a queue which contains a pair of a matrix and a return queue. The worker would then dequeue a problem, and send back the solution. Multiple calculations could be run in parallel, but a single matrix multiplication will be slower. Your other code would have to be refactored significantly to take advantage of the parallelism.
Untested code:
use strict; use warnings; use threads; use Thread::Queue;
# spawn worker threads:
my $problem_queue = Thread::Queue->new;
my @threads = map threads->new(\&worker, $problem_queue), 1..3; # make 3 workers

# automatically close threads when program exits
END {
    $problem_queue->enqueue((undef) x @threads);
    $_->join for @threads;
}

# This is the wrapper around the threading,
# and can be called exactly as ComputeXpX
sub async_XpX {
    my $return_queue = Thread::Queue->new();
    $problem_queue->enqueue([$return_queue, @_]);
    return sub { $return_queue->dequeue };
}

# The main loop of worker threads
sub worker {
    my ($queue) = @_;
    while (defined(my $problem = $queue->dequeue)) {
        my ($return, @args) = @$problem;
        $return->enqueue(ComputeXpX(@args));
    }
}

sub ComputeXpX { ... } # as before
The async_XpX returns a coderef that will eventually collect the result of the computation. This allows us to carry on with other stuff until we need the result.
# start two calculations
my $future1 = async_XpX(...);
my $future2 = async_XpX(...);
...; # do something else
# collect the results
my $result1 = $future1->();
my $result2 = $future2->();
I benchmarked the bare-bones threading code without doing actual calculations, and the communication is about as expensive as the calculations. I.e. with a bit of luck, you may start to get a benefit on a machine with at least four processors/kernel threads.
A note on profiling threaded code: I know of no way to do that elegantly. Benchmarking threaded code, but profiling with single-threaded test cases may be preferable.
The core perl function rand() is not thread-safe, and I need random numbers in a threaded monte carlo simulation.
I'm having trouble finding any notes in CPAN on the various random-number generators there as to which (if any) are thread-safe, and every google search I do keeps getting cluttered with C/C++/python/anything but perl. Any suggestions?
Do not use built-in rand for Monte Carlo on Windows. At least, try:
my %r = map { rand() => undef } 1 .. 1_000_000;
print scalar keys %r, "\n";
If nothing has changed, it should print 32768 which is utterly unsuitable for any kind of serious work. And, even if it does print a larger number, you're better off sticking with a PRNG with known good qualities for simulation.
You can use Math::Random::MT.
You can instantiate a new Math::Random::MT object in each thread with its own array of seeds. Mersenne Twister has good properties for simulation.
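A minimal sketch of that pattern (illustrative seeds and workload; see the module's documentation for seeding details):

use strict;
use warnings;
use threads;
use Math::Random::MT;

my @workers = map {
    my $seed = 12345 + $_;    # illustrative: give each thread a distinct seed
    threads->create(sub {
        my $mt  = Math::Random::MT->new($seed);  # per-thread generator state
        my $sum = 0;
        $sum += $mt->rand() for 1 .. 1_000;      # $mt->rand() behaves like rand()
        return $sum / 1_000;                     # toy Monte Carlo estimate (~0.5)
    });
} 1 .. 4;

print $_->join, "\n" for @workers;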
Do you have /dev/urandom on your system?
BEGIN {
    open URANDOM, '<', '/dev/urandom' or die "Cannot open /dev/urandom: $!";
}

sub urand { # drop-in replacement for rand
    my $expr = shift || 1;
    my $x;
    read URANDOM, $x, 4;
    return $expr * unpack("I", $x) / (2**32);
}
rand is thread-safe, and I think you have the wrong definition of what "thread safe" means. If something is not "thread safe", it means the program/function modifies a "shared" data structure in a way that makes its execution in threaded mode unsafe.
Check the rand function documentation; notice that it takes EXPR as an argument, and in every thread you can provide a different EXPR.
http://perldoc.perl.org/functions/rand.html