Perl: faster data lookup in a hash table (performance)

I use code like this to find data values for my calculations:
sub get_data {
$x =0 if($_[1] eq "A"); #get column number by name
$data{'A'}= [2.00000, 0.15000, -0.00143, 33.51030, 0.77, 1, 0, 12];
return $data{$_[0]}[$x];
}
The data is stored like this in a Perl file. I plan to have no more than 100 columns. To get a value I call:
get_data(column, row);
I have now realized that this is a terribly slow way to look up data in a table. How can I do it faster? With SQL?

Looking at your github code, the main problem you have is that your
big hash of arrays is initialized every time the function is called.
Your current code:
my @atom;
# {'name'}= radius, depth, solvation_parameter, volume, covalent_radius, hydrophobic, H_acceptor, MW
$atom{'C'}= [2.00000, 0.15000, -0.00143, 33.51030, 0.77, 1, 0, 12];
$atom{'A'}= [2.00000, 0.15000, -0.00052, 33.51030, 0.77, 0, 0, ''];
$atom{'N'}= [1.75000, 0.16000, -0.00162, 22.44930, 0.75, 0, 1, 14];
$atom{'O'}= [1.60000, 0.20000, -0.00251, 17.15730, 0.73, 0, 1, 16];
...
Time taken for your test case on the slow netbook I'm typing this on: 6m24.400s.
The most important thing to do is to move this out of the function, so it's
initialized only once, when the module is loaded.
Time taken after this simple change: 1m20.714s.
But since I'm making suggestions, you could write it more legibly:
my %atom = (
C => [ 2.00000, 0.15000, -0.00143, 33.51030, 0.77, 1, 0, 12 ],
A => [ 2.00000, 0.15000, -0.00052, 33.51030, 0.77, 0, 0, '' ],
...
);
Note that %atom is a hash in both cases, so your code doesn't do what you
were imagining: it declares a lexically-scoped array @atom, which is unused, then proceeds to fill up an unrelated global variable %atom. (Also, do you really want an empty string for the MW of A? And what kind of atom is A anyway?)
Secondly, your name-to-array-index mapping is also slow. Current code:
#take correct value from data table
$x = 0 if($_[1] eq "radius");
$x = 1 if($_[1] eq "depth");
$x = 2 if($_[1] eq "solvation_parameter");
$x = 3 if($_[1] eq "volume");
$x = 4 if($_[1] eq "covalent_radius");
$x = 5 if($_[1] eq "hydrophobic");
$x = 6 if($_[1] eq "H_acceptor");
$x = 7 if($_[1] eq "MW");
This is much better done as a hash (again, initialized outside the function):
my %index = (
radius => 0,
depth => 1,
solvation_parameter => 2,
volume => 3,
covalent_radius => 4,
hydrophobic => 5,
H_acceptor => 6,
MW => 7
);
Or you could be snazzy if you wanted:
my %index = map { [qw[radius depth solvation_parameter volume
covalent_radius hydrophobic H_acceptor MW
]]->[$_] => $_ } 0..7;
Either way, the code inside the function is then simply:
$x = $index{$_[1]};
Time now: 1m13.449s.
Another approach is just to define your field numbers as constants.
Constants are capitalized by convention:
use constant { RADIUS => 0, DEPTH => 1, ... };
Then the code in the function is
$x = $_[1];
and you then need to call the function using the constants instead of strings:
get_atom_parameter('C', RADIUS);
I haven't tried this.
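As a minimal sketch of the first two suggestions combined (table built once at file scope, column name resolved through a hash), something like the following should work. Only three atom types are copied from the question's table; the error handling with die is an addition for illustration and is not in the original code.

```perl
use strict;
use warnings;

# Built once, when the file is loaded, not on every call.
# Columns: radius, depth, solvation_parameter, volume,
#          covalent_radius, hydrophobic, H_acceptor, MW
my %atom = (
    C => [ 2.00000, 0.15000, -0.00143, 33.51030, 0.77, 1, 0, 12 ],
    N => [ 1.75000, 0.16000, -0.00162, 22.44930, 0.75, 0, 1, 14 ],
    O => [ 1.60000, 0.20000, -0.00251, 17.15730, 0.73, 0, 1, 16 ],
);

# Column name => array index, also built only once.
my %index = (
    radius => 0, depth => 1, solvation_parameter => 2, volume => 3,
    covalent_radius => 4, hydrophobic => 5, H_acceptor => 6, MW => 7,
);

sub get_atom_parameter {
    my ($type, $field) = @_;
    # The die checks are an illustrative addition, not part of the original.
    my $row = $atom{$type} or die "unknown atom type '$type'";
    my $col = $index{$field};
    die "unknown field '$field'" unless defined $col;
    return $row->[$col];
}

print get_atom_parameter('O', 'radius'), "\n";
```

Each call is then two hash lookups and one array access, with no table construction and no chain of string comparisons.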
But stepping back a bit and looking at how you are using this function:
while($ligand_atom[$x]{'atom_type'}[0]) {
print STDERR $ligand_atom[$x]{'atom_type'}[0];
$y=0;
while($protein_atom[$y]) {
$d[$x][$y] = sqrt(distance_sqared($ligand_atom[$x],$protein_atom[$y]))
- get_atom_parameter::get_atom_parameter($ligand_atom[$x]{'atom_type'}[0], 'radius');
- get_atom_parameter::get_atom_parameter($protein_atom[$y]{'atom_type'}[0], 'radius');
$y++;
}
$x++;
print STDERR ".";
}
Each time through the loop you are calling get_atom_parameter twice to
retrieve the radius.
But for the inner loop, one atom is constant throughout. So hoist the call
to get_atom_parameter out of the inner loop, and you've almost halved the
number of calls:
while($ligand_atom[$x]{'atom_type'}[0]) {
print STDERR $ligand_atom[$x]{'atom_type'}[0];
$y=0;
my $lig_radius = get_atom_parameter::get_atom_parameter($ligand_atom[$x]{'atom_type'}[0], 'radius');
while($protein_atom[$y]) {
$d[$x][$y] = sqrt(distance_sqared($ligand_atom[$x],$protein_atom[$y]))
- $lig_radius
- get_atom_parameter::get_atom_parameter($protein_atom[$y]{'atom_type'}[0], 'radius');
$y++;
}
$x++;
print STDERR ".";
}
But there's more. In your test case the ligand has 35 atoms and the
protein 4128 atoms. This means that your initial code made
4128*35*2 = 288960 calls to get_atom_parameter, and while now it's
only 4128*35 + 35 = 144515 calls, it's easy to just make some arrays with
the radii so that it's only 4128 + 35 = 4163 calls:
my $protein_size = $#protein_atom;
my $ligand_size;
{
my $x=0;
$x++ while($ligand_atom[$x]{'atom_type'}[0]);
$ligand_size = $x-1;
}
#print STDERR "protein_size = $protein_size, ligand_size = $ligand_size\n";
my @protein_radius;
for my $y (0..$protein_size) {
$protein_radius[$y] = get_atom_parameter::get_atom_parameter($protein_atom[$y]{'atom_type'}[0], 'radius');
}
my @lig_radius;
for my $x (0..$ligand_size) {
$lig_radius[$x] = get_atom_parameter::get_atom_parameter($ligand_atom[$x]{'atom_type'}[0], 'radius');
}
for my $x (0..$ligand_size) {
print STDERR $ligand_atom[$x]{'atom_type'}[0];
my $lig_radius = $lig_radius[$x];
for my $y (0..$protein_size) {
$d[$x][$y] = sqrt(distance_sqared($ligand_atom[$x],$protein_atom[$y]))
- $lig_radius
- $protein_radius[$y]
}
print STDERR ".";
}
And finally, the call to distance_sqared [sic]:
#distance between atoms
sub distance_sqared {
my $dxs = ($_[0]{'x'}-$_[1]{'x'})**2;
my $dys = ($_[0]{'y'}-$_[1]{'y'})**2;
my $dzs = ($_[0]{'z'}-$_[1]{'z'})**2;
return $dxs+$dys+$dzs;
}
This function can usefully be replaced with the following, which uses
multiplication instead of **.
sub distance_sqared {
my $dxs = ($_[0]{'x'}-$_[1]{'x'});
my $dys = ($_[0]{'y'}-$_[1]{'y'});
my $dzs = ($_[0]{'z'}-$_[1]{'z'});
return $dxs*$dxs+$dys*$dys+$dzs*$dzs;
}
Time after all these modifications: 0m53.639s.
More about **: elsewhere you declare
use constant e_math => 2.71828;
and use it thus:
$Gauss1 += e_math ** (-(($d[$x][$y]*2)**2));
The built-in function exp() calculates this for you (in fact, ** is commonly
implemented as x**y = exp(log(x)*y), so each time you do this you are
performing an unnecessary logarithm, the result of which is just slightly less
than 1 because your constant is only accurate to 6 significant figures). This change would alter
the output very slightly. And again, **2 should be replaced by multiplication.
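To see how small the difference is, here is a short sketch that evaluates the Gaussian term both ways. The helper names gauss_pow and gauss_exp and the sample distances are made up for illustration; $d stands in for one entry of the distance matrix.

```perl
use strict;
use warnings;
use constant e_math => 2.71828;   # the truncated constant from the question

# Original form: general power operator applied to an approximate e.
sub gauss_pow {
    my ($d) = @_;
    return e_math ** ( -( ($d * 2) ** 2 ) );
}

# Suggested form: built-in exp() and a plain multiplication.
sub gauss_exp {
    my ($d) = @_;
    my $t = $d * 2;
    return exp( -$t * $t );
}

for my $d (0, 0.25, 0.5, 1) {
    printf "d=%.2f  pow=%.12f  exp=%.12f\n", $d, gauss_pow($d), gauss_exp($d);
}
```

The two columns should agree to about six decimal places, which is the "very slight" change in output mentioned above.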
Anyway, this answer is probably long enough for now, and calculation of d[]
is no longer the bottleneck it was.
Summary: hoist constant values out of loops and functions! Calculating the
same thing repeatedly is no fun at all.
Using any kind of database for this would not help your performance in the
slightest. One thing that might help you though is Inline::C. Perl is
not really built for this kind of intensive computation, and Inline::C
would allow you to easily move performance-critical bits into C while
keeping your existing I/O in Perl.
I would be willing to take a shot at a partial C port. How stable
is this code, and how fast do you want it to be? :)

Putting this in a DB will make it MUCH easier to maintain, scale, and expand. Using a DB can also save you a lot of RAM: it fetches and keeps in RAM only the desired result instead of storing ALL values.
As for speed, it depends. With a text file it takes a long time to read all the values into RAM, but once they are loaded, retrieving a value is very fast, faster than querying a DB.
So it depends on how your program is written and what it is for. Do you read all the values ONCE and then run 1000 queries? The text file is probably faster. Do you re-read all the values every time you make a query (to make sure you have the latest value set)? Then the DB would be faster. Do you run one query per day? Use a DB.


How can I use "map" in "perl" to return a hash reference whose key is looked up from an array reference whose value is looked up from another array?

I've searched many other Stack Overflow questions on map, but this requirement is quite specific and, try as I might, I cannot find the solution I am looking for, or even confirm that one exists.
This question is simply about performance.
As limited background: this code segment is used in decoding incoming tokens, so it runs on every web request and its performance is critical. I know "map" can be used, so I want to use it.
Here is a trimmed down but nevertheless fully working code segment which I am currently using and works perfectly well:
use strict;
use Data::Dumper qw (Dumper);
my $api_token = { array => [ 'user_id', 'session_id', 'expiry' ], max => 3, name => 'session' };
my $token_got = [ 9923232345812112323, 1111323232000000465, 1002323001752323232 ];
my $rt;
for (my $i=0; $i<scalar @{$api_token->{array}}; $i++) {
$rt->{$api_token->{array}->[$i]} = $token_got->[$i];
}
$rt->{type} = $api_token->{name};
print Dumper ($rt) . "\n";
The question is:
What is the absolute BEST POSSIBLE PERL CODE to replicate the foreach statement above in terms of performance?
Looks like you only need a hash slice:
my %rt;
@rt{ @{ $api_token->{array} } } = @$token_got;
Or, if the hash reference is needed:
my $rt;
@{ $rt } { @{ $api_token->{array} } } = @$token_got;
or, with the newer postfix dereferencing on both the array and the hash slice, perhaps a bit nicer:
my $rt;
$rt->@{ $api_token->{array}->@* } = @$token_got;
One can also do it using List::MoreUtils::mesh, in one statement:
my $rt = { mesh @{ $api_token->{array} }, @$token_got };
or with pairwise from the same library:
my $rt = { pairwise { $a, $b } @{ $api_token->{array} }, @$token_got };
These go via C code if the library is installed with List::MoreUtils::XS.
I benchmarked all of the above with the tiny datasets from the question (are they realistic, though?). Whatever implementation mesh and pairwise use, they are multiple times as slow as the others.
On an old laptop with v5.26
              Rate use_pair use_mesh use_href use_post use_hash
use_pair  373639/s       --     -36%     -67%     -67%     -68%
use_mesh  580214/s      55%       --     -49%     -49%     -51%
use_href 1129422/s     202%      95%       --      -1%      -5%
use_post 1140634/s     205%      97%       1%       --      -4%
use_hash 1184835/s     217%     104%       5%       4%       --
On a server with v5.36 the numbers are around 160%--170% against pairwise (with mesh being a bit faster than it, similarly to above)
Of the others, on the laptop the hash-based one is always a few percent quicker, while on a server with v5.36 they are all very close. Easy to call it a tie.
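For anyone wanting to repeat this kind of comparison, here is a minimal sketch using the core Benchmark module. It is not the script that produced the table above: the sub names use_loop and use_href and the short placeholder token values are assumptions made for illustration, and only the original loop and the hash-reference slice are compared.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $api_token = { array => [ 'user_id', 'session_id', 'expiry' ], name => 'session' };
my $token_got = [ 101, 202, 303 ];   # placeholder values, not the question's tokens

# The question's approach: assign one key at a time.
sub use_loop {
    my $rt;
    my $keys = $api_token->{array};
    $rt->{ $keys->[$_] } = $token_got->[$_] for 0 .. $#$keys;
    return $rt;
}

# Hash slice through a reference: all keys assigned in one statement.
sub use_href {
    my $rt;
    @{ $rt }{ @{ $api_token->{array} } } = @$token_got;
    return $rt;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese( -1, { use_loop => \&use_loop, use_href => \&use_href } );
```

The absolute rates will differ by machine and Perl version, so only the relative columns are worth reading.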
The following is an edit by the OP, who timed a 61% speedup (see comments).
CHANGED CODE:
@rt{ @{ $api_token->{array} } } = @$token_got; ### much faster one-liner replaced the loop. @zdim credit

How to sort number in file (highest)

I have a file that contains a lot of information.
I want to sort several of its entries by the numbers they contain.
Below is a sample of 7 lines (the middle line is blank):
GRELUP.C.3a.or:ndiff_c_fail_a_same_well = SELECT -inside GRELUP.C.3a.or:ndiff_c_fail_a GRELUP.C.3a.or:_EPTMPL312066 -not
generate layer GRELUP.C.3a.or:ndiff_c_fail_a_same_well, TYP = P, HPN = 0, FPN = 0, HEN = 0, FEN = 0
Time: cpu=0.00/8818.64 real=0.30/1875.23 Memory: 160.81/245.20/245.20
GRELUP.C.3a.or:ndiff_c_fail_a = SELECT -inside GRELUP.C.3a.or:ndiff_c_eg GRELUP.C.3a.or:well_cont_a_sized_a -not
generate layer GRELUP.C.3a.or:ndiff_c_fail_a, TYP = P, HPN = 0, FPN = 0, HEN = 0, FEN = 0
Time: cpu=0.00/8818.64 real=1.10/1875.23 Memory: 180.84/252.29/252.29
The lines I want returned are below.
GRELUP.C.3a.or:ndiff_c_fail_a real=1.10/1875.23
GRELUP.C.3a.or:ndiff_c_fail_a_same_well real=0.30/1875.23
In other words, the entry with the highest number after "real=" comes first, and each output line starts with the name that follows "generate layer" in the line above it.
I suggest doing this in several stages, as it is so much easier to take things in limited steps:
1. Split the data into records.
2. Convert each record into a reduced form that just has the information you want.
3. Sort now that you can easily determine what to sort by.
4. (Optional, depending on how you do step 2) Extract the information to print.
If your data is small enough to fit into memory, the first step can be done with:
proc splitIntoRecords {data} {
# U+001E is the official ASCII record separator; it's not used much!
regsub -all {\n{2,}} $data \u001e data
return [split $data \u001e]
}
I'm not quite so sure about the conversion step; this might work (on a single record; I'll lift to the collection with lmap later):
proc convertRecord {record} {
# We extract the parts we want to print and the part we want to sort by
regexp {(^\S+).*(real=([^\s/]+)/\S+)} $record -> name time val
return [list "$name $time" $val]
}
Once that's done, we can lsort -real -decreasing with a -index specified to get the collation key (the $vals we extracted above), and printing is now trivial:
set records [lmap r [splitIntoRecords $data] {convertRecord $r}]
foreach r [lsort -real -decreasing -index 1 $records] {
puts [lindex $r 0]
}

For-loop/ simple data extraction and comparison in R

In this example dataset I have created a column called 'Var'. This is the result I would like from the code. The pseudo-code to produce Var is: for each ID_Survey, compare the Distances in sequence; if the difference between sequential Distances is 10, then Var=1, otherwise Var=0. Var should be 1 for both elements of the sequence where the difference is 10.
#Generate data
ID_Survey=rep(seq(1,3,1),each=4)
Distance= c(0,25,30,40,50,160,170,190,200,210,1000,1010)
Var= c(0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1);
TestData=data.frame(cbind(ID_Survey,Distance,Var))
TestData
I can use a simple for-loop like this, which nearly works, but it trips up when moving from one ID_Survey to the next.
for(i in 1:(nrow(TestData)-1)){
TestData$Var2[i]=(TestData$Distance[i+1]==TestData$Distance[i]+10)}
I need to incorporate the above into a function which splits the data.frame into groups based on ID_Survey. I'm trying to build something like the following...
New6=do.call(rbind, by(TestData,list(TestData$ID_Survey),
FUN=function(x)
for (i in nrow(x)){ #loop must build an argument then return it.
#conditional statements in here.
return(x[i,])})); #loop appears to return 1st argument only.
... but I can't get the for-loop to operate inside the by-statement.
Any guidance much appreciated. Many thanks.
Using data.table's .SD takes care of splitting and recombining the chunks of the data.frame (as defined by ID_Survey) once they have been passed to a function. No doubt someone else will have a more elegant solution, but this seems to do the job:
library(data.table)
ComPair=function(a,b){V=ifelse(a==b-10,TRUE,FALSE);return(V)}
TestFunction=function(FData){
if(nrow(FData)>1){
for(i in 1:(nrow(FData)-1)){
V=ComPair(FData$Distance[i],FData$Distance[i+1])
if(V==1){ FData$Var2[i]=V;FData$Var2[i+1]=V}
}
};return(FData)}
TestData_dt=data.table(TestData)
TestData2=TestData_dt[,TestFunction(.SD),ID_Survey]
TestData2

Random iteration to fill a table in Lua

I'm attempting to fill a table of 26 values randomly. That is, I have a table called rndmalpha, and I want to randomly insert the values throughout the table. This is the code I have:
rndmalpha = {}
for i= 1, 26 do
rndmalpha[i] = 0
end
valueadded = 0
while valueadded == 0 do
a = math.random(1,26)
if rndmalpha[a] == 0 then
rndmalpha[a] = "a"
valueadded = 1
end
end
valueadded = 0
while valueadded == 0 do
a = math.random(1,26)
if rndmalpha[a] == 0 then
rndmalpha[a] = "b"
valueadded = 1
end
end
...
The code repeats itself until "z", so this is just the general idea. The problem I'm running into, however, is that as the table fills up, the random pick hits an empty slot less and less often. This has the potential to freeze the program, especially for the final letters, because only 2-3 positions still hold 0. So what happens if the while loop goes through a million iterations before it finally hits that last position? Is there an efficient way to say, "Hey, disregard positions 6, 13, 17, 24, and 25, and focus on filling the others"? For that matter, is there a much more efficient way to do what I'm doing overall?
The algorithm you are using seems pretty inefficient. It seems to me that all you need is to initialize a table with the whole alphabet:
math.randomseed(os.time())
local t = {"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"}
and then shuffle the elements:
for i = 1, #t*2 do
local a = math.random(#t)
local b = math.random(#t)
t[a],t[b] = t[b],t[a]
end
Swapping elements #t*2 times gives reasonably good randomness. If you need more, increase the number of swaps and use a better random number generator. The random() function provided by the C library is usually not that good.
Instead of drawing a random position for each letter, go through the table once and pick something random per position. The method you're using could take forever, because you might never hit the remaining empty slot.
Never repeat yourself. Never repeat yourself! If you're copying and pasting too often, it's a sure sign something has gone wrong. Use a second table to contain all the possible letters you can choose, and then randomly pick from that.
letters = {"a","b","c","d","e"}
numberOfLetters = 5
rndmalpha = {}
for i = 1, 26 do
rndmalpha[i] = letters[math.random(1,numberOfLetters)]
end

Perl fast matrix multiply

I have implemented the following statistical computation in Perl: http://en.wikipedia.org/wiki/Fisher_information.
The results are correct. I know this because I have hundreds of test cases that match input and output. The problem is that I need to compute this many times on every run of the script. The average number of calls to this function is around 530. I used Devel::NYTProf to find this out, as well as where the slow parts are. I have optimized the algorithm to only traverse the top half of the matrix and reflect it onto the bottom, as they are the same. I'm not a Perl expert, but I need to know if there is anything I can try to speed up the Perl. This script is distributed to clients, so compiling a C file is not an option. Is there another Perl library I can try? This needs to be sub-second in speed if possible.
For more information: $MatrixRef is a matrix of floating-point numbers that is $rows by $variables. Here is the NYTProf dump for the function.
#-----------------------------------------------
#
#-----------------------------------------------
sub ComputeXpX
# spent 4.27s within ComputeXpX which was called 526 times, avg 8.13ms/call:
# 526 times (4.27s+0s) by ComputeEfficiency at line 7121, avg 8.13ms/call
{
526 0s my ($MatrixRef, $rows, $variables) = @_;
526 0s my $r = 0;
526 0s my $c = 0;
526 0s my $k = 0;
526 0s my $sum = 0;
526 0s my @xpx = ();
526 11.0ms for ($r = 0; $r < $variables; $r++)
{
14202 19.0ms my @temp = (0) x $variables;
14202 6.01ms push(@xpx, \@temp);
526 0s }
526 7.01ms for ($r = 0; $r < $variables; $r++)
{
14202 144ms for ($c = $r; $c < $variables; $c++)
{
198828 43.0ms $sum = 0;
#for ($k = 0; $k < $rows; $k++)
198828 101ms foreach my $RowRef (@{$MatrixRef})
{
#$sum += $MatrixRef->[$k]->[$r]*$MatrixRef->[$k]->[$c];
6362496 3.77s $sum += $RowRef->[$r]*$RowRef->[$c];
}
198828 80.1ms $xpx[$r]->[$c] = $sum;
#reflect on other side of matrix
198828 82.1ms $xpx[$c]->[$r] = $sum if ($r != $c);
14202 1.00ms }
526 2.00ms }
526 2.00ms return \@xpx;
}
Since each element of the result matrix can be calculated independently, it should be possible to calculate some/all of them in parallel. In other words, none of the instances of the innermost loop depend on the results of any other, so they could run simultaneously on their own threads.
There really isn't much you can do here without rewriting parts in C or moving to a better framework for mathematical operations than bare-bones Perl (→ PDL!).
Some minor optimization ideas:
You initialize @xpx with arrayrefs containing zeros. This is unnecessary, as you assign a value to every position either way. If you want to pre-allocate array space, assign to the $#array value:
my @array;
$#array = 100; # preallocate space for 101 scalars
This isn't generally useful, but you can benchmark with and without.
Iterate over ranges; don't use C-style for loops:
for my $c ($r .. $variables - 1) { ... }
Perl scalars aren't very fast for math operations, so offloading the range iteration to lower levels will gain a speedup.
Experiment with changing the order of the loops, and toy around with caching a level of array accesses. Keeping my $xpx_r = $xpx[$r] around in a scalar will reduce the number of array accesses. If your input is large enough, this translates into a speed gain. Note that this only works when the cached value is a reference.
Remember that perl does very few “big” optimizations, and that the opcode tree produced by compilation closely resembles your source code.
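Put together, the minor ideas above might look like the sketch below. The name ComputeXpX_sketch is made up to keep it apart from the original function, and whether it is measurably faster has not been tested here, so profile it against the original before adopting it.

```perl
use strict;
use warnings;

# Sketch only: same result as ComputeXpX, but with range loops, no separate
# zero-fill pass, and the current output row cached in a lexical.
sub ComputeXpX_sketch {
    my ($MatrixRef, $rows, $variables) = @_;   # $rows is unused, kept for interface parity
    my $last = $variables - 1;
    my @xpx  = map { [] } 0 .. $last;          # one empty row per variable

    for my $r (0 .. $last) {
        my $xpx_r = $xpx[$r];                  # cache the row reference
        for my $c ($r .. $last) {
            my $sum = 0;
            $sum += $_->[$r] * $_->[$c] for @$MatrixRef;
            $xpx_r->[$c]   = $sum;
            $xpx[$c]->[$r] = $sum;             # mirror into the lower triangle
        }
    }
    return \@xpx;
}

# Tiny 3x2 example matrix, chosen only for illustration.
my $m   = [ [ 1, 2 ], [ 3, 4 ], [ 5, 6 ] ];
my $xpx = ComputeXpX_sketch( $m, 3, 2 );
print join( ' ', @$_ ), "\n" for @$xpx;
```

The inner statement-modifier loop still performs one multiply-add per input row, so the 3.77s hotspot in the profile stays the dominant cost; the changes only trim the overhead around it.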
Edit: On threading
Perl threads are heavyweight beasts that literally clone the current interpreter. It is very much like forking.
Sharing data structures across thread boundaries is possible (use threads::shared; my $variable :shared = "foo") but there are various pitfalls. It is cleaner to pass data around in a Thread::Queue.
Splitting the calculation of one product over multiple threads could end up with your threads doing more communication than calculation. You could benchmark a solution that divides responsibility for certain rows between the threads. But I think recombining the solutions efficiently would be difficult here.
More likely to be useful is to have a bunch of worker threads running from the beginning. All threads listen to a queue which contains a pair of a matrix and a return queue. The worker would then dequeue a problem, and send back the solution. Multiple calculations could be run in parallel, but a single matrix multiplication will be slower. Your other code would have to be refactored significantly to take advantage of the parallelism.
Untested code:
use strict; use warnings; use threads; use Thread::Queue;
# spawn worker threads:
my $problem_queue = Thread::Queue->new;
my @threads = map threads->new(\&worker, $problem_queue), 1..3; # make 3 workers
# automatically close threads when program exits
END {
$problem_queue->enqueue((undef) x @threads);
$_->join for @threads;
}
# This is the wrapper around the threading,
# and can be called exactly as ComputeXpX
sub async_XpX {
my $return_queue = Thread::Queue->new();
$problem_queue->enqueue([$return_queue, @_]);
return sub { $return_queue->dequeue };
}
# The main loop of worker threads
sub worker {
my ($queue) = @_;
while(defined(my $problem = $queue->dequeue)) {
my ($return, @args) = @$problem;
$return->enqueue(ComputeXpX(@args));
}
}
sub ComputeXpX { ... } # as before
The async_XpX returns a coderef that will eventually collect the result of the computation. This allows us to carry on with other stuff until we need the result.
# start two calculations
my $future1 = async_XpX(...);
my $future2 = async_XpX(...);
...; # do something else
# collect the results
my $result1 = $future1->();
my $result2 = $future2->();
I benchmarked the bare-bones threading code without doing actual calculations, and the communication is about as expensive as the calculations. I.e. with a bit of luck, you may start to get a benefit on a machine with at least four processors/kernel threads.
A note on profiling threaded code: I know of no way to do that elegantly. Benchmarking the threaded code while profiling only single-threaded test cases may be preferable.
