How can this Perl sub be optimised for speed?

The following Perl sub is used to store arrays of hashes.
Each hash to be stored is first checked for uniqueness using a given key; if a hash with the same key value already exists in the array, it is not stored.
How can this Perl sub be optimised for speed?
Example use:
my @members;
...
$member = {};
$member->{'name'} = 'James';
hpush('name', \@members, $member);
The sub:
sub hpush {
    # push a set of key value pairs onto an array as a hash, if the key doesn't already exist
    if (@_ != 3) {
        print STDERR "hpush requires 3 args, ".@_." given\n";
        return;
    }
    my $uniq   = shift;
    my $rarray = shift;
    my $rhash  = shift;
    my $hash = ();
    #print "\nHash:\n";
    for my $key ( keys %{$rhash} ) {
        my $valuea = $rhash->{$key};
        #print "key: $key\n";
        #print "key=>value: $key => $valuea\n";
        $hash->{$key} = $valuea;
    }
    #print "\nCurrent Array:\n";
    for my $node (@{$rarray}) {
        #print "node: $node \n";
        for my $key ( keys %{$node} ) {
            my $valueb = $node->{$key};
            #print "key=>value: $key => $valueb\n";
            if ($key eq $uniq) {
                #print "key=>value: $key => $valueb\n";
                if (($valueb =~ m/^[0-9]+$/) && ($hash->{$key} == $valueb)) {
                    #print "Not pushing i $key -> $valueb\n";
                    return;
                } elsif ($hash->{$key} eq $valueb) {
                    #print "Not pushing s $key -> $valueb\n";
                    return;
                }
            }
        }
    }
    push @{$rarray}, $hash;
    #print "Pushed\n";
}
Note that the Perl isn't mine and I'm a Perl beginner.

This code is rather... not very efficient. First, it copies $rhash to $hash with a for loop, for no apparent reason. Then it loops through all the keys of each node instead of simply looking up the one key it's interested in. Then it does two equivalent checks, apparently an attempt to distinguish numbers from non-numbers and select the appropriate comparison (== or eq). This is all unnecessary.
The code below should be roughly equivalent. I've trimmed it down hard; this should be as fast as it is possible to get it.
use strict;
use warnings;

hpush('name', \@members, $member);

sub hpush {
    my ($uniq, $rarray, $rhash) = @_;
    for my $node (@{$rarray}) {
        if (exists $node->{$uniq}) {
            return if ($node->{$uniq} eq $rhash->{$uniq});
        }
    }
    push @{$rarray}, $rhash;
}
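One more thought: if hpush is called many times on a large array, even this trimmed version still scans the whole array on every call, so each insert is O(n). The standard trick is to carry a side lookup table of key values already pushed (in Perl, a %seen hash kept next to the array), which makes each check O(1). Here is a sketch of the idea in Python; the hpush_indexed name and the explicit seen set are illustrative, not part of the original code:

```python
def hpush_indexed(uniq, array, record, seen):
    """Append record to array unless a record with the same value
    under the `uniq` key was already pushed. `seen` is a set shared
    across calls, giving O(1) duplicate checks instead of a scan."""
    value = record[uniq]
    if value in seen:
        return False          # duplicate key value: do not store
    seen.add(value)
    array.append(record)
    return True

members, seen = [], set()
hpush_indexed('name', members, {'name': 'James'}, seen)
hpush_indexed('name', members, {'name': 'James'}, seen)  # skipped
print(len(members))  # 1
```

The trade-off is that the set must be kept in sync with the array; if records are removed elsewhere, the index goes stale.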


Perl6: What is the best way of dealing with very big files?

Last week I decided to give Perl6 a try and started to reimplement one of my programs.
I have to say, Perl6 makes object-oriented programming very easy, an aspect that was very painful for me in Perl5.
My program has to read and store big files, such as whole genomes (up to 3 GB and more, see Example 1 below) or tabulated data.
The first version of the code was written the Perl5 way, by iterating line by line ("genome.fa".IO.lines). It was very slow, with an unusable execution time.
my class fasta {
    has Str $.file is required;
    has %!seq;

    submethod TWEAK() {
        my $id;
        my $s;
        for $!file.IO.lines -> $line {
            if $line ~~ /^\>/ {
                say $id;
                if $id.defined {
                    %!seq{$id} = sequence.new(id => $id, seq => $s);
                }
                my $l = $line;
                $l ~~ s:g/^\>//;
                $id = $l;
                $s = "";
            }
            else {
                $s ~= $line;
            }
        }
        %!seq{$id} = sequence.new(id => $id, seq => $s);
    }
}

sub MAIN()
{
    my $f = fasta.new(file => "genome.fa");
}
So after a little bit of RTFM, I changed to a slurp of the file and a split on \n, which I parsed with a for loop. This way I managed to load the data in 2 minutes. Much better, but not enough. By cheating, I mean by removing a maximum of \n characters (Example 2), I decreased the execution time to 30 seconds. Quite good, but not totally satisfying, as this fasta format is not the most commonly used.
my class fasta {
    has Str $.file is required;
    has %!seq;

    submethod TWEAK() {
        my $id;
        my $s;
        say "Slurping ...";
        my $f = $!file.IO.slurp;
        say "Spliting file ...";
        my @lines = $f.split(/\n/);
        say "Parsing lines ...";
        for @lines -> $line {
            if $line !~~ /^\>/ {
                $s ~= $line;
            }
            else {
                say $id;
                if $id.defined {
                    %!seq{$id} = seq.new(id => $id, seq => $s);
                }
                $id = $line;
                $id ~~ s:g/^\>//;
                $s = "";
            }
        }
        %!seq{$id} = seq.new(id => $id, seq => $s);
    }
}

sub MAIN()
{
    my $f = fasta.new(file => "genome.fa");
}
So RTFM again, and I discovered the magic of Grammars. So, a new version, with an execution time of 45 seconds whatever fasta format is used. Not the fastest way, but more elegant and stable.
my grammar fastaGrammar {
    token TOP { <fasta>+ }

    token fasta  { <.ws><header><seq> }
    token header { <sup><id>\n }
    token sup    { '>' }
    token id     { <[\d\w]>+ }
    token seq    { [<[ACGTNacgtn]>+\n]+ }
}

my class fastaActions {
    method TOP ($/) {
        my @seqArray;
        for $<fasta> -> $f {
            @seqArray.push: seq.new(id => $f.<header><id>.made, seq => $f<seq>.made);
        }
        make @seqArray;
    }
    method fasta ($/) { make ~$/; }
    method id    ($/) { make ~$/; }
    method seq   ($/) { make $/.subst("\n", "", :g); }
}

my class fasta {
    has Str $.file is required;
    has %!seq;

    submethod TWEAK() {
        say "=> Slurping ...";
        my $f = $!file.IO.slurp;
        say "=> Grammaring ...";
        my @seqArray = fastaGrammar.parse($f, actions => fastaActions).made;
        say "=> Storing data ...";
        for @seqArray -> $s {
            %!seq{$s.id} = $s;
        }
    }
}

sub MAIN()
{
    my $f = fasta.new(file => "genome.fa");
}
I think I found a good solution to handle these kinds of big files, but performance is still below that of Perl5.
As a newbie in Perl6, I would be interested to know if there are better ways to deal with big data, or if there is some limitation due to the Perl6 implementation.
As a newbie in Perl6, I would ask two questions:
Are there other Perl6 mechanisms that I'm not yet aware of, or that are not yet documented, for storing huge data from a file (like my genomes)?
Did I reach the maximum performance for the current version of Perl6?
Thanks for reading!
Fasta example 1:
>2L
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATG
ATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGAT
...
>3R
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATG
ATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGAT
...
Fasta example 2:
>2L
GACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCAT...
>3R
TAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCT...
EDIT
I applied the advice of @Christoph and @timotimo and tested with this code:
my class fasta {
    has Str $.file is required;
    has %!seq;

    submethod TWEAK() {
        say "=> Slurping / Parsing / Storing ...";
        %!seq = slurp($!file, :enc<latin1>).split('>').skip(1).map: {
            .head => seq.new(id => .head, seq => .skip(1).join) given .split("\n").cache;
        }
    }
}

sub MAIN()
{
    my $f = fasta.new(file => "genome.fa");
}
The program finished in 2.7 s, which is great!
I also tried this code on the wheat genome (10 GB). It finished in 35.2 s.
Perl6 is not so slow after all!
Big thanks for the help!
One simple improvement is to use a fixed-width encoding such as latin1 to speed up character decoding, though I'm not sure how much this will help.
As far as Rakudo's regex/grammar engine is concerned, I've found it to be pretty slow, so it might indeed be necessary to take a more low-level approach.
I did not do any benchmarking, but what I'd try first is something like this:
my %seqs = slurp('genome.fa', :enc<latin1>).split('>')[1..*].map: {
    .[0] => .[1..*].join given .split("\n");
}
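The split-on-'>' idea may be easier to see spelled out imperatively; here is a hedged Python sketch of the same parsing (parse_fasta is an illustrative name, not part of the answer's code): each chunk after a '>' is one record, its first line is the id, and the remaining lines join into the sequence.

```python
def parse_fasta(text):
    """Split-based FASTA parsing: each '>'-delimited chunk is one
    record; its first line is the id, the remaining lines join
    into the sequence string."""
    seqs = {}
    for chunk in text.split('>')[1:]:  # skip the empty piece before the first '>'
        lines = chunk.split('\n')
        seqs[lines[0]] = ''.join(lines[1:])
    return seqs

fasta = ">2L\nACGTACGT\nTTGCAA\n>3R\nGGCCTTAA\n"
print(parse_fasta(fasta))  # {'2L': 'ACGTACGTTTGCAA', '3R': 'GGCCTTAA'}
```

Like the Perl6 one-liner, this trades memory (the whole file is in RAM) for a single pass with no per-line regex work.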
As the Perl6 standard library is implemented in Perl6 itself, it is sometimes possible to improve performance by just avoiding it, writing code in an imperative style such as this:
my %seqs;
my $data = slurp('genome.fa', :enc<latin1>);
my $pos = 0;
loop {
    $pos = $data.index('>', $pos) // last;
    my $ks = $pos + 1;
    my $ke = $data.index("\n", $ks);
    my $ss = $ke + 1;
    my $se = $data.index('>', $ss) // $data.chars;
    my @lines;
    $pos = $ss;
    while $pos < $se {
        my $end = $data.index("\n", $pos);
        @lines.push($data.substr($pos..^$end));
        $pos = $end + 1
    }
    %seqs{$data.substr($ks..^$ke)} = @lines.join;
}
However, if the parts of the standard library used have seen some performance work, this might actually make things worse. In that case, the next step would be adding low-level type annotations such as str and int, and replacing calls to routines such as .index with NQP builtins such as nqp::index.
If that's still too slow, you're out of luck and will need to switch languages, e.g. calling into Perl5 by using Inline::Perl5, or into C using NativeCall.
Note that @timotimo has done some performance measurements and written an article about it.
If my short version is the baseline, the imperative version improves performance by 2.4x.
He actually managed to squeeze a 3x improvement out of the short version by rewriting it to
my %seqs = slurp('genome.fa', :enc<latin-1>).split('>').skip(1).map: {
    .head => .skip(1).join given .split("\n").cache;
}
Finally, rewriting the imperative version using NQP builtins sped things up by a factor of 17x. Given the potential portability issues, writing such code is generally discouraged, but it may be necessary for now if you really need that level of performance:
use nqp;

my Mu $seqs := nqp::hash();
my str $data = slurp('genome.fa', :enc<latin1>);
my int $pos = 0;
my str @lines;

loop {
    $pos = nqp::index($data, '>', $pos);
    last if $pos < 0;
    my int $ks = $pos + 1;
    my int $ke = nqp::index($data, "\n", $ks);
    my int $ss = $ke + 1;
    my int $se = nqp::index($data, '>', $ss);
    if $se < 0 {
        $se = nqp::chars($data);
    }
    $pos = $ss;
    my int $end;
    while $pos < $se {
        $end = nqp::index($data, "\n", $pos);
        nqp::push_s(@lines, nqp::substr($data, $pos, $end - $pos));
        $pos = $end + 1
    }
    nqp::bindkey($seqs, nqp::substr($data, $ks, $ke - $ks), nqp::join("", @lines));
    nqp::setelems(@lines, 0);
}

How to take out certain elements from a pdb file

I am trying to take out certain columns from a pdb file. I have already taken out all the lines that start with ATOM in my code. For some reason my sub functions are not working, and I do not know where or how to call them.
My code is:
open (FILE, $ARGV[0])
    or die "Could not open file\n";

my @newlines;
while ( my $line = <FILE> ) {
    if ($line =~ m/^ATOM.*/) {
        push @newlines, $line;
    }
}
my $atomcount = @newlines;
#print "@newlines\n";
#print "$atomcount\n";
##############################################################
#This function will take out the element from each line
#The element is from column 77 and contains one or two letters
sub atomfreq {
    foreach my $record1 (@newlines) {
        my $element = substr($record1, 76, 2);
        print "$element\n";
        return;
    }
}
################################################################
#This function will take out the residue name from each line
#The residue name is from column 18 and contains 3 letters
sub resfreq {
    foreach my $record2 (@newlines) {
        my $residue = substr($record2, 17, 3);
        print "$residue\n";
        return;
    }
}
As @Ossip already said in this answer, you simply need to call your functions:
sub atomfreq {
    ...
}
sub resfreq {
    ...
}

atomfreq();
resfreq();
But I'm not sure whether these functions do what you intended, because the comments imply that they should print every $residue and $element from the @newlines array. You've put a return statement inside the for loop, which immediately returns from the whole function (and its for loop), so it will print only the first $residue or $element. Because the functions aren't supposed to return anything, you can just drop that statement:
sub atomfreq {
    foreach my $record1 (@newlines) {
        my $element = substr($record1, 76, 2);
        print "$element\n";
    }
}
sub resfreq {
    foreach my $record2 (@newlines) {
        my $residue = substr($record2, 17, 3);
        print "$residue\n";
    }
}

atomfreq();
resfreq();
You can just call them right under your other code like this:
atomfreq();
resfreq();
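One thing that often trips people up with these substr calls is that PDB columns are specified 1-based while substr offsets are 0-based, which is why "column 77" becomes substr($record1, 76, 2). A small Python sketch of the same column arithmetic; the atom_fields name and the sample ATOM line are fabricated for illustration:

```python
def atom_fields(line):
    """Extract the element (PDB columns 77-78) and residue name
    (columns 18-20) from an ATOM record. The 0-based slices match
    substr($record, 76, 2) and substr($record, 17, 3) in the Perl."""
    return line[76:78].strip(), line[17:20].strip()

# A fabricated but format-shaped 80-column ATOM line:
atom = ("ATOM      1  N   MET A   1      "
        "38.198  19.582  28.998  1.00 12.00           N  ")
element, residue = atom_fields(atom)
print(element, residue)  # N MET
```

Stripping the slices matters because both fields are space-padded to fixed width in the file.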

Open Reading Frame program not printing Amino Acid Sequences

I am working on a program that will be able to read a gene sequence and give me the open reading frames (ORFs), and then the protein sequence of each ORF. I have already gotten the code to work for finding the ORFs, but no amino acids will print. I am using Perl on my Mac.
I would like to get the code to tell me the string of amino acids produced from the open reading frames.
Here is my code:
#!/usr/bin/perl
#ORF_Find.txt -> finds long orfs in a DNA sequence

open(CHROM, "chr03.txt"); #Open file chr03.txt containing yeast chrom. 3

$DNA = "";         #start with empty DNA sequence
$header = <CHROM>; #get header of sequence

#Read line from file, join to end of $DNA, repeat until end of file
while ($current_line = <CHROM>)
{
    chomp($current_line); #remove newline from end of current_line
    $DNA = $DNA . $current_line;
}

#length of DNA sequence
$DNA_length = length($DNA);
#flag for ORF finder
$inORF = 0;
#number of ORFs found
$numORFs = 0;
#minimum length
$minimum_codons = 100;

#search each reading frame
for ($frame = 0; $frame < 3; $frame++)
{
    print "\nFinding ORFs in frame: +" . ($frame + 1) . "\n";
    #search for sequence match and print position of match if found
    for ($i =frame; $i<=($DNA_length-3);$i += 3)
    {
        #get current codon from sequence
        $codon = substr($DNA, $i, 3);
        #if not in orf search for ATG, else search for stop codon
        if ($inORF == 0)
        {
            #if current codon is ATG, start ORF
            if ($codon eq "ATG")
            {
                $inORF = 1;
                $ORF_length = 1;
                $ORF_start = $i;
            }
        }
        elsif ($inORF == 1)
        {
            #if current codon is a stop codon, end ORF
            if ($codon eq "TGA" || $codon eq "TAG" || $codon eq "TAA")
            {
                #if ORF has at least min number of codons, print location
                if ($ORF_length >= $minimum_codons)
                {
                    print "FOUND ORF AT POSITION $ORF_start,";
                    print "length = $ORF_length\n";
                    $numORFs++;
                }
                #reset ORF variables
                $inORF = 0;
                $ORF_length = 0;
            }
            else
            {
                #increase length of ORF by one codon
                $ORF_length++;
            }
        }
    }
}
#change T to U
$DNA =~ s/T/U/g;

#search each ORF
for ($i = $ORF_start; $i <= ($ORF_length - 3); $i += 3)
{
    #get codon from each ORF
    $aa_codon = substr($DNA, $i, 3);
    #find amino acids
    foreach ($aa_codon eq "ATG")
    {
        print ("M") #METHIONINE
    }
    foreach ($aa_codon =~ /UU[UC]/)
    {
        print ("F") #PHENYLALANINE
    }
    foreach ($aa_codon =~ /UU[AG]/ || $aa_codon =~ /CU[UCAG]/)
    {
        print ("L"); #LEUCINE
    }
    foreach ($aa_codon =~ /AU[UAC]/)
    {
        print ("I"); #ISOLEUCINE
    }
    foreach ($aa_codon =~ /GU[UACG]/)
    {
        print ("V"); #VALINE
    }
    foreach ($aa_codon =~ /UC[UCAG]/ || $aa_codon =~ /AG[UC]/)
    {
        print ("S"); #SERINE
    }
    foreach ($aa_codon =~ /CC[UCAG]/)
    {
        print ("P"); #PROLINE
    }
    foreach ($aa_codon =~ /AC[UCAG]/)
    {
        print ("T"); #THREONINE
    }
    foreach ($aa_codon =~ /GC[UCAG]/)
    {
        print ("A"); #ALANINE
    }
    foreach ($aa_codon =~ /UA[UC]/)
    {
        print ("Y"); #TYROSINE
    }
    foreach ($aa_codon =~ /CA[UC]/)
    {
        print ("H"); #HISTIDINE
    }
    foreach ($aa_codon =~ /CA[AG]/)
    {
        print ("G"); #GLUTAMINE
    }
    foreach ($aa_codon =~ /AA[UC]/)
    {
        print ("N"); #ASPARAGINE
    }
    foreach ($aa_codon =~ /AA[AG]/)
    {
        print ("K"); #LYSINE
    }
    foreach ($aa_codon =~ /GA[UC]/)
    {
        print ("D"); #ASPARTIC ACID
    }
    foreach ($aa_codon =~ /GA[AG]/)
    {
        print ("E"); #GLUTAMIC ACID
    }
    foreach ($aa_codon =~ /UG[UC]/)
    {
        print ("C"); #CYSTINE
    }
    foreach ($aa_codon eq "UGG")
    {
        print ("W"); #TRYPTOPHAN
    }
    foreach ($aa_codon =~ /AG[AG]/ || $aa_codon =~ /CG[UCAG]/)
    {
        print ("R"); #ARGININE
    }
    foreach ($aa_codon =~ /GG[UCAG]/)
    {
        print ("G"); #GLYCINE
    }
    foreach ($aa_codon =~ /UA[AG]/ || $aa_codon eq "UGA")
    {
        print ("*") #STOP
    }
}

#if no ORFS found, print message
if ($numORFs == 0)
{
    print ("NO ORFS FOUND\n");
}
else
{
    print ("\n$num_ORFs ORFS WERE FOUND\n");
}
First, this question would probably be more appropriate for a forum such as seqAnswers or BioStars. That aside, writing your own 6-frame translation script is a complex task, especially if you want to account for IUPAC ambiguous nucleotides. There are already lots of scripts and tools out there that do this. Probably the easiest suggestion I can make is to use one of the existing tools. Try mine, for example:
https://github.com/hepcat72/sixFrameTranslation/archive/master.zip
My script wasn't public until just now. I have opened it up so that you can use it. Just run it to get a usage.
Other than that, if you want to get your version running properly, the first thing you can do is change your shebang to:
#!/usr/bin/perl -w
Note the -w. Then, add this line to the top of your script:
use strict;
It will help you debug issues such as the missing dollar sign in one of your for loops:
for ($i =frame; $i<=($DNA_length-3);$i += 3)
It should be:
for ($i =$frame; $i<=($DNA_length-3);$i += 3)
And BTW, it doesn't matter that you're running perl on your Mac. It's just perl. "Mac perl" was a project to create a perl environment back in the pre-OS-X days.
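As for the translation logic itself: the long chain of foreach blocks in the question is where things are most fragile (foreach is used where if was intended, so the blocks don't do what they look like they do). The usual structure is a table-driven lookup keyed by codon. A hedged Python sketch of that idea; the partial CODON_TABLE (a real one has 64 entries) and the translate name are illustrative only:

```python
# Partial RNA codon table; a complete table maps all 64 codons.
CODON_TABLE = {
    'AUG': 'M', 'UUU': 'F', 'UUC': 'F',
    'UGG': 'W', 'UAA': '*', 'UAG': '*', 'UGA': '*',
}

def translate(rna):
    """Translate an RNA string codon by codon; one dictionary lookup
    replaces the whole chain of per-codon regex checks. Codons not
    in the (partial) table come back as 'X'."""
    return ''.join(CODON_TABLE.get(rna[i:i + 3], 'X')
                   for i in range(0, len(rna) - 2, 3))

print(translate('AUGUUUUGG'))  # MFW
```

With a full table this also removes the risk of a codon matching several patterns (or none) in the regex chain.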

Why iterating with $_ takes longer

I was wondering, based on many books and articles on the Internet, whether $_ is really a faster way of iterating through an array (no instantiation of a new variable), but somehow I always get different results. Here's the performance test code:
#!/usr/bin/perl
use Time::HiRes qw(time);
use strict;
use warnings;

# $_ is a default argument for many operators, and also for some control structures.
my $test_array = [1..1000000];
my $number_of_tests = 100;
my $dollar_wins = 0;
my $dollar_wins_sum = 0;

for (my $i = 1; $i <= $number_of_tests; $i++) {
    my $odd_void_array = [];
    my $start_time_1 = time();
    foreach my $item (@{$test_array}) {
        if ($item % 2 == 1) {
            push (@{$odd_void_array}, $item);
        }
    }
    foreach my $item_odd (@{$odd_void_array}) {
    }
    my $end_time_1 = time();

    $odd_void_array = [];
    my $start_time_2 = time();
    foreach (@{$test_array}) {
        if ($_ % 2 == 1) {
            push (@{$odd_void_array}, $_);
        }
    }
    foreach (@{$odd_void_array}) {
    }
    my $end_time_2 = time();

    my $diff = ($end_time_1-$start_time_1) - ($end_time_2-$start_time_2);
    if ($diff > 0) {
        $dollar_wins ++;
        $dollar_wins_sum += $diff;
        print "Dollar won ($dollar_wins out of $i) with diff $diff \n";
    }
}
print "=================================\n";
print "When using dollar underscore, execution was faster in $dollar_wins cases (".(($dollar_wins/$number_of_tests)*100)."%), with average difference of ".($dollar_wins_sum/$dollar_wins)."\n";
So I iterate twice (once assigning to my $item, once without). I mostly find that iterating with $_ is faster in only about 20-30% of cases.
Shouldn't iterating without a new variable be faster?
You aren't really benchmarking iteration with different variables:
Your timings include array creation and other calculations.
You only tell which is faster, not by how much.
You have too few iterations to tell anything reliable.
Let's take this better test that actually benchmarks what you are claiming to benchmark:
use strict;
use warnings;
use Benchmark ':hireswallclock', 'cmpthese';

my @numbers = 1..100_000;

cmpthese -3, {
    '$_' => sub {
        for (@numbers) {
            1;
        }
    },
    'my $x' => sub {
        for my $x (@numbers) {
            1;
        }
    },
    '$x' => sub {
        my $x;
        for $x (@numbers) {
            1;
        }
    },
}
Result:
         Rate    $_ my $x    $x
$_      107/s    --   -0%   -0%
my $x   107/s    0%    --   -0%
$x      108/s    0%    0%    --
So they are equally fast on my test system (perl 5.18.2 built for i686-linux-thread-multi-64int).
My suspicion is that using $_ is slightly slower than a lexical, as it's a global variable. However, the speed of iteration is equivalent. Indeed, modifying the benchmark…
use strict;
use warnings;
use Benchmark ':hireswallclock', 'cmpthese';

my @numbers = 1..100_000;

cmpthese -3, {
    '$_' => sub {
        for (@numbers) {
            $_ % 2 == 0;
        }
    },
    'my $x' => sub {
        for my $x (@numbers) {
            $x % 2 == 0;
        }
    },
    '$x' => sub {
        my $x;
        for $x (@numbers) {
            $x % 2 == 0;
        }
    },
}
… gives
         Rate    $_    $x my $x
$_     40.3/s    --   -1%   -6%
$x     40.6/s    1%    --   -5%
my $x  42.9/s    7%    6%    --
but the effects are still too small to draw any solid conclusion.

Why does password input in Perl using Win32::Console require pressing Enter twice?

I am fairly new to Perl and am having a hard time grasping the behavior of the following password input code snippet:
use Win32::Console;

my $StdIn = new Win32::Console(STD_INPUT_HANDLE);
my $Password = "";
$StdIn->Mode(ENABLE_PROCESSED_INPUT());
local $| = 1;
print "Enter Password: ";
while (my $Data = $StdIn->InputChar(1)) {
    if ("\r" eq $Data ) {
        last;
    }
    elsif ("\ch" eq $Data ) {
        if ( "" ne chop( $Password )) {
            print "\ch \ch";
        }
        next;
    }
    $Password .= $Data;
    print "*";
}
while (my $Data = $StdIn->InputChar(1)) {
    print "\nShow password? [y/n] ";
    if ("n" eq $Data) {
        last;
    }
    elsif ("y" eq $Data) {
        print "\nPassword: $Password\n";
        last;
    }
}
Basically what happens is that the script prompts the user for a password and displays * for every character input, as expected, but requires Enter to be pressed twice to accept the input. However, if I delete the second while loop (or replace it with a print $Password statement), the input only requires one press of Enter.
I have also noticed that in the second while loop, which prompts the user to enter y or n (without needing to press Enter), if the user enters 'y' then the line Show password? [y/n] is repeated before displaying the password.
Some insight on this behavior would be appreciated.
The first Enter gets you out of the first while loop. The second while loop then waits for another character before displaying the prompt. You should display the prompt before asking for another character (and display it only once).
Breaking things into subroutines helps build on basic blocks.
use strict;
use warnings;
use Win32::Console;

run();

sub run {
    my $StdIn = Win32::Console->new(STD_INPUT_HANDLE);
    $StdIn->Mode(ENABLE_PROCESSED_INPUT);
    my $Password = prompt_password($StdIn, "Enter Password: ", '*');
    if ( prompt_echo($StdIn, "\nShow password? [y/n] ") ) {
        print "\nPassword = $Password\n";
    }
    return;
}

sub prompt_password {
    my ($handle, $prompt, $mask) = @_;
    my ($Password);
    local $| = 1;
    print $prompt;
    $handle->Flush;
    while (my $Data = $handle->InputChar(1)) {
        last if "\r" eq $Data;
        if ("\ch" eq $Data ) {
            if ( "" ne chop( $Password )) {
                print "\ch \ch";
            }
            next;
        }
        $Password .= $Data;
        print $mask;
    }
    return $Password;
}

sub prompt_echo {
    my ($handle, $prompt) = @_;
    local $| = 1;
    print $prompt;
    $handle->Flush;
    while (my $Data = $handle->InputChar(1)) {
        return   if "n" eq $Data;
        return 1 if "y" eq $Data;
    }
    return;
}
