Last week I decided to give a try to Perl6 and started to reimplement one of my program.
I have to say, Perl6 is so the easy for object programming, an aspect very painfull to me in Perl5.
My program have to read and store big files, such as whole genomes (up to 3 Gb and more, See example 1 below) or tabulate data.
The first version of the code was made in the Perl5 way by iterating line by line ("genome.fa".IO.lines). It was very slow and unsable for a correct execution time.
my class fasta {
has Str $.file is required;
has %!seq;
submethod TWEAK() {
my $id;
my $s;
for $!file.IO.lines -> $line {
if $line ~~ /^\>/ {
say $id;
if $id.defined {
%!seq{$id} = sequence.new(id => $id, seq => $s);
}
my $l = $line;
$l ~~ s:g/^\>//;
$id = $l;
$s = "";
}
else {
$s ~= $line;
}
}
%!seq{$id} = sequence.new(id => $id, seq => $s);
}
}
sub MAIN()
{
my $f = fasta.new(file => "genome.fa");
}
So after a little bit of RTFM, I changed for a slurp on the file, a split on the \n which I parsed with a for loop. This way I managed to load the data in 2 min. Much better but not enough. By cheating, I mean by removing a maximum of \n (Example 2), I decreased the execution time to 30 seconds. Quite good, but not totaly satisfied, by this fasta format is not the most used.
my class fasta {
has Str $.file is required;
has %!seq;
submethod TWEAK() {
my $id;
my $s;
say "Slurping ...";
my $f = $!file.IO.slurp;
say "Spliting file ...";
my #lines = $f.split(/\n/);
say "Parsing lines ...";
for #lines -> $line {
if $line !~~ /^\>/ {
$s ~= $line;
}
else {
say $id;
if $id.defined {
%!seq{$id} = seq.new(id => $id, seq => $s);
}
$id = $line;
$id ~~ s:g/^\>//;
$s = "";
}
}
%!seq{$id} = seq.new(id => $id, seq => $s);
}
}
sub MAIN()
{
my $f = fasta.new(file => "genome.fa");
}
So RTFM again and I discovered the magic of Grammar. So new version and an execution time of 45 seconds whatever the fasta format used. Not the fastest way but more elegant and stable.
my grammar fastaGrammar {
token TOP { <fasta>+ }
token fasta {<.ws><header><seq> }
token header { <sup><id>\n }
token sup { '>' }
token id { <[\d\w]>+ }
token seq { [<[ACGTNacgtn]>+\n]+ }
}
my class fastaActions {
method TOP ($/){
my #seqArray;
for $<fasta> -> $f {
#seqArray.push: seq.new(id => $f.<header><id>.made, seq => $f<seq>.made);
}
make #seqArray;
}
method fasta ($/) { make ~$/; }
method id ($/) { make ~$/; }
method seq ($/) { make $/.subst("\n", "", :g); }
}
my class fasta {
has Str $.file is required;
has %seq;
submethod TWEAK() {
say "=> Slurping ...";
my $f = $!file.IO.slurp;
say "=> Grammaring ...";
my #seqArray = fastaGrammar.parse($f, actions => fastaActions).made;
say "=> Storing data ...";
for #seqArray -> $s {
%!seq{$s.id} = $s;
}
}
}
sub MAIN()
{
my $f = fasta.new(file => "genome.fa");
}
I think that I found good solution to handle these kind of big files, but performances are still under those of Perl5.
As a newbie in Perl6, I would be interested to know if there is better ways to deal with big data or if there is some limitation due to the Perl6 implementation ?
As a newbie in Perl6, I would ask two questions :
Is there other Perl6 mechanisms that I'm not aware yet, or not yet
documented, for storing huge data from a file (like my genomes) ?
Did I reach the maximum performances for the current version of
Perl6 ?
Thanks for reading !
Fasta Example 1 :
>2L
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATG
ATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGAT
...
>3R
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATG
ATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGAT
...
Fasta example 2 :
>2L
GACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCAT...
>3R
TAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCT...
EDIT
I applied advises of #Christoph and #timotimo and test with code:
my class fasta {
has Str $.file is required;
has %!seq;
submethod TWEAK() {
say "=> Slurping / Parsing / Storing ...";
%!seq = slurp($!file, :enc<latin1>).split('>').skip(1).map: {
.head => seq.new(id => .head, seq => .skip(1).join) given .split("\n").cache;
}
}
}
sub MAIN()
{
my $f = fasta.new(file => "genome.fa");
}
The program finished in 2.7s, which is so great !
I also tried this code on the wheat genome (10 Gb). It finished in 35.2s.
Perl6 is not so slow finally !
Big Thank for the help !
One simple improvement is to use a fixed-width encoding such as latin1 to speed up character decoding, though I'm not sure how much this will help.
As far as Rakudo's regex/grammar engine is concerned, I've found it to be pretty slow, so it might indeed be necessary to take a more low-level approach.
I did not do any benchmarking, but what I'd try first is something like this:
my %seqs = slurp('genome.fa', :enc<latin1>).split('>')[1..*].map: {
.[0] => .[1..*].join given .split("\n");
}
As the Perl6 standard library is implemented in Perl6 itself, it is sometimes possible to improve performance by just avoiding it, writing code in an imperative style such as this:
my %seqs;
my $data = slurp('genome.fa', :enc<latin1>);
my $pos = 0;
loop {
$pos = $data.index('>', $pos) // last;
my $ks = $pos + 1;
my $ke = $data.index("\n", $ks);
my $ss = $ke + 1;
my $se = $data.index('>', $ss) // $data.chars;
my #lines;
$pos = $ss;
while $pos < $se {
my $end = $data.index("\n", $pos);
#lines.push($data.substr($pos..^$end));
$pos = $end + 1
}
%seqs{$data.substr($ks..^$ke)} = #lines.join;
}
However, if the parts of the standard library used has seen some performance work, this might actually make things worse. In that case, the next step to take would be adding low-level type annotations such as str and int and replacing calls to routines such as .index with NQP builtins such as nqp::index.
If that's still too slow, you're out of luck and will need to switch languages, eg calling into Perl5 by using Inline::Perl5 or C using NativeCall.
Note that #timotimo has done some performance measurements and wrote an article about it.
If my short version is the baseline, the imperative version improves performance by 2.4x.
He actually managed to squeeze a 3x improvement out of the short version by rewriting it to
my %seqs = slurp('genome.fa', :enc<latin-1>).split('>').skip(1).map: {
.head => .skip(1).join given .split("\n").cache;
}
Finally, rewriting the imperative version using NQP builtins sped things up by a factor of 17x, but given potential portability issues, writing such code is generally discouraged, but may be necessary for now if you really need that level of performance:
use nqp;
my Mu $seqs := nqp::hash();
my str $data = slurp('genome.fa', :enc<latin1>);
my int $pos = 0;
my str #lines;
loop {
$pos = nqp::index($data, '>', $pos);
last if $pos < 0;
my int $ks = $pos + 1;
my int $ke = nqp::index($data, "\n", $ks);
my int $ss = $ke + 1;
my int $se = nqp::index($data ,'>', $ss);
if $se < 0 {
$se = nqp::chars($data);
}
$pos = $ss;
my int $end;
while $pos < $se {
$end = nqp::index($data, "\n", $pos);
nqp::push_s(#lines, nqp::substr($data, $pos, $end - $pos));
$pos = $end + 1
}
nqp::bindkey($seqs, nqp::substr($data, $ks, $ke - $ks), nqp::join("", #lines));
nqp::setelems(#lines, 0);
}
I am asked to do the perl program to find a value(from user input) in array. If matched "its ok". If not matched, then check within the value in the index[0] to index[1] ... index[n]. So then if the value matched to the between two elements then report which is near to these elements might be index[0] or index[1].
Let you explain.
Given array : 10 15 20 25 30;
Get the value from user : 14 (eg.)
Hence 14 matched with in the two elements that is 10(array[0]) - 15(array[1])
Ultimately the check point is do not use more than one for loop and never use the while loop. You need to check one for loop and many of if conditions.
I got the output by which I did here is:
use strict;
use warnings;
my #arr1 = qw(10 15 20 25 30);
my $in = <STDIN>;
chomp($in);
if(grep /$in/, #arr1)
{ } #print "S: $in\n"; }
else
{
for(my $i=0; $i<scalar(#arr1); $i++)
{
my $j = $i + 1;
if($in > $arr1[$i] && $in < $arr1[$j])
{
#print "SN: $arr1[$i]\t$arr1[$j]\n";
my ($inc, $dec) = "0";
my $chk1 = $arr1[$i] + 1;
AGAIN1:
if($in == $chk1)
{ }
else
{ $chk1++; $inc++; goto AGAIN1; }
my $chk2 = $arr1[$j] - 1;
AGAIN2:
if($in == $chk2){ }
else
{ $chk2--; $dec++; goto AGAIN2; }
if($inc > $dec)
{ print "Matched value nearest to $arr1[$j]\n"; }
elsif($inc < $dec)
{ print "Matched value nearest to $arr1[$i]\n"; }
}
}
}
However my question is there a way in algorithm?. Hence if someone can help on this one and it would be appreciated.
Thanks in advance.
You seem determined to make this as complicated as possible :-)
Your specification isn't completely clear, but I think this does what you want:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my #array = qw[10 15 20 25 30];
chomp(my $in = <STDIN>);
if ($in < $array[0]) {
say "$in is less than first element in the array";
exit;
}
if ($in > $array[-1]) {
say "$in is greater than last element in the array";
exit;
}
for (0 .. $#array) {
if ($in == $array[$_]) {
say "$in is in the array";
exit;
}
if ($in < $array[$_]) {
if ($in - $array[$_ - 1] < $array[$_] - $in) {
say "$in is closest to $array[$_ - 1]";
} else {
say "$in is closest to $array[$_]";
}
exit;
}
}
say "Shouldn't get here!";
Using the helper functions any and reduce from the core module List::Util and the built in abs.
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw/reduce any/;
my #arr1 = qw(10 15 20 25 30);
chomp(my $in = <STDIN>);
if (any {$in == $_} #arr1) {
print "$in is in the array\n";
}
else {
my $i = reduce { abs($in - $arr1[$a]) > abs($in - $arr1[$b]) ? $b : $a} 0 .. $#arr1;
print "$in is closest to $arr1[$i]\n";
}
I am fairly new to Perl and am having a hard time grasping the behavior of the following password input code snippet:
use Win32::Console;
my $StdIn = new Win32::Console(STD_INPUT_HANDLE);
my $Password = "";
$StdIn->Mode(ENABLE_PROCESSED_INPUT());
local $| = 1;
print "Enter Password: ";
while (my $Data = $StdIn->InputChar(1)) {
if ("\r" eq $Data ) {
last;
}
elsif ("\ch" eq $Data ) {
if ( "" ne chop( $Password )) {
print "\ch \ch";
}
next;
}
$Password .=$Data;
print "*";
}
while (my $Data = $StdIn->InputChar(1)) {
print "\nShow password? [y/n] ";
if ("n" eq $Data) {
last;
}
elsif ("y" eq $Data) {
print "\nPassword: $Password\n";
last;
}
}
Basically what happens is that the script prompts the user for a password and displays * for every character input as expected but requires Enter to be pressed twice to accept the input. However, if I delete the second while loop (or replace with a print $password statement) the input only requires one press of Enter.
I have also noticed that in the second while loop, which prompts the user to enter y or n (without needing to press Enter) if the user enters 'y' then the line Show password? [y/n] is repeated before displaying the password.
Some insight on this behavior would be appreciated.
The first Enter gets you out of the first while loop. The second while loop then waits for another character before displaying the prompt. You should display the prompt before asking for another character (and display it only once).
Breaking things into subroutines helps build on basic blocks.
use strict; use warnings;
use Win32::Console;
run();
sub run {
my $StdIn = Win32::Console->new(STD_INPUT_HANDLE);
$StdIn->Mode(ENABLE_PROCESSED_INPUT);
my $Password = prompt_password($StdIn, "Enter Password: ", '*');
if ( prompt_echo($StdIn, "\nShow password? [y/n] ") ) {
print "\nPassword = $Password\n"
}
return;
}
sub prompt_password {
my ($handle, $prompt, $mask) = #_;
my ($Password);
local $| = 1;
print $prompt;
$handle->Flush;
while (my $Data = $handle->InputChar(1)) {
last if "\r" eq $Data;
if ("\ch" eq $Data ) {
if ( "" ne chop( $Password )) {
print "\ch \ch";
}
next;
}
$Password .= $Data;
print $mask;
}
return $Password;
}
sub prompt_echo {
my ($handle, $prompt) = #_;
local $| = 1;
print $prompt;
$handle->Flush;
while (my $Data = $handle->InputChar(1)) {
return if "n" eq $Data;
return 1 if "y" eq $Data;
}
return;
}
I have a strange problem with Magentos EAV-Databasesystem.I want to store a dataset like the next codeblock. But one value 'StoreId' was not stored. Do you have any Idea why one of eight values can stored in a table.
public function populateEntriesAction()
{
//return;
for ( $i = 0; $i < 5; $i++ )
{
$branchEntry = Mage::getModel('branch/eavisbranch');
$branchEntry->setName('abc ' . $i);
$branchEntry->setStreet('abcdefg ' . $i);
.....
$branchEntry->setStorePickup(false);
$branchEntry->setStoreId('3');
$branchEntry->save();
}
echo 'Done';
}
The following perl sub is used to store arrays of hashes.
Each hash to be stored is first checked for uniqueness using a given key, if a hash exists on the array with the same key value then it's not stored.
How can this perl sub be optimised for speed?
Example use:
my #members;
...
$member= {};
$hash->{'name'}='James';
hpush('name', \#members,$member);
The sub:
sub hpush {
# push a set of key value pairs onto an array as a hash, if the key doesn't already exist
if (#_ != 3) {
print STDERR "hpush requires 3 args, ".#_." given\n";
return;
}
my $uniq = shift;
my $rarray = shift;
my $rhash = shift;
my $hash = ();
#print "\nHash:\n";
for my $key ( keys %{$rhash} ) {
my $valuea = $rhash->{$key};
#print "key: $key\n";
#print "key=>value: $key => $valuea\n";
$hash->{ $key} = $valuea;
}
#print "\nCurrent Array:\n";
for my $node (#{$rarray}) {
#print "node: $node \n";
for my $key ( keys %{$node} ) {
my $valueb = $node->{$key};
#print "key=>value: $key => $valueb\n";
if ($key eq $uniq) {
#print "key=>value: $key => $valueb\n";
if (($valueb =~ m/^[0-9]+$/) && ($hash->{$key} == $valueb)) {
#print "Not pushing i $key -> $valueb\n";
return;
} elsif ($hash->{$key} eq $valueb) {
#print "Not pushing s $key -> $valueb\n";
return;
}
}
}
}
push #{$rarray}, $hash;
#print "Pushed\n";
}
Note that the perl isn't mine and I'm a perl beginner
This code is rather... not very efficient. First, it copies $rhash to $hash, with a for loop... for some reason. Then it loops through the hash keys, instead of simply using the hash key that it's looking for. Then it does two equivalent checks, apparently some attempt to distinguish numbers from non-numbers and selecting the appropriate check (== or eq). This is all unnecessary.
This code below should be roughly equivalent. I've trimmed it down hard. This should be as fast as it is possible to get it.
use strict;
use warnings;
hpush('name', \#members,$member);
sub hpush {
my ($uniq, $rarray, $rhash) = #_;
for my $node (#{$rarray}) {
if (exists $node->{$uniq}) {
return if ($node->{$uniq} eq $rhash->{$uniq});
}
}
push #{$rarray}, $rhash;
}