Scanning and matching letter sequences using a matrix - algorithm

I have two text files which I read into hashes of the form sequence => title.
In DUOMENYS.txt the title is known; in DUOTA.txt the title is unknown.
So for every sequence in DUOTA.txt I need to find a similar sequence in DUOMENYS.txt and then print that known title.
I first tried simple matching, printing the title when more than 90% of the sequence symbols matched, but I was told that is wrong and I have to do it another way, with this table: http://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt
I have to compare each letter of the known sequence with the corresponding letter of the unknown sequence and look up a score
from -4 to 9; if the sum of all those scores >= length of sequence * 3, print that title.
Example: ABW => TITLE1 in DUOMENYS.txt, ABF => UNKNOWNTITLE in DUOTA.txt:
A B W
A B F
4 4 1 sum = 9
length 3 x 3 = 9
9 >= 9, true, print.
So the problem is I don't know how to make it happen.
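The scoring rule described above can be sketched in a few lines (a minimal illustration: the %blosum fragment below hard-codes only the three pair scores needed for the ABW/ABF example, whereas a real script would load the full BLOSUM62 table from the file):

```perl
use strict;
use warnings;

# Tiny fragment of BLOSUM62 -- only the pairs needed for the example
my %blosum = (
    'A' => { 'A' => 4 },
    'B' => { 'B' => 4 },
    'W' => { 'F' => 1 },
);

# Score two equal-length sequences position by position
sub score {
    my ($known, $unknown) = @_;
    my $sum = 0;
    for my $i (0 .. length($known) - 1) {
        $sum += $blosum{ substr($known, $i, 1) }{ substr($unknown, $i, 1) };
    }
    return $sum;
}

my $sum       = score('ABW', 'ABF');  # 4 + 4 + 1 = 9
my $threshold = length('ABW') * 3;    # 3 * 3 = 9
print "print the title\n" if $sum >= $threshold;
```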
#!/usr/bin/perl
use strict;
use Data::Dumper;

open (OUTPUT, ">OUTPUT.txt") or die "$!";   # known sequences
open (DUOMENYS, "DUOMENYS.txt") or die "$!";
open (OUTPUT1, ">OUTPUT1.txt") or die "$!"; # sequences under study
open (DUOTA, "DUOTA.txt") or die "$!";
open (OUTPUT2, ">OUTPUT2.txt") or die "$!"; # results
open (MATRIX, "MATRIX.txt") or die "$!";
#--------------------DUOMENYS-HASH-----------------------------
#my $contentFile = $ARGV[0];
my $contentFile = <DUOMENYS>;
my %testHash = ();
my $currentKey = "";
my $seka = "";
my %nhash = ();
open(my $contentFH, "<", $contentFile);
while (my $contentLine = <DUOMENYS>) {
    chomp($contentLine);
    next if ($contentLine eq ""); # skip empty lines
    if ($contentLine =~ /^\>(.*)/) {
        $testHash{$currentKey} = $seka;
        $currentKey = $1;
        $seka = "";
    } else {
        $seka .= $contentLine;
    }
}
#-------------------DUOTA-HASH-------------------------------------
#my $contentFile1 = $ARGV[0];
my $contentFile1 = <DUOTA>;
my %testHash1 = ();
my $currentKey1 = "";
my $seka1 = "";
my %nhash1 = ();
open(my $contentFH1, "<", $contentFile1);
while (my $contentLine1 = <DUOTA>) {
    chomp($contentLine1);
    next if ($contentLine1 eq ""); # skip empty lines
    if ($contentLine1 =~ /^\>(.*)/) {
        $testHash1{$currentKey1} = $seka1;
        $currentKey1 = $1;
        $seka1 = "";
    } else {
        $seka1 .= $contentLine1;
    }
}
#--------------------OUTPUT-HASH------------------------------------
%nhash = reverse %testHash;
print OUTPUT Dumper(\%nhash);
%nhash1 = reverse %testHash1;
print OUTPUT1 Dumper(\%nhash1);
#---------------------MATCHING---------------------------------------
my $klaidu_skaicius = 0;
my @sekos = keys %nhash;
my @duotos_sekos = keys %nhash1;
my $i = 0;
my $j = 0;
for ($i = 0; $i <= scalar @sekos; $i++) {
    for ($j = 0; $j <= scalar @duotos_sekos; $j++) {
        $klaidu_skaicius = ($sekos[$i] ^ $duotos_sekos[$j]) =~ tr/\0//c;
        if ($klaidu_skaicius <= length($sekos[$i]) / 10) {
            print OUTPUT2 substr($nhash{$sekos[$i]}, 0, 9), "\n";
        } else {
            print OUTPUT2 "";
        }
    }
}
From comments:
pastebin.com/7QnBDTDY – povilito May 30 at 11:57
It's too big (15 MB) for pastebin.com – povilito May 30 at 12:01
filedropper.com/duomenys – povilito May 30 at 12:04
I think comparing a "letter" with " " (space) or "" should give us the number -4 – povilito May 30 at 12:28
It's enough to find one title for one unknown sequence. – povilito May 30 at 12:45
So if there are 50 unknown sequences, the output file should give us 50 titles; some titles can be the same :)

Here is a basic version of a solution to your problem, as I understood it. I assumed both sequences are of equal length; you can easily change that if you need to.
#!/usr/bin/perl
use strict;
use warnings;

# convert your matrix file to a 2D hash
my %hash;
open my $fh, '<', 'matrix' or die "unable to open file: $!\n";
chomp(my $heading = <$fh>);
# strip leading whitespace
$heading =~ s/^\s+//;
my @headings = split /\s+/, $heading;
while (<$fh>) {
    chomp;
    my @line = split;
    my $key1 = shift @line;
    foreach my $i (0 .. $#line) {
        $hash{$key1}{$headings[$i]} = $line[$i];
    }
}
close $fh;

# taken from your example for reference
my $duameny = "ABW";
my $duota   = "ABF";
my @duamenys = split //, $duameny;
my @duotas   = split //, $duota;

# calculate the sum from the hash,
# considering both sequences are of equal length
my $sum = 0;
foreach my $i (0 .. $#duamenys) {
    $sum += $hash{$duamenys[$i]}{$duotas[$i]};
}

# calculate the threshold from the sequence length
my $length = (length $duameny) * 3;
print "SUM: $sum, Length: $length\n";
if ($sum >= $length) {
    # probably you know how to print the title
    # print the title from DUOMENYS.txt
}
Below is a summary of my approach:
First, convert the matrix into a 2D hash for lookup.
Calculate the sum from the hash for the sequences.
Calculate the threshold from the sequence length.
Compare them and print the title if sum >= length.
output:
SUM: 9, Length: 9
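To go from this single comparison to the full task, the same scoring can be wrapped in a loop over the two hashes. Below is a sketch: the hash contents are hard-coded stand-ins for %nhash and %nhash1 from the question, equal-length sequences are assumed, and %matrix is again only a fragment of BLOSUM62:

```perl
use strict;
use warnings;

my %matrix = (                            # fragment of the BLOSUM62 2D hash
    'A' => { 'A' => 4 },
    'B' => { 'B' => 4 },
    'W' => { 'F' => 1 },
);
my %known   = ('ABW' => 'TITLE1');        # sequence => title (DUOMENYS.txt)
my %unknown = ('ABF' => 'UNKNOWNTITLE');  # sequence => title (DUOTA.txt)

for my $u (keys %unknown) {
    for my $k (keys %known) {
        next if length($u) != length($k);
        my $sum = 0;
        $sum += $matrix{ substr($k, $_, 1) }{ substr($u, $_, 1) } // 0
            for 0 .. length($k) - 1;
        if ($sum >= length($k) * 3) {
            print "$known{$k}\n";  # prints TITLE1 for this data
            last;                  # one title per unknown sequence is enough
        }
    }
}
```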

Related

Perl - to create and sort hash to produce a sorted output

So I have a script that scrapes data from a website: it downloads a CSV, processes that CSV row by row into a TSV, and once that is finished converts the TSV file into an HTML file. Most of that is done, but the output I'm getting is somewhat messed up. The script goes to different table pages on the source site and downloads a dynamically generated CSV file; that CSV file is turned into a TSV file that is then turned into HTML. The CSV file seems to be sorted only by the first column of each row, not by any of the other columns in the same row. Therefore entries with the same first-column value can be jumbled up from one download to the next download of the same file.
A visual representation of sorting only by the first column (with numbers representing column data):
1st Download:
1-1
1-2
1-3
2-1
2-2
2-3
3-1
3-2
3-3
2nd Download:
1-1
1-3
1-2
2-2
2-1
2-3
3-3
3-2
3-1
So what I have in mind is this: download the CSV file from the source, then sort the lines in that CSV file to normalize them for comparison before writing the TSV or HTML files. This should allow accurate comparison of updated data files. But I don't know how to do this; my logic is as follows:
I will put the sort between steps 1 and 2, so that before the CSV file is processed into a TSV file its content is already sorted.
So my script looks like this:
my $download_dir_link = "C:/Users/jabella/Downloads";
unlink("$download_dir_link/Product Classification List.csv");

#CHECK IF CSV FILE DOWNLOAD IS FINISHED
my $complete_download_flag = 0;
while ($complete_download_flag == 0)
{
    my @download_directory = read_dir($download_dir_link);
    foreach my $downloaded_file (@download_directory)
    {
        if ($downloaded_file =~ /\QProduct Classification List.csv\E/sgi)
        {
            $complete_download_flag = 1;
        }
    }
    sleep(5);
}

#SORT CONTENTS OF CSV BEFORE CONVERSION
print "Sorting csv content...\n";

#CONVERT CSV TO TSV
print "Converting csv to tsv...\n";
my $csv = Text::CSV->new({ binary => 1 });
my $tsv = Text::CSV->new({ binary => 1, sep_char => "\t", eol => "\n" });
open my $infh,  "<:encoding(utf8)", "$download_dir_link/Product Classification List.csv";
open my $outfh, ">:encoding(utf8)", "Product Classification List.tsv";
while (my $row = $csv->getline($infh))
{
    $tsv->print($outfh, $row);
}
close($infh);
close($outfh);

my $tsv_content = "";
open(my $fh, '<', "Product Classification List.tsv");
while (<$fh>)
{
    $tsv_content = $tsv_content . $_;
}
close($fh);
print "Conversion complete! cleaning tsv content...\n";

#CLEAN TSV CONTENT
$tsv_content =~ s/(.*?)\t"(.*?)"\t"(.*?)"\t"(.*?)"\t(.*?)\t"(.*?)"\t(.*)/<tr><th>$1<\/th><th>$2<\/th><th>$3<\/th><th>$4<\/th><th>$5<\/th><th>$6<\/th><th>$7<\/th><\/tr>/gi;
$tsv_content =~ s/"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\n/<tr><td>$1<\/td><td>$2<\/td><td>$3<\/td><td>$4<\/td><td>$5<\/td><td>$6<\/td><td>$7<\/td><\/tr>\n/gi;
$tsv_content =~ s/\"{2}/\"/sgi;
$tsv_content =~ s/(<\/tr>)\n?"/$1/sgi;
$tsv_content =~ s/\s{2,}/ /sgi;
$tsv_content =~ s/.*?(<tr>)/$1/si;
$tsv_content = "<table>\n$tsv_content</table>";
$classification =~ s/_//sgi;
if (exists $existing_index_hash{$doc_uid . "_pind.html"})
{
    if ($existing_index_hash{$doc_uid . "_pind.html"} ne $tsv_content)
    {
        $changed_flag = "1";
        $updated_files = $updated_files . "-$classification\n";
        print "Updated: $classification\n";
Hope someone here can help me with this. Thank you!
Here is a simple script that loads a CSV file specified as an argument and outputs it sorted by the first two columns.
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS;

my $csv = 'Text::CSV_XS'->new({ binary => 1, auto_diag => 1 });
open my $in, '<', shift or die $!;
my @rows;
while (my $row = $csv->getline($in)) {
    push @rows, $row;
}
# Here the sorting happens. Compare the first column;
# if the values are the same, compare the second column.
@rows = sort { $a->[0] cmp $b->[0] || $a->[1] cmp $b->[1] } @rows;
$csv->say(*STDOUT, $_) for @rows;
You can use the following to sort by all columns (but it compares the values as strings; it doesn't work for numbers):
sub by_all {
    my ($n, $A, $B) = @_;
    $A->[$n] cmp $B->[$n]
        || $n < $#$A && by_all($n + 1, $A, $B)
}
sort { by_all(0, $a, $b) } @rows;
To make it work for numbers, too, you can let Perl guess what is a number:
use Scalar::Util qw{ looks_like_number };
sub by_all {
    my ($n, $A, $B) = @_;
    (looks_like_number($A->[$n])
        ? $A->[$n] <=> $B->[$n]
        : $A->[$n] cmp $B->[$n]
    ) || $n < $#$A && by_all($n + 1, $A, $B)
}
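To slot the sort into the original flow, the rows just need to be collected, sorted, and only then written out as TSV. The sketch below uses plain split on made-up, quote-free sample data so it is self-contained; real data with quoted fields should keep going through Text::CSV as in the script above:

```perl
use strict;
use warnings;

# Sample rows as they might come out of the downloaded CSV (made up)
my @lines = ("b,2\n", "a,1\n", "b,1\n");

# Parse into array-of-arrays, like Text::CSV's getline() would produce
my @rows = map { [ split /,/ ] } map { s/\n\z//r } @lines;

# Normalize the order: first column, then second column
@rows = sort { $a->[0] cmp $b->[0] || $a->[1] cmp $b->[1] } @rows;

# Only now emit the TSV lines
my @tsv = map { join("\t", @$_) . "\n" } @rows;
print @tsv;
```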

Compare strings and remove more general pattern in Perl

I have an array of strings containing numbers, possibly separated by forward slashes, e.g. 754 or 754/128. These strings can have arbitrary length; something such as 1234/34/21/120/3 is possible. In the array I want to keep only the more specialised patterns, i.e. those that contain other patterns. For instance, in the first example above, 754/128 contains 754, so 754 can be removed from the array.
This concept of containment is as broad as one would expect, maybe even broader: it is similar to how you look at a directed graph where each slash in the pattern is a step forward. The contained pattern can be of arbitrary length as long as it occurs inside the containing pattern in one way or another. This means that the small path can occur in any (chronologically correct) form. For instance, 903/900 is contained in 903/902/900 even though the pattern is 'split open'. One way to visualise this is: in the small path we go from point A to point B. In the larger path we also go from point A to B, but we make a stop at C. The larger path visits more places than the small path without missing out on anything. As a consequence, the smaller path can occur in any split-up form, as long as the order of the path is respected. As an example:
2/5 - 1/2/3/4/5
# included
5/2 - 1/2/3/4/5
# not included
What I mean here is that the position of the 'contained' items should be identical in the large path. For instance: 1/3/2 'matches' in 1/5/3/4/2 because the order is identical in the small and large path: 1 comes in a position before 3, which in turn is in some position before 2. 1/2/3, 2/1/3 and so on would NOT match the larger path 1/5/3/4/2 even though they are valid paths with the same items. This is because the order of occurrence is different.
The above example also illustrates that the items in the small pattern can occur anywhere in the large pattern, not only in the first and last position or in subsequent positions. In other words, all included paths for 1/2/3/4 are:
1/2
1/2/3
1/3
1/4
2/3
2/3/4
2/4
3/4
I am looking for an efficient way to remove paths in a given array that are included in others from that same array.
I got this far, but I'm not sure how to efficiently check the containment relation between two items.
#!/usr/bin/perl
my @arr = ("903/900", "903/902/900", "903/904/902/901", "903/904/902/908/900", "903");
my @res = ();
OUTER: for (my $i = 0; $i < @arr; $i++) {
    my $first = $arr[$i];
    my $j = $i + 1;
    INNER: while ($j < @arr) {
        my $second = $arr[$j];
        &compare_paths($first, $second);
        $j++;
    }
}
sub compare_paths {
    my ($first, $second) = @_;
    my @first_items  = split(/\//, $first);
    my @second_items = split(/\//, $second);
    # Compare values from 1 and 2
}
The expected output for the code above would be
@res = ("903/904/902/901", "903/904/902/908/900");
Reasons for removal:
903/900 contained in 903/902/900
903/902/900 contained in 903/904/902/908/900
903 contained in 903/904/902/901
How can I implement such an algorithm efficiently? My main idea would be to check whether an item of @first_items is present in $second; if not, continue; but if so, check whether a second item is also present, and if so, check its substring position, which has to be larger than the first item's substring position. Continue for each item (and the other way around for @second_items and $first) until all strings are matched. (If it helps for speed, the initial array can be swapped for a hash with the former array as the keys.)
I expect that there are general algorithms that can account for this problem, and probably libraries that can be leveraged. However, here is a hand-rolled one.
First, we sort the array by the number of terms in the path. Then we go up that array, comparing each element with all longer ones. This way each path is excluded at the earliest opportunity.
The comparison is between arrays obtained by splitting on /. It checks whether all elements of the smaller array appear in the larger one as an exact sub-sequence, so that the larger one would yield the smaller one by only removing elements (without rearrangement).
use warnings;
use strict;
my @arr = qw(902/904 903/900 903/902/900 903/904/902/901
             903/904/902/908/900 903);

my @sorted = sort { (split '/', $a) <=> (split '/', $b) } @arr;

my @primes;
OUTER:
for my $i (0..$#sorted) {
    for my $j ($i+1..$#sorted) {
        next OUTER if is_contained($sorted[$i], $sorted[$j]);
    }
    push @primes, $sorted[$i];
}
print "@primes\n";
sub is_contained
{
    my ($small, $large) = @_;
    my @small = split '/', $small;
    my @large = split '/', $large;
    # There can be no duplicates so equal-length paths are distinct
    return 0 if @small == @large;
    # Indices of elements of @small in @large cannot decrease
    my ($match, $index) = (0, 0);
    for my $sm (@small) {
        for my $i (0..$#large) {
            $sm == $large[$i] || next;
            return 0 if $i < $index;  # out of order
            $index = $i;
            $match = 1;
            last;
        }
        return 0 if not $match;  # $sm from @small not in @large
        $match = 0;
    }
    return 1;
}
Prints the line: 902/904 903/904/902/901 903/904/902/908/900.
A note on how we check that @small matches a subsequence in @large.
Once a @small element is found in @large, its index in @large
cannot be lower than the one found previously: an element must come after the previous one, not before. See a different procedure below.
So with 2/7/5 and 1/2/5/7/8, first 2 is found at index 1, then 7 at index 3, then 5, but at index 2. The subsequence 2-5-7 does not match 2-7-5. I added 902/904 to the data to test for this.
This is an alternate procedure for checking whether a path is contained in another.
Once it finds an element of @small in @large, it searches for the next one starting from the next index in @large. This way it skips the already-searched part of the path, but it cannot detect out-of-order elements early either.
With the example of 2/7/5 and 1/2/5/7/8, after it finds 7 at index 3 it starts from index 4 and detects failure by not finding a 5 in the rest of the target path.
sub is_contained_2
{
    my @large = split '/', $_[0];
    my @small = split '/', $_[1];
    # Is @small found in @large as an exact sub-sequence?
    my ($match, $j) = (0, 0);
    for my $sm (@small) {
        for my $i ($j..$#large) {
            $sm == $large[$i] || next;
            $j = $i+1, $match = 1;
            last;
        }
        return 0 if not $match;
        $match = 0;
    }
    return 1;
}
This is slower (by 10-15%) for this data set; see the benchmark below.
I benchmarked the two array-based versions here and ikegami's regex+trie. So far I have used only the specific data set from the question, with 902/904 added.
use warnings;
use strict;
use Benchmark qw(cmpthese);

my $secs_to_run = shift || 10;
my @arr = ('902/904', '903/900', '903/902/900', '903/904/902/901',
           '903/904/902/908/900', '903');

# sorted array checked shorter-to-longer, manual iterations
sub contained {
    my ($rarr) = @_;
    my @arr = @$rarr;
    # program copied from this post
    return \@primes;
}
sub is_contained { ... }  # copied

# Same program, but using is_contained_2()
sub contained_2 { ... }
sub is_contained_2 { ... }

# Regex-trie, copied from ikegami's post
sub add { my $p = \shift; $p = \( $$p->{$_} ) for @_, ''; }
sub as_pat { my $trie = shift; ... }  # copied
sub regex_trie {
    my ($rpaths) = @_;
    my @paths = @$rpaths;
    # program copied from ikegami's post
    return \@filtered_paths;
}

cmpthese(-$secs_to_run, {
    containted => sub { my $rprimes   = contained(\@arr) },
    cont_next  => sub { my $rprimes   = contained_2(\@arr) },
    regex_trie => sub { my $rfiltered = regex_trie(\@arr) },
});
With bench_cont.pl 300, on a newer workstation-laptop (2.5GHz) with v5.16
Rate regex_trie cont_next containted
regex_trie 15264/s -- -15% -27%
cont_next 17946/s 18% -- -14%
containted 20939/s 37% 17% --
on an older server (2.8GHz) with v5.16
Rate regex_trie cont_next containted
regex_trie 11750/s -- -13% -27%
cont_next 13537/s 15% -- -16%
containted 16042/s 37% 19% --
on an older server (3.5GHz) with v5.10
Rate cont_next regex_trie containted
cont_next 12266/s -- -17% -17%
regex_trie 14832/s 21% -- -0%
containted 14845/s 21% 0% --
This surprised me, as I expected the regex-based solution to be fastest.
I expect the trend to reverse for data composed of longer paths, having more distinct (not contained) paths, with containment found later in the path, and with a few out-of-order dismissals.
I'll add tests once I get to generate such data, or once it is provided.
To trace some of the processing, change the main loop body to
use feature 'say';
OUTER:
for my $i (0..$#sorted) {
    say "Check $sorted[$i]";
    for my $j ($i+1..$#sorted) {
        my $is_inside = is_contained($sorted[$i], $sorted[$j]);
        say "\t$is_inside: $sorted[$i] vs. $sorted[$j]";
        next OUTER if $is_inside;
    }
    push @primes, $sorted[$i];
}
say "\nNot contained: @primes";
This prints
Check 903
0: 903 vs. 902/904
1: 903 vs. 903/900
Check 902/904
0: 902/904 vs. 903/900
0: 902/904 vs. 903/902/900
0: 902/904 vs. 903/904/902/901
0: 902/904 vs. 903/904/902/908/900
Check 903/900
1: 903/900 vs. 903/902/900
Check 903/902/900
0: 903/902/900 vs. 903/904/902/901
1: 903/902/900 vs. 903/904/902/908/900
Check 903/904/902/901
0: 903/904/902/901 vs. 903/904/902/908/900
Check 903/904/902/908/900
Not contained: 902/904 903/904/902/901 903/904/902/908/900
To optimize what you have so far, I'd suggest that you pre-split all elements of the array (and then re-merge them later):
@arr = map [split "/", $_], @arr;
With that done:
sub contains(\@\@) {
    my ($larger_ref, $smaller_ref) = @_;
    return '' if @$larger_ref <= @$smaller_ref;
    my ($i, $j) = (0, 0);
    while ($i < @$larger_ref && $j < @$smaller_ref) {
        ++$j if $larger_ref->[$i] == $smaller_ref->[$j];
        ++$i;
    }
    return $j == @$smaller_ref;
}

I: for (my $i = 0; $i < @arr; ++$i) {
    J: for (my $j = 0; $j < @arr; ++$j) {
        next J if $j == $i;
        next I if contains @{$arr[$j]}, @{$arr[$i]};
    }
    push @res, join '/', @{$arr[$i]};
}
There are some potential further optimizations you can make in contains (for example, it might make sense to abort early if/when @$larger_ref - $i < @$smaller_ref - $j), but you'd want to test: they might turn out to be pessimizations.
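The early-abort idea mentioned above could look like this (a sketch of the suggestion, not benchmarked; it bails out as soon as the remaining larger-path elements can no longer cover the remaining smaller-path elements):

```perl
use strict;
use warnings;

# Returns true if @$smaller_ref occurs in @$larger_ref as an
# in-order subsequence; aborts early once a match became impossible.
sub contains_early_abort {
    my ($larger_ref, $smaller_ref) = @_;
    return '' if @$larger_ref <= @$smaller_ref;
    my ($i, $j) = (0, 0);
    while ($i < @$larger_ref && $j < @$smaller_ref) {
        # not enough elements left in the larger path to match the rest
        return '' if @$larger_ref - $i < @$smaller_ref - $j;
        ++$j if $larger_ref->[$i] == $smaller_ref->[$j];
        ++$i;
    }
    return $j == @$smaller_ref;
}

my @large = split '/', '903/904/902/908/900';
my @small = split '/', '903/902/900';
print contains_early_abort(\@large, \@small) ? "contained\n" : "not contained\n";
# prints "contained"
```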
If a/b/c is path, you want to remove the paths for which the following is true:
"/$path" =~ m{ ^ (?:/a)?+ (?:/b)?+ (?:/c)?+ \z }x && $path ne 'a/b/c'
That can also be written as
"/$path" =~ m{ ^ (?:/a)?+ (?:/b)?+ (?:/c)?+ \z (?<! ^ /a/b/c ) }x
If a/b/c, a/i/j, a/x/y and d/e/f are all paths, you want to remove the paths for which the following is true:
"/$path" =~ m{
^ (?:/a)?+ (?:/b)?+ (?:/c)?+ \z (?<! ^ /a/b/c )
| ^ (?:/a)?+ (?:/i)?+ (?:/j)?+ \z (?<! ^ /a/i/j )
| ^ (?:/a)?+ (?:/x)?+ (?:/y)?+ \z (?<! ^ /a/x/y )
| ^ (?:/d)?+ (?:/e)?+ (?:/f)?+ \z (?<! ^ /d/e/f )
}x
We removed a lot of the backtracking by using possessive quantifiers (?+), but there is still a possibility of backtracking because of the common prefixes. So let's remove them!
"/$path" =~ m{
^
(?: (?:/a)?+ (?: (?:/b)?+ (?:/c)?+ \z (?<! ^ /a/b/c )
| (?:/i)?+ (?:/j)?+ \z (?<! ^ /a/i/j )
| (?:/x)?+ (?:/y)?+ \z (?<! ^ /a/x/y )
)
| (?:/d)?+ (?:/e)?+ (?:/f)?+ \z (?<! ^ /d/e/f )
)
}x
Now we have an efficient solution!
The following uses a trie to remove the common prefixes.
use strict;
use warnings;
use feature qw( say );

sub add {
    my $p = \shift;
    $p = \( $$p->{$_} ) for @_, '';
}

sub as_pat {
    my $trie = shift;
    my @sub_pats =
        map { $_ eq '' ? '' : $_ . as_pat($trie->{$_}) }
        keys(%$trie);
    if (@sub_pats == 1) {
        return $sub_pats[0];
    } else {
        return '(?:'.join('|', @sub_pats).')';
    }
}

my @paths = ( "903/900", "903/902/900", "903/904/902/901", "903/904/902/908/900", "903" );

my $trie;
add($trie, map({ "(?:/\Q$_\E)?+" } split qr{/}), "\\z(?<!^/\Q$_\E)" )
    for @paths;

my $pat = as_pat($trie);
my $re = qr{^(?:$pat)};

my @filtered_paths = grep { "/$_" !~ /$re/ } @paths;
say for @filtered_paths;
Say N is the number of paths, and O(M) bounds the length of the paths. Like the earlier answers, this one completes in O(N^2 * M^2) time, so it doesn't scale any better. However, I believe you will find mine faster should you benchmark them.
Say N is the number of paths, and O(M) bounds the length of the paths; the earlier answers complete in O(N^2 * M^2) time.
The following solution is O(N * 2^M). This means it can handle very large numbers of paths much more efficiently, as long as the paths are rather short (because it effectively becomes O(N) instead of O(N^2)). It does require far more memory than the solutions in the earlier answers.
use strict;
use warnings;
use feature qw( say );

sub fetch {
    my $trie = shift;
    for (@_, '') {
        return () if !$trie;
        $trie = $trie->{$_};
    }
    return $trie;
}

sub add {
    local *_helper = sub {
        my $trie_ptr = \shift;
        my $exact = shift;
        if (@_) {
            my $lead = shift(@_);
            _helper($$trie_ptr->{$lead}, $exact, @_);
            _helper($$trie_ptr, 0, @_);
        } else {
            if ($exact) {
                $$trie_ptr->{''} ||= 1;
            } else {
                $$trie_ptr->{''} = 2;
            }
        }
    };
    my $trie_ptr = \shift;
    return _helper($$trie_ptr, 1, @_);
}

my @paths = ( "903/900", "903/902/900", "903/904/902/901", "903/904/902/908/900", "903" );

my @split_paths = map [ split qr{/} ], @paths;

my $trie;
add($trie, @$_)
    for @split_paths;

use Data::Dumper qw( Dumper );
local $Data::Dumper::Sortkeys = 1;
print(Dumper($trie));

my @filtered_paths =
    map join('/', @$_),
    grep { fetch($trie, @$_) == 1 }
    @split_paths;

say for @filtered_paths;

Need simple parallelism example in Perl 6

I am trying to learn Perl 6 and parallelism/concurrency at the same time.
For a simple learning exercise, I have a folder of 550 '.htm' files and I want the total number of lines across all of them. So far, I have this:
use v6;
my $start_time = now;
my $exception;
my $total_lines = 0;
my @files = "c:/testdir".IO.dir(test => / '.' htm $/);
for @files -> $file {
    $total_lines += $file.lines.elems;
    CATCH {
        default { $exception = $_; } # some of the files error out for malformed utf-8
    }
}
say $total_lines;
say now - $start_time;
That gives a sum of 577,449 in approximately 3 seconds.
How would I rewrite that to take advantage of Perl 6 parallelism ideas? I realize the time saved won't be much but it will work as proof of concept.
Update: implementing Christoph's suggestion. The count is slightly higher than in my original post because I'm now able to read the malformed UTF-8 files using the latin1 encoding.
use v6;
my $start_time = now;
my @files = "c:/iforms/live".IO.dir(test => / '.' htm $/);
my $total_lines = [+] @files.race.map(*.lines(:enc<latin1>).elems);
say $total_lines;
say now - $start_time;
use v6;
my $start_time = now;
my $exception;
my @lines;
my $total_lines = 0;
my @files = "c:/testdir".IO.dir(test => / '.' htm $/);
await do for @files -> $file {
    start {
        @lines.push( $file.lines.elems );
        CATCH {
            default { $exception = $_; } # some of the files error out for malformed utf-8
        }
    }
}
$total_lines = [+] @lines;
say $total_lines;
say now - $start_time;

How to break an infinite loop under redo in Perl (generating random numbers)

I intend to generate random numbers with the following steps:
Read the data from a file (<DATA>)
Generate as many random numbers as there are input data lines
No random number should be generated twice,
e.g. if the random number generated in loop iteration x has been created
before, then recreate the random number.
Here is the code I have, which leads to an infinite loop.
What's wrong with my logic, and how can I fix it?
#!/usr/bin/perl -w
use strict;
my %chrsize = ('chr1' => 249250621);

# As an example I have created the
# repository where a value has been inserted.
my %done = ("chr1 182881372" => 1);

while ( <DATA> ) {
    chomp;
    next if (/^\#/);
    my ($chr, $pos) = split(/\s+/, $_);

    # this number has been generated before
    # with this: int(rand($chrsize{$chr}));
    # hence have to create one other than this
    my $newst = 182881372;
    my $newpos = $chr . "\t" . $newst;

    # recreate random number
    for (0...10) {
        if ( $done{$newpos} ) {
            # INFINITE LOOP
            $newst = int(rand($chrsize{$chr}));
            redo;
        }
    }
    $done{$newpos} = 1;
    print "$newpos\n";
}
__DATA__
# In reality there are 20M of such lines
# name positions
chr1 157705682
chr1 19492676
chr1 169660680
chr1 226586538
chr1 182881372
chr1 11246753
chr1 69961084
chr1 180227256
chr1 141449512
You had a couple of errors:
You were setting $newst to a constant within your loop, so $newpos never took on a new value.
Your inner for loop didn't make sense, because you never actually changed $newpos before checking the condition again.
redo was working on the inner loop, not the outer one.
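That last point is worth seeing in isolation. Here is a tiny standalone example (made up for illustration) of how redo restarts the current innermost loop block without re-evaluating the loop condition:

```perl
use strict;
use warnings;

my $tries = 0;
for my $i (1 .. 3) {
    $tries++;
    # redo jumps back to the top of this for block (for the same $i),
    # so without the counter guard this would loop forever
    redo if $tries < 5 && $i == 1;
}
print "$tries\n";  # 7: the $i == 1 iteration ran 5 times, $i == 2 and 3 once each
```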
Here is a corrected version that avoids redo altogether.
Update: I edited the algorithm a bit to make it simpler.
#!/usr/bin/perl -w
use strict;
my $chr1size = 249250621;
my %done;
my $newst;
while ( <DATA> ) {
    chomp;
    next if (/^\#/);
    my ($chr, $pos) = split(/\s+/, $_);
    my $newpos;
    # This will always run at least once
    do {
        $newst = int(rand($chr1size));
        $newpos = $chr . "\t" . $newst;
    } while ( $done{$newpos} );
    $done{$newpos} = 1;
    print "$newpos\n";
}
Update 2: while the above algorithm will work, it will get really slow on 20,000,000 lines. Here is an alternative approach that should be faster (there is sort of a pattern to the random numbers it generates, but it would probably be OK for most situations).
#!/usr/bin/perl -w
use strict;
use List::Util qw/shuffle/;

my $newst;
# make sure you have enough. This is good if you have < 100,000,000 lines.
my @rand_pieces = shuffle (0..10000);
my $pos1 = 0;
my $offset = 1;

while ( <DATA> ) {
    chomp;
    next if (/^\#/);
    my ($chr, $pos) = split(/\s+/, $_);
    $newst = $rand_pieces[$pos1] * 10000 + $rand_pieces[($pos1 + $offset) % 10000];
    my $newpos = $chr . "\t" . $newst;
    $pos1++;
    if ($pos1 > $#rand_pieces)
    {
        $pos1 = 0;
        $offset = ++$offset % 10000;
        if ($offset == 1) { die "Out of random numbers!"; }
    }
    print "$newpos\n";
}
Add a counter to your loop like this:
my $counter = 0;
# recreate random number
for (0...10) {
    if ( $done{$newpos} ) {
        $newst = int(rand($chrsize{$chr}));
        redo if ++$counter < 100; # safety counter
        # once the counter runs out, execution continues here
        # and eventually falls out of the loop
    }
}
To get rid of the infinite loop, replace redo with next.
http://www.tizag.com/perlT/perlwhile.php :
"Redo will execute the same iteration over again."
Then you may need to fix the rest of the logic ;).

How to iterate through range of Dates?

In my script, I need to iterate through a range of dates given the start date and end date. How can I do this in Perl?
Use the DateTime module. Here is a simple example which lists the ten previous days:
use 5.012;
use warnings;
use DateTime;

my $end = DateTime->now;
my $day = $end->clone->subtract( days => 10 ); # ten days ago

while ($day < $end) {
    say $day;
    $day->add( days => 1 ); # move along to next day
}
Update (after seeing your comment/update):
To parse a date string, look at the DateTime::Format modules on CPAN.
Here is an example using DateTime::Format::DateParse, which does parse YYYY/MM/DD:
use DateTime::Format::DateParse;
my $d = DateTime::Format::DateParse->parse_datetime( '2010/06/23' );
One easy approach is to use the Date::Simple module, which makes use of operator overloading:
use strict;
use warnings;
use Date::Simple;

my $date    = Date::Simple->new('2010-01-01'); # stores a Date::Simple object
my $endDate = Date::Simple->today;             # today's date

while ( ++$date < $endDate ) {
    print $endDate - $date, " day",
          ( ($endDate - $date) == 1 ? '' : 's' ), " ago\n";
}
use feature qw( say );
use DateTime::Format::Strptime qw();

my $start = DateTime::Format::Strptime->new(pattern => '%Y/%m/%d')->parse_datetime('2010/08/16');
my $end   = DateTime::Format::Strptime->new(pattern => '%Y/%m/%d')->parse_datetime('2010/11/24');

while ($start < $end) {
    $start->add(days => 1);
    say $start->ymd('/');
}
I like to use the fact that strftime will normalize the date for me:
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw/strftime/;

my $start = "2010/08/16";
my $end   = "2010/09/16";

my @time = (0, 0, 0);
my ($y, $m, $d) = split "/", $start;
$y -= 1900;
$m--;

my $offset = 0;
while ((my $date = strftime "%Y/%m/%d", @time, $d + $offset, $m, $y) le $end) {
    print "$date\n";
} continue {
    $offset++;
}
You can try Date::Calc::Iterator
# This puts all the dates from Dec 1, 2003 to Dec 10, 2003 in @dates1
# @dates1 will contain ([2003,12,1],[2003,12,2] ... [2003,12,10])
my $i1 = Date::Calc::Iterator->new(from => [2003,12,1], to => [2003,12,10]);
my @dates1;
push @dates1, $_ while $_ = $i1->next;
If installing extra Perl modules is not preferable, one can use this approach, based on the core POSIX library:
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

# CREATE CALENDAR
my @Calendar = ();
my $years = 3;
my @Now = localtime(); # an array of 9 date-time parameters

for my $count ( 0 .. ( 365 * $years ) ) {
    # If the date is January 1st, a manual shift to December 31st is needed,
    # because days (yday) are counted from January 1st and never shift back one year.
    if ( $Now[4] == 0 && $Now[3] == 1 ) {
        unshift @Calendar, strftime( "%Y-%m-%d", @Now );
        $Now[5]--;     # reduce the year (element 5) by one
        $Now[4] = 11;  # set the month (element 4) to December
        $Now[3] = 31;  # set the day of month (element 3) to the 31st
    } else {
        unshift @Calendar, strftime( "%Y-%m-%d", @Now );
        $Now[3]--;
    }
}

# Print out.
for my $day (@Calendar) {
    print "$day\n";
}
Perl has a rich array of time and date manipulation modules, as seen here:
http://datetime.perl.org/?Modules
And there are some examples of date and time problems there as well.
With Perl, there's always more than one way to do it.
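For completeness, the core Time::Piece module (bundled with Perl since 5.10) can also iterate a date range without any CPAN installs. A sketch, with an arbitrary made-up range:

```perl
use strict;
use warnings;
use Time::Piece;
use Time::Seconds qw( ONE_DAY );

my $day = Time::Piece->strptime('2010/08/16', '%Y/%m/%d');
my $end = Time::Piece->strptime('2010/08/20', '%Y/%m/%d');

my @dates;
while ($day <= $end) {
    push @dates, $day->ymd('/');  # e.g. "2010/08/16"
    $day += ONE_DAY;              # Time::Piece overloads + for seconds
}
print "$_\n" for @dates;
```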
