Algorithm for a numeric profile of a string

I have a few files similar to the one below, and I am trying to do numeric profiling as shown in the image.
>File Sample
attttttttttttttacgatgccgggggatgcggggaaatttccctctctctctcttcttctcgcgcgcg
aaaaaaaaaaaaaaagcgcggcggcgcggasasasasasasaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
I have to take each substring of size 2, map it to 33 values for different properties, and then add the values over a window of size 5.
my %temp = (
    aCount => {
        aa => 2,
    },
    cCount => {
        aa => 0,
    },
);
My current implementation is as below:
while (<FILE>) {
    my $line = $_;
    chomp $line;
    while ($line =~ /(.{2})/og) {
        $subStr = $1;
        if (exists $temp{aCount}{$subStr}) {
            push @{$temp{aCount_array}}, $temp{aCount}{$subStr};
            if (scalar(@{$temp{aCount_array}}) == $WINDOW_SIZE) {
                my $sum = eval (join('+', @{$temp{aCount_array}}));
                shift @{$temp{aCount_array}};
                # A similar approach is taken for the other 33 rules
            }
        }
        if (exists $temp{cCount}{$subStr}) {
            # similar approach
        }
        $line =~ s/.{1}//og;
    }
}
Is there any other approach to increase the speed of the overall process?

Regular expressions are awesome, but they can be overkill when all you need are fixed-width substrings. Alternatives are substr
$len = length($line);
for ($i = 0; $i < $len; $i += 2) {
    $subStr = substr($line, $i, 2);
    ...
}
or unpack
foreach $subStr (unpack "(A2)*", $line) {
    ...
}
I don't know how much faster either of these will be than regular expressions, but I know how I would find out.
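For example, the three approaches can be compared directly with the core Benchmark module; a sketch along these lines (made-up test string, window bookkeeping omitted):

use Benchmark qw(cmpthese);

my $line = 'acgt' x 5000;    # made-up test data

cmpthese(-3, {
    regex  => sub { my @p; push @p, $1 while $line =~ /(.{2})/g; },
    substr => sub {
        my @p;
        for (my $i = 0; $i < length $line; $i += 2) {
            push @p, substr $line, $i, 2;
        }
    },
    unpack => sub { my @p = unpack '(A2)*', $line; },
});

Separately, the eval(join('+', ...)) in the question recompiles Perl code for every window; a running total that adds each value as it is pushed and subtracts the value that shift removes (or List::Util's sum) should be noticeably cheaper, e.g.:

$win_sum += $temp{aCount}{$subStr};          # when pushing onto the window
$win_sum -= shift @{$temp{aCount_array}};    # when the window is full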

Related

Perl - to create and sort hash to produce a sorted output

So I have this script that scrapes data from a website: it downloads a CSV and processes it row by row, converting it into a TSV; once that is finished, the TSV file is converted into an HTML file. I'm done with the rest of that, but the output I'm getting is somewhat messed up. The script goes to different table pages on the source site and downloads a dynamically generated CSV file; that CSV file is then turned into a TSV file that we then turn into HTML. The CSV file seems to be sorted by the first column of each row, but not by any of the other columns in the same row. Therefore, entries with the same first-column value can be jumbled up from one download to the next download of the same file.
A visual representation of sorting only by the first column follows, with numbers representing column data:
1st Download:
1-1
1-2
1-3
2-1
2-2
2-3
3-1
3-2
3-3
2nd Download:
1-1
1-3
1-2
2-2
2-1
2-3
3-3
3-2
3-1
So what I have in mind is a process like this: download the CSV file from the source, then perform a sort on the lines in that CSV file to normalize them for comparison before writing the TSV or HTML files. This should allow accurate comparison of updated data files. But I didn't know how to do this; my logic is like this:
I will put the function between steps 1 and 2: before the script processes the CSV file into a TSV file, I want the content of the CSV to already be sorted.
So my script looks like this:
my $download_dir_link = "C:/Users/jabella/Downloads";
unlink("$download_dir_link/Product Classification List.csv");

# CHECK IF CSV FILE DOWNLOAD IS FINISHED
my $complete_download_flag = 0;
while ($complete_download_flag == 0)
{
    my @download_directory = read_dir($download_dir_link);
    foreach my $downloaded_file (@download_directory)
    {
        if ($downloaded_file =~ /\QProduct Classification List.csv\E/sgi)
        {
            $complete_download_flag = 1;
        }
    }
    sleep(5);
}

# SORT CONTENTS OF CSV BEFORE CONVERSION
print "Sorting csv content...\n";

# CONVERT CSV TO TSV
print "Converting csv to tsv...\n";
my $csv = Text::CSV->new({ binary => 1 });
my $tsv = Text::CSV->new({ binary => 1, sep_char => "\t", eol => "\n" });
open my $infh,  "<:encoding(utf8)", "$download_dir_link/Product Classification List.csv";
open my $outfh, ">:encoding(utf8)", "Product Classification List.tsv";
while (my $row = $csv->getline($infh))
{
    $tsv->print($outfh, $row);
}
close($infh);
close($outfh);

my $tsv_content = "";
open(my $fh, '<', "Product Classification List.tsv");
while (<$fh>)
{
    $tsv_content = $tsv_content . $_;
}
close($fh);
print "Conversion complete! cleaning tsv content...\n";

# CLEAN TSV CONTENT
$tsv_content =~ s/(.*?)\t"(.*?)"\t"(.*?)"\t"(.*?)"\t(.*?)\t"(.*?)"\t(.*)/<tr><th>$1<\/th><th>$2<\/th><th>$3<\/th><th>$4<\/th><th>$5<\/th><th>$6<\/th><th>$7<\/th><\/tr>/gi;
$tsv_content =~ s/"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\n/<tr><td>$1<\/td><td>$2<\/td><td>$3<\/td><td>$4<\/td><td>$5<\/td><td>$6<\/td><td>$7<\/td><\/tr>\n/gi;
$tsv_content =~ s/\"{2}/\"/sgi;
$tsv_content =~ s/(<\/tr>)\n?"/$1/sgi;
$tsv_content =~ s/\s{2,}/ /sgi;
$tsv_content =~ s/.*?(<tr>)/$1/si;
$tsv_content = "<table>\n$tsv_content</table>";
$classification =~ s/_//sgi;
if (exists $existing_index_hash{$doc_uid . "_pind.html"})
{
    if ($existing_index_hash{$doc_uid . "_pind.html"} ne $tsv_content)
    {
        $changed_flag = "1";
        $updated_files = $updated_files . "-$classification\n";
        print "Updated: $classification\n";
Hope someone here can help me with this, thank you.
Here is a simple script that loads a CSV file specified as an argument and outputs it sorted by the first two columns.
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS;

my $csv = 'Text::CSV_XS'->new({ binary => 1, auto_diag => 1 });
open my $in, '<', shift or die $!;

my @rows;
while (my $row = $csv->getline($in)) {
    push @rows, $row;
}

# Here the sorting happens. Compare the first column;
# if the values are the same, compare the second column.
@rows = sort { $a->[0] cmp $b->[0] || $a->[1] cmp $b->[1] } @rows;

$csv->say(*STDOUT, $_) for @rows;
You can use the following to sort by all columns (but it compares the values as strings, so it doesn't work for numbers):
sub by_all {
    my ($n, $A, $B) = @_;
    $A->[$n] cmp $B->[$n]
        || $n < $#$A && by_all($n + 1, $A, $B)
}

sort { by_all(0, $a, $b) } @rows;
To make it work for numbers, too, you can let Perl guess what is a number:
use Scalar::Util qw( looks_like_number );

sub by_all {
    my ($n, $A, $B) = @_;
    (looks_like_number($A->[$n])
        ? $A->[$n] <=> $B->[$n]
        : $A->[$n] cmp $B->[$n]
    ) || $n < $#$A && by_all($n + 1, $A, $B)
}
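Applied to the conversion step in the question, the sort would slot in between reading and writing; a sketch reusing the question's file handles and Text::CSV objects (getline_all reads all remaining rows at once):

my $header = $csv->getline($infh);              # keep the header row on top
my @rows   = @{ $csv->getline_all($infh) };

# normalize the row order before writing the TSV
@rows = sort { $a->[0] cmp $b->[0] || $a->[1] cmp $b->[1] } @rows;

$tsv->print($outfh, $_) for $header, @rows;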

comparing 2 data sets possibly with concurrency/asynchronous/parallel approach

I am currently trying to improve upon an existing mechanism (to compare data from 2 sources, implemented in perl5) and would like to use perl6 instead.
My target data volume range is about 20-30 GB in uncompressed flat files.
In terms of lines, a file can contain anywhere from 18 million to 28 million lines.
It has around 40-50 columns per line.
I do this type of data reconciliation on a daily basis; it takes about ~10 minutes to read one file and populate the hash, so ~20 minutes to read both files and populate the hashes.
The comparison process takes about ~30-50 minutes, including iterating over the hashes, collecting the desired results, and writing them to an output file (CSV, PSV).
All in all, it can take anywhere from 30 to 60 minutes on a 32-core dual Xeon CPU server with 256 GB of RAM, including intermittent server load, to perform the process.
Now I am trying to bring down the total processing time even further.
Here is my current single-threaded approach using perl5:
Fetch data from the 2 sources (let's say s1 and s2) one by one and populate my hash based on key-value pairs. The source of data could be either a flat CSV or PSV file, or a database query array-of-array result via a DBI client. The data is always unsorted to start with.
To be specific, I read the file line by line, split the fields, choose the desired indexes for the key/value pair, and insert them into the hash.
After collecting the data and populating the hash with the desired key/value pairs, I start to compare and collect results (mainly comparing what is missing or different in s2 w.r.t. s1 and vice versa).
Dump the output to an Excel file (very costly if the number of lines is large, ~1 million or greater) or to a simple CSV (a cheap operation; the preferred method).
I was wondering whether I could somehow do the first step in parallel, i.e. collect data from both sources at once and populate my global hash, and then proceed to compare and dump the output.
What options can perl6 provide to deal with this situation? I have read about concurrency, asynchronous and parallel operations using perl6 but I am not so certain which one can help me here.
I would really appreciate any general guidance on the matter. I hope I explained my problem well, but sadly I don't have much to show for what I have tried till now; the reason is that I am just beginning to tackle this one. I am just unable to see past the single-threaded approach and need some help.
Thanks.
EDIT
As my existing problem statement has been deemed by the community to be 'too broad', allow me to attempt to highlight my pain points below:
I would like to do file comparison by utilizing all 32 cores if possible. I am just not able to come up with a strategy or initial idea.
What new techniques are available or applicable with perl6 in order to tackle this problem or type of problem?
If I spawn 2 processes to read the file(s) and collect data, is it possible to get the results back as an array or hash?
Is it possible to compare the data (stored in hash) in parallel?
My current p5 comparison logic is shown below for your reference. I hope this helps, and keeps this question from being shut down.
package COMP;

use strict;
use Data::Dumper;

sub comp
{
    my ($data, $src, $tgt) = @_;
    my $result = {};
    my $ms   = ($result->{ms}   = {});
    my $mt   = ($result->{mt}   = {});
    my $diff = ($result->{diff} = {});

    foreach my $key (keys %{$data->{$src}})
    {
        my $src_val = $data->{$src}{$key};
        my $tgt_val = $data->{$tgt}{$key};
        next if ($src_val eq $tgt_val);
        if (!exists $data->{$tgt}{$key}) {
            push(@{$mt->{$key}}, "$src_val|NULL");
        }
        if (exists $data->{$tgt}{$key} && $src_val ne $tgt_val) {
            push(@{$diff->{$key}}, "$src_val|$tgt_val");
        }
    }

    foreach my $key (keys %{$data->{$tgt}})
    {
        my $src_val = $data->{$src}{$key};
        my $tgt_val = $data->{$tgt}{$key};
        next if ($src_val eq $tgt_val);
        if (!exists $data->{$src}{$key}) {
            push(@{$ms->{$key}}, "NULL|$tgt_val");
        }
    }

    return $result;
}

1;
If someone would like to try it out, here is the sample output and the test script used.
script output
[User@Host:]$ perl testCOMP.pl
$VAR1 = {
    'mt' => {
        'Source' => [
            'source|NULL'
        ]
    },
    'ms' => {
        'Target' => [
            'NULL|target'
        ]
    },
    'diff' => {
        'Sunday_isit' => [
            'Yes|No'
        ]
    }
};
Test Script
[User@Host:]$ cat testCOMP.pl
#!/usr/bin/env perl

use lib $ENV{PWD};
use COMP;
use strict;
use warnings;
use Data::Dumper;

my $data2 = {
    f1 => {
        Amitabh     => 'Bacchan',
        YellowSun   => 'Yes',
        Sunday_isit => 'Yes',
        Source      => 'source',
    },
    f2 => {
        Amitabh     => 'Bacchan',
        YellowSun   => 'Yes',
        Sunday_isit => 'No',
        Target      => 'target',
    },
};

my $result = COMP::comp($data2, 'f1', 'f2');
print Dumper $result;
[User@Host:]$
If you have an existing and working toolchain, you don't have to rewrite it all to use Perl6. Its parallelism mechanisms work fine with external processes too. Consider
allnum.pl6
use v6;

my @processes =
    [ "num1.txt", "num2.txt", "num3.txt", "num4.txt", "num5.txt" ]
    .map( -> $filename {
        [ $filename, run "perl", "num.pl", $filename, :out ];
    })
    .hyper;

say "Lazyness Here!";

my $time = time;
for @processes
{
    say "<{$_[0]} : {$_[1].out.slurp}>";
}
say time - $time, "s";
num.pl
use warnings;
use strict;

my $file = shift @ARGV;
my $start = time;
my $result = 0;

open my $in, "<", $file or die $!;
while (my $thing = <$in>)
{
    chomp $thing;
    $thing =~ s/ //g;
    $result = ($result + $thing) / 2;
}
print $result, " : ", time - $start, "s";
On my system
C:\Users\holli\tmp>perl6 allnum.pl6
Lazyness Here!
<num1.txt : 7684.16347578616 : 3s>
<num2.txt : 3307.36261498186 : 7s>
<num3.txt : 5834.32817942962 : 10s>
<num4.txt : 6575.55944995197 : 0s>
<num5.txt : 6157.63100049619 : 0s>
10s
Files were set up like so
C:\Users\holli\tmp>perl -e "for($i=0;$i<10000000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num1.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<20000000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num2.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<30000000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num3.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<400000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num4.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<5000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num5.txt
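To the pain points about reading both sources at once and getting structured results back: in Perl6 a start block returns a Promise that can be kept with any value, including a Hash, so both loads can run concurrently and be awaited together. A sketch (load_hash, the separator, and the key/value indexes are placeholders for your own layout):

use v6;

sub load_hash($file) {
    my %h;
    for $file.IO.lines -> $line {
        my @f = $line.split('|');    # assuming pipe-separated fields
        %h{@f[0]} = @f[1];           # pick your own key/value indexes
    }
    %h;
}

my $p1 = start { load_hash('s1.psv') };   # both reads run concurrently
my $p2 = start { load_hash('s2.psv') };
my ($s1, $s2) = await $p1, $p2;           # each result is a Hash
say $s1.elems, ' vs ', $s2.elems;

The comparison itself could then be spread across cores with something like $s1.keys.race.map({ ... }), though whether that pays off depends on how much work is done per key.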

Compare strings and remove more general pattern in Perl

I have an array of strings containing numbers, possibly separated by forward slashes, e.g. 754 or 754/128. These strings can have an undefined length; in other words, something such as the following is possible: 1234/34/21/120/3. In the array I want to keep only the more specialised patterns, that is, those that contain other patterns. For instance, in the first example above, 754/128 contains 754, so 754 can be removed from the array.
This concept of containing is as broad as one would expect, maybe even broader: it is similar to how you look at a directed graph where each slash in the pattern refers to a step forward. The contained pattern can be of arbitrary length, as long as it is inside the containing pattern one way or another. This means that the small path can occur in any (chronologically correct) form. For instance, 903/900 is contained in 903/902/900 even though the pattern is 'split open'. One way to visualise this is: in the small path we go from point A to point B. In the larger path we also go from point A to B, but we make a stop at C. The larger path visits more places than the small path without missing out on anything. As a consequence, the smaller path can occur in any split-up form, as long as the order of the path is respected. As an example:
2/5 - 1/2/3/4/5
# included
5/2 - 1/2/3/4/5
# not included
What I mean here is that the relative order of the 'contained' items should be identical in the large path. For instance, 1/3/2 'matches' in 1/5/3/4/2 because the order is identical in the small and large path: 1 comes in a position before 3, which in turn is in some position before 2. 1/2/3, 2/1/3 and so on would NOT match the larger path 1/5/3/4/2, even though they are valid paths with the same items, because the order of occurrence is different.
The above example also illustrates that the items in the small pattern can occur anywhere in the large pattern, not only in the first and last position or in subsequent positions. In other words, all included paths for 1/2/3/4 are:
1/2
1/2/3
1/3
1/4
2/3
2/3/4
2/4
3/4
I am looking for an efficient way to remove paths in a given array that are included in others from that same array.
I got this far, but I'm not sure how I should efficiently check the contains relationship between two items.
#!/usr/bin/perl
my @arr = ("903/900", "903/902/900", "903/904/902/901", "903/904/902/908/900", "903");
my @res = ();

OUTER: for (my $i = 0; $i < @arr; $i++) {
    my $first = $arr[$i];
    my $j = $i + 1;
    INNER: while ($j < @arr) {
        my $second = $arr[$j];
        &compare_paths($first, $second);
        $j++;
    }
}

sub compare_paths {
    my ($first, $second) = @_;
    @first_items  = split(/\//, $first);
    @second_items = split(/\//, $second);
    # Compare values from 1 and 2
}
The expected output for the code above would be
@res = ("903/904/902/901", "903/904/902/908/900");
Reasons for removal:
903/900 contained in 903/902/900
903/902/900 contained in 903/904/902/908/900
903 contained in 903/904/902/901
How can I implement such an algorithm efficiently? My main idea would be to check whether an item of @first_items is present in $second; if not, continue, but if so, check whether a second item is also present, and if so, check its substring position, which has to be larger than the first item's substring position. Continue for each item (and the other way around for @second_items and $first) until all strings are matched. (If it helps for speed, the initial array can be swapped for a hash with the former array as the keys.)
I expect that there are general algorithms that can account for this problem, and probably libraries that can be leveraged. However, here is a hand-rolled one.
First, we sort the array by the number of terms in the path. Then we go up that array, comparing each element with all longer ones. This way each path is excluded at the earliest opportunity.
The comparison is between arrays obtained by splitting on /. It checks whether all elements of the smaller array are in the larger one as an exact sub-sequence, so that the larger one would yield the smaller one by only removing elements (without rearrangements).
use warnings;
use strict;

my @arr = qw(902/904 903/900 903/902/900 903/904/902/901
             903/904/902/908/900 903);

my @sorted = sort { (split '/', $a) <=> (split '/', $b) } @arr;

my @primes;
OUTER:
for my $i (0..$#sorted) {
    for my $j ($i+1..$#sorted) {
        next OUTER if is_contained($sorted[$i], $sorted[$j]);
    }
    push @primes, $sorted[$i];
}
print "@primes\n";
sub is_contained
{
    my ($small, $large) = @_;
    my @small = split '/', $small;
    my @large = split '/', $large;

    # There can be no duplicates so equal-length paths are distinct
    return 0 if @small == @large;

    # Indices of elements of @small in @large cannot decrease
    my ($match, $index) = (0, 0);
    for my $sm (@small) {
        for my $i (0..$#large) {
            $sm == $large[$i] || next;
            return 0 if $i < $index;  # out of order
            $index = $i;
            $match = 1;
            last;
        }
        return 0 if not $match;  # $sm from @small not in @large
        $match = 0;
    }
    return 1;
}
Prints the line: 902/904 903/904/902/901 903/904/902/908/900.
A note on how we check that @small matches a subsequence in @large.
Once an element of @small is found in @large, its index in @large
cannot be lower than the one found previously: an element must come after the previous one, not before. See a different procedure below.
So with 2/7/5 and 1/2/5/7/8, first 2 is found at index 1, then 7 at index 3, then 5 but at index 2. The subsequence 2-5-7 does not match 2-7-5. I added 902/904 to the data to test for this.
This is an alternate procedure for checking whether a path is contained in another.
Once it finds an element of @small in @large, it searches for the next one starting from the next index in @large. This way it skips the already-searched part of the path, but it cannot detect out-of-order elements early either.
With the example of 2/7/5 and 1/2/5/7/8, after it finds 7 at index 3 it starts from index 4 and detects failure by not finding a 5 in the rest of the target path.
sub is_contained_2
{
    my @large = split '/', $_[0];
    my @small = split '/', $_[1];

    # Is @small found in @large as an exact sub-sequence?
    my ($match, $j) = (0, 0);
    for my $sm (@small) {
        for my $i ($j..$#large) {
            $sm == $large[$i] || next;
            $j = $i+1, $match = 1;
            last;
        }
        return 0 if not $match;
        $match = 0;
    }
    return 1;
}
This is slower (by 10-15%) for this data set; see the benchmark below.
I benchmarked the two array-based versions here and ikegami's regex+trie approach. So far I have used only the specific data set from the question, with 902/904 added.
use warnings;
use strict;
use Benchmark qw(cmpthese);

my $secs_to_run = shift || 10;

my @arr = ('902/904', '903/900', '903/902/900', '903/904/902/901',
           '903/904/902/908/900', '903');

# sorted array checked shorter-to-longer, manual iterations
sub contained {
    my ($rarr) = @_;
    my @arr = @$rarr;
    # program copied from this post
    return \@primes;
}
sub is_contained { ... }  # copied

# Same program, but using is_contained_2()
sub contained_2 { ... }
sub is_contained_2 { ... }

# Regex-trie, copied from ikegami's post
sub add { my $p = \shift; $p = \( $$p->{$_} ) for @_, ''; }
sub as_pat { my $trie = shift; ... }  # copied
sub regex_trie {
    my ($rpaths) = @_;
    my @paths = @$rpaths;
    # program copied from ikegami's post
    return \@filtered_paths;
}

cmpthese(-$secs_to_run, {
    containted => sub { my $rprimes   = contained(\@arr) },
    cont_next  => sub { my $rprimes   = contained_2(\@arr) },
    regex_trie => sub { my $rfiltered = regex_trie(\@arr) },
});
With bench_cont.pl 300, on a newer workstation-laptop (2.5GHz) with v5.16
Rate regex_trie cont_next containted
regex_trie 15264/s -- -15% -27%
cont_next 17946/s 18% -- -14%
containted 20939/s 37% 17% --
on an older server (2.8GHz) with v5.16
Rate regex_trie cont_next containted
regex_trie 11750/s -- -13% -27%
cont_next 13537/s 15% -- -16%
containted 16042/s 37% 19% --
on an older server (3.5GHz) with v5.10
Rate cont_next regex_trie containted
cont_next 12266/s -- -17% -17%
regex_trie 14832/s 21% -- -0%
containted 14845/s 21% 0% --
This surprised me, as I expected the regex-based solution to be fastest.
I expect the trend to reverse for data composed of longer paths, having more distinct (not contained) paths, with containment found later in the path, and with a few out-of-order dismissals.
I'll add tests once I get to generate such data, or once it is provided.
To track some of the processing, change the body to
use feature 'say';

OUTER:
for my $i (0..$#sorted) {
    say "Check $sorted[$i]";
    for my $j ($i+1..$#sorted) {
        my $is_inside = is_contained($sorted[$i], $sorted[$j]);
        say "\t$is_inside: $sorted[$i] vs. $sorted[$j]";
        next OUTER if $is_inside;
    }
    push @primes, $sorted[$i];
}
say "\nNot contained: @primes";
This prints
Check 903
0: 903 vs. 902/904
1: 903 vs. 903/900
Check 902/904
0: 902/904 vs. 903/900
0: 902/904 vs. 903/902/900
0: 902/904 vs. 903/904/902/901
0: 902/904 vs. 903/904/902/908/900
Check 903/900
1: 903/900 vs. 903/902/900
Check 903/902/900
0: 903/902/900 vs. 903/904/902/901
1: 903/902/900 vs. 903/904/902/908/900
Check 903/904/902/901
0: 903/904/902/901 vs. 903/904/902/908/900
Check 903/904/902/908/900
Not contained: 902/904 903/904/902/901 903/904/902/908/900
To optimize what you have so far, I'd suggest that you pre-split all elements of the array (and then re-merge them later):
@arr = map [split "/", $_], @arr;
With that done:
sub contains(\@\@) {
    my ($larger_ref, $smaller_ref) = @_;
    return '' if @$larger_ref <= @$smaller_ref;
    my ($i, $j) = (0, 0);
    while ($i < @$larger_ref && $j < @$smaller_ref) {
        ++$j if $larger_ref->[$i] == $smaller_ref->[$j];
        ++$i;
    }
    return $j == @$smaller_ref;
}

I: for (my $i = 0; $i < @arr; ++$i) {
    J: for (my $j = 0; $j < @arr; ++$j) {
        next J if $j == $i;
        next I if contains @{$arr[$j]}, @{$arr[$i]};
    }
    push @res, join '/', @{$arr[$i]};
}
There are some potential further optimizations you can make in contains (for example, it might make sense to abort early if/when @$larger_ref - $i < @$smaller_ref - $j), but you'd want to test: they might turn out to be pessimizations.
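That early abort could look like this (an untested sketch of the same loop):

while ($i < @$larger_ref && $j < @$smaller_ref) {
    # give up as soon as the unscanned tail of the larger path is
    # shorter than what remains unmatched in the smaller path
    return '' if @$larger_ref - $i < @$smaller_ref - $j;
    ++$j if $larger_ref->[$i] == $smaller_ref->[$j];
    ++$i;
}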
If a/b/c is a path, you want to remove the paths for which the following is true:
"/$path" =~ m{ ^ (?:/a)?+ (?:/b)?+ (?:/c)?+ \z }x && $path ne 'a/b/c'
That can also be written as
"/$path" =~ m{ ^ (?:/a)?+ (?:/b)?+ (?:/c)?+ \z (?<! ^ /a/b/c ) }x
If a/b/c, a/i/j, a/x/y and d/e/f are all paths, you want to remove the paths for which the following is true:
"/$path" =~ m{
^ (?:/a)?+ (?:/b)?+ (?:/c)?+ \z (?<! ^ /a/b/c )
| ^ (?:/a)?+ (?:/i)?+ (?:/j)?+ \z (?<! ^ /a/i/j )
| ^ (?:/a)?+ (?:/x)?+ (?:/y)?+ \z (?<! ^ /a/x/y )
| ^ (?:/d)?+ (?:/e)?+ (?:/f)?+ \z (?<! ^ /d/e/f )
}x
We removed a lot of the backtracking by using the possessive modifier (+), but there is still a possibility for backtracking because of the common prefixes. So let's remove them!
"/$path" =~ m{
^
(?: (?:/a)?+ (?: (?:/b)?+ (?:/c)?+ \z (?<! ^ /a/b/c )
| (?:/i)?+ (?:/j)?+ \z (?<! ^ /a/i/j )
| (?:/x)?+ (?:/y)?+ \z (?<! ^ /a/x/y )
)
| (?:/d)?+ (?:/e)?+ (?:/f)?+ \z (?<! ^ /d/e/f )
)
}x
Now we have an efficient solution!
The following uses a trie to remove the common prefixes.
use strict;
use warnings;
use feature qw( say );

sub add {
    my $p = \shift;
    $p = \( $$p->{$_} ) for @_, '';
}

sub as_pat {
    my $trie = shift;
    my @sub_pats =
        map { $_ eq '' ? '' : $_ . as_pat($trie->{$_}) }
        keys(%$trie);
    if (@sub_pats == 1) {
        return $sub_pats[0];
    } else {
        return '(?:'.join('|', @sub_pats).')';
    }
}

my @paths = ( "903/900", "903/902/900", "903/904/902/901", "903/904/902/908/900", "903" );

my $trie;
add($trie, map({ "(?:/\Q$_\E)?+" } split qr{/}), "\\z(?<!^/\Q$_\E)" )
    for @paths;

my $pat = as_pat($trie);
my $re = qr{^(?:$pat)};

my @filtered_paths = grep { "/$_" !~ /$re/ } @paths;

say for @filtered_paths;
Say N is the number of paths, and O(M) bounds the length of the paths. Like earlier answers, this one completes in O(N^2 * M^2) time, so it doesn't scale any better. However, I believe you will find mine faster should you benchmark them.
Say N is the number of paths, and O(M) bounds the length of the paths; the earlier answers complete in O(N^2 * M^2) time.
The following solution is O(N * 2^M). This means it can handle very large numbers of paths much more efficiently, as long as the paths are rather short (because it effectively becomes O(N^2) vs O(N)). It does require far more memory than the solutions in the earlier answers.
use strict;
use warnings;
use feature qw( say );

sub fetch {
    my $trie = shift;
    for (@_, '') {
        return () if !$trie;
        $trie = $trie->{$_};
    }
    return $trie;
}

sub add {
    local *_helper = sub {
        my $trie_ptr = \shift;
        my $exact = shift;
        if (@_) {
            my $lead = shift(@_);
            _helper($$trie_ptr->{$lead}, $exact, @_);
            _helper($$trie_ptr, 0, @_);
        } else {
            if ($exact) {
                $$trie_ptr->{''} ||= 1;
            } else {
                $$trie_ptr->{''} = 2;
            }
        }
    };

    my $trie_ptr = \shift;
    return _helper($$trie_ptr, 1, @_);
}

my @paths = ( "903/900", "903/902/900", "903/904/902/901", "903/904/902/908/900", "903" );

my @split_paths = map [ split qr{/} ], @paths;

my $trie;
add($trie, @$_)
    for @split_paths;

use Data::Dumper qw( Dumper );
local $Data::Dumper::Sortkeys = 1;
print(Dumper($trie));

my @filtered_paths =
    map join('/', @$_),
    grep { fetch($trie, @$_) == 1 }
    @split_paths;

say for @filtered_paths;

Need simple parallelism example in Perl 6

I am trying to learn Perl 6 and parallelism/concurrency at the same time.
For a simple learning exercise, I have a folder of 550 '.htm' files and I want the total sum of lines of code among all of them. So far, I have this:
use v6;

my $start_time = now;
my $exception;
my $total_lines = 0;
my @files = "c:/testdir".IO.dir(test => / '.' htm $/);

for @files -> $file {
    $total_lines += $file.lines.elems;
    CATCH {
        default { $exception = $_; }  # some of the files error out for malformed UTF-8
    }
}
say $total_lines;
say now - $start_time;
That gives a sum of 577,449 in approximately 3 seconds.
How would I rewrite that to take advantage of Perl 6 parallelism ideas? I realize the time saved won't be much but it will work as proof of concept.
Implementing Christoph's suggestion below. The count is slightly higher than in my original post because I'm now able to read in the malformed UTF-8 files using the latin1 encoding.
use v6;

my $start_time = now;
my @files = "c:/iforms/live".IO.dir(test => / '.' htm $/);
my $total_lines = [+] @files.race.map(*.lines(:enc<latin1>).elems);
say $total_lines;
say now - $start_time;
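An alternative takes the start/await route, spawning one task per file and collecting each file's line count into a shared array: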
use v6;

my $start_time = now;
my $exception;
my @lines;
my $total_lines = 0;
my @files = "c:/testdir".IO.dir(test => / '.' htm $/);

await do for @files -> $file {
    start {
        @lines.push( $file.lines.elems );
        CATCH {
            default { $exception = $_; }  # some of the files error out for malformed UTF-8
        }
    }
}
$total_lines = [+] @lines;
say $total_lines;
say now - $start_time;

Parsing large XML files?

I have 2 XML files, one 115 MB in size and another 34 MB.
While reading file A, there is a field called desc that relates it to file B: I retrieve the field id from file B where desc of file A is equal to name of file B.
File A is already too big, and for every record I have to search inside file B, so it takes a very long time to complete.
How could I speed up this process, or what would be a better approach to do it?
Current code I am using:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple qw(:strict XMLin);

my $npcs = XMLin('Client/client_npcs.xml', KeyAttr => { }, ForceArray => [ 'npc_client' ]);
my $strings = XMLin('Client/client_strings.xml', KeyAttr => { }, ForceArray => [ 'string' ]);
my ($nameid, $rank);

open (my $fh, '>>', 'Output/npc_templates.xml');
print $fh "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<npc_templates xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:noNamespaceSchemaLocation=\"npcs.xsd\">\n";
foreach my $npc ( @{ $npcs->{npc_client} } ) {
    if (defined $npc->{desc}) {
        foreach my $string (@{ $strings->{string} }) {
            if (defined $string->{name} && $string->{name} =~ /$npc->{desc}/i) {
                $nameid = $string->{id};
                last;
            }
        }
    } else {
        $nameid = "";
    }
    if (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 25 && $npc->{hpgauge_level} < 28) {
        $rank = 'LEGENDARY';
    } elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 21 && $npc->{hpgauge_level} < 23) {
        $rank = 'HERO';
    } elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 10 && $npc->{hpgauge_level} < 15) {
        $rank = 'ELITE';
    } elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 0 && $npc->{hpgauge_level} < 11) {
        $rank = 'NORMAL';
    } else {
        $rank = $gauge;
    }
    print $fh qq|\t<npc_template npc_id="$npc->{id}" name="$npc->{name}" name_id="$nameid" height="$npc->{scale}" rank="$rank" tribe="$npc->{tribe}" race="$npc->{race_type}" hp_gauge="$npc->{hpgauge_level}"/>\n|;
}
print $fh "</npc_templates>";
close($fh);
example of file A.xml:
<?xml version="1.0" encoding="utf-16"?>
<npc_clients>
<npc_client>
<id>200000</id>
<name>SkillZone</name>
<desc>STR_NPC_NO_NAME</desc>
<dir>Monster/Worm</dir>
<mesh>Worm</mesh>
<material>mat_mob_reptile</material>
<show_dmg_decal>0</show_dmg_decal>
<ui_type>general</ui_type>
<cursor_type>none</cursor_type>
<hide_path>0</hide_path>
<erect>1</erect>
<bound_radius>
<front>1.200000</front>
<side>3.456000</side>
<upper>3.000000</upper>
</bound_radius>
<scale>10</scale>
<weapon_scale>100</weapon_scale>
<altitude>0.000000</altitude>
<stare_angle>75.000000</stare_angle>
<stare_distance>20.000000</stare_distance>
<move_speed_normal_walk>0.000000</move_speed_normal_walk>
<art_org_move_speed_normal_walk>0.000000</art_org_move_speed_normal_walk>
<move_speed_normal_run>0.000000</move_speed_normal_run>
<move_speed_combat_run>0.000000</move_speed_combat_run>
<art_org_speed_combat_run>0.000000</art_org_speed_combat_run>
<in_time>0.100000</in_time>
<out_time>0.500000</out_time>
<neck_angle>90.000000</neck_angle>
<spine_angle>10.000000</spine_angle>
<ammo_bone>Bip01 Head</ammo_bone>
<ammo_fx>skill_stoneshard.stoneshard.ammo</ammo_fx>
<ammo_speed>50</ammo_speed>
<pushed_range>0.000000</pushed_range>
<hpgauge_level>3</hpgauge_level>
<magical_skill_boost>0</magical_skill_boost>
<attack_delay>2000</attack_delay>
<ai_name>SummonSkillArea</ai_name>
<tribe>General</tribe>
<pet_ai_name>Pet</pet_ai_name>
<sensory_range>15.000000</sensory_range>
</npc_client>
</npc_clients>
example of file B.xml:
<?xml version="1.0" encoding="utf-16"?>
<strings>
<string>
<id>350000</id>
<name>STR_NPC_NO_NAME</name>
<body> </body>
</string>
</strings>
Here is an example of XML::Twig usage. The main advantage is that it does not hold the whole file in memory, so processing is much faster. The code below tries to emulate the operation of the script from the question.
use XML::Twig;

my %strings = ();
XML::Twig->new(
    twig_handlers => {
        'strings/string' => sub {
            $strings{ lc $_->first_child('name')->text }
                = $_->first_child('id')->text
        },
    }
)->parsefile('B.xml');

print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<npc_templates xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:noNamespaceSchemaLocation=\"npcs.xsd\">\n";
XML::Twig->new(
    twig_handlers => {
        'npc_client' => sub {
            my $nameid = eval { $strings{ lc $_->first_child('desc')->text } };
            # calculate rank as needed
            my $hpgauge_level = eval { $_->first_child('hpgauge_level')->text };
            my $rank = $hpgauge_level >= 28 ? 'ERROR'
                : $hpgauge_level > 25 ? 'LEGENDARY'
                : $hpgauge_level > 21 ? 'HERO'
                : $hpgauge_level > 10 ? 'ELITE'
                : $hpgauge_level > 0  ? 'NORMAL'
                : $hpgauge_level;
            my $npc_id    = eval { $_->first_child('id')->text };
            my $name      = eval { $_->first_child('name')->text };
            my $tribe     = eval { $_->first_child('tribe')->text };
            my $scale     = eval { $_->first_child('scale')->text };
            my $race_type = eval { $_->first_child('race_type')->text };
            print
                qq|\t<npc_template npc_id="$npc_id" name="$name" name_id="$nameid" height="$scale" rank="$rank" tribe="$tribe" race="$race_type" hp_gauge="$hpgauge_level"/>\n|;
            $_->purge;
        }
    }
)->parsefile('A.xml');
print "</npc_templates>";
Grab all the interesting 'desc' fields from file A and put them in a hash. You only have to parse it once, but if it still takes too long, have a look at XML::Twig.
Parse file B once and extract the stuff you need. Use the hash.
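With the XML::Simple structures already built in the question, that idea might look like this (a sketch; %desc_to_id is a name invented here):

# collect desc values from file A, then map them to ids in one pass over file B
my %desc_to_id;
$desc_to_id{ lc $_->{desc} } = ""
    for grep { defined $_->{desc} } @{ $npcs->{npc_client} };

for my $string (@{ $strings->{string} }) {
    next unless defined $string->{name};
    my $name = lc $string->{name};
    $desc_to_id{$name} = $string->{id} if exists $desc_to_id{$name};
}

# the inner loop of the question then becomes a hash lookup:
# $nameid = defined $npc->{desc} ? $desc_to_id{ lc $npc->{desc} } : "";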
Looks like you only need parts of the XML files. XML::Twig can parse only the elements you are interested in and throw away the rest using the "twig_roots" parameter. XML::Simple is easier to get started with, though.
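A minimal sketch of twig_roots (the handler body and file name are placeholders):

use XML::Twig;

# build only <npc_client> elements; everything outside them is never kept
XML::Twig->new(
    twig_roots => {
        'npc_client' => sub {
            my ($twig, $npc) = @_;
            print $npc->first_child('id')->text, "\n";
            $twig->purge;    # release what has been parsed so far
        },
    },
)->parsefile('Client/client_npcs.xml');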
Although I can't help you with the specifics of your Perl code, there are some general guidelines when dealing with large volumes of XML data. There are, broadly speaking, two kinds of XML APIs: DOM-based and stream-based. DOM-based APIs (like XML DOM) parse the entire XML document into memory before the user-level API becomes "available", whereas with a stream-based API (like SAX) the implementation doesn't need to parse the whole XML document. One benefit of stream-based parsers is that they typically use much less memory, as they don't need to hold the entire XML document in memory at once; this is obviously a good thing when dealing with large XML documents. Looking at the XML::Simple docs, it seems there may be SAX support available; have you tried this?
I'm not a Perl guy, so take this with a grain of salt, but I see two problems:
The fact that you are iterating over all of the values in file B until you find the correct value for each element in file A is inefficient. Instead, you should be using some sort of map/dictionary for the values in file B.
It looks like you are parsing both files into memory before you even start processing. File A would be better processed as a stream instead of loading the entire document into memory.
