I am trying to learn Perl 6 and parallelism/concurrency at the same time.
For a simple learning exercise, I have a folder of 550 '.htm' files and I want the total sum of lines of code among all of them. So far, I have this:
use v6;
my $start_time = now;
my $exception;
my $total_lines = 0;
my @files = "c:/testdir".IO.dir(test => / '.' htm $/);
for @files -> $file {
    $total_lines += $file.lines.elems;
    CATCH {
        default { $exception = $_; } # some of the files error out for malformed UTF-8
    }
}
say $total_lines;
say now - $start_time;
That gives a sum of 577,449 in approximately 3 seconds.
How would I rewrite that to take advantage of Perl 6 parallelism ideas? I realize the time saved won't be much but it will work as proof of concept.
Implementing Christoph's suggestion. The count is slightly higher than in my original post because I can now read the malformed UTF-8 files by decoding them as Latin-1.
use v6;
my $start_time = now;
my @files = "c:/iforms/live".IO.dir(test => / '.' htm $/);
my $total_lines = [+] @files.race.map(*.lines(:enc<latin1>).elems);
say $total_lines;
say now - $start_time;
use v6;
my $start_time = now;
my $exception;
my @lines;
my $total_lines = 0;
my @files = "c:/testdir".IO.dir(test => / '.' htm $/);
await do for @files -> $file {
    start {
        @lines.push( $file.lines.elems );
        CATCH {
            default { $exception = $_; } # some of the files error out for malformed UTF-8
        }
    }
}
$total_lines = [+] @lines;
say $total_lines;
say now - $start_time;
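If the plain race call doesn't spread the work the way you want, it also accepts tuning parameters; here is a minimal sketch of the first version with them (the batch and degree values are guesses to experiment with, not recommendations):
use v6;
my $start_time = now;
my @files = "c:/iforms/live".IO.dir(test => / '.' htm $/);
# batch is how many files each work unit gets; degree is the number of workers
my $total_lines = [+] @files.race(batch => 8, degree => 4)
                            .map(*.lines(:enc<latin1>).elems);
say $total_lines;
say now - $start_time;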
I have two txt files that I load into hashes, sequence => title.
In the file DUOMENYS.txt the title is known; in the file DUOTA.txt the title is unknown.
So for every sequence in DUOTA.txt I need to find a similar sequence in DUOMENYS.txt and then print the known title.
I tried to do this with simple matching, printing the title when more than 90% of the sequence symbols match, but I was told that it is wrong and that I have to do it another way, with this table: http://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt
I have to compare the letters of the known sequence and the unknown sequence and get a number from -4 to 9 for each pair;
if the sum of all those numbers >= length of the sequence * 3, print that title.
Example: ABW => TITLE1 in DUOMENYS.txt, ABF => UNKNOWNTITLE in DUOTA.txt:
A B W
A B F
4 4 1   sum = 9
length 3 x 3 = 9
9 >= 9, true, so print the title.
So the problem is I don't know how to make it happen....
#!/usr/bin/perl
use strict;
use Data::Dumper;
open (OUTPUT, ">OUTPUT.txt") or die "$!"; # known sequences
open (DUOMENYS, "DUOMENYS.txt") or die "$!";
open (OUTPUT1, ">OUTPUT1.txt") or die "$!"; # query sequences
open (DUOTA, "DUOTA.txt") or die "$!";
open (OUTPUT2, ">OUTPUT2.txt") or die "$!"; # results
open (MATRIX, "MATRIX.txt") or die "$!";
#--------------------DUOMENYS-HASH-----------------------------
#my $contentFile = $ARGV[0];
my %testHash = ();
my $currentKey = "";
my $seka = "";
my %nhash = ();
while (my $contentLine = <DUOMENYS>) {
    chomp($contentLine);
    next if ($contentLine eq ""); # skip empty lines
    if ($contentLine =~ /^>(.*)/) {
        $testHash{$currentKey} = $seka if $currentKey ne "";
        $currentKey = $1;
        $seka = "";
    } else {
        $seka .= $contentLine;
    }
}
$testHash{$currentKey} = $seka if $currentKey ne ""; # keep the last record
#-------------------DUOTA-HASH-------------------------------------
#my $contentFile1 = $ARGV[0];
my %testHash1 = ();
my $currentKey1 = "";
my $seka1 = "";
my %nhash1 = ();
while (my $contentLine1 = <DUOTA>) {
    chomp($contentLine1);
    next if ($contentLine1 eq ""); # skip empty lines
    if ($contentLine1 =~ /^>(.*)/) {
        $testHash1{$currentKey1} = $seka1 if $currentKey1 ne "";
        $currentKey1 = $1;
        $seka1 = "";
    } else {
        $seka1 .= $contentLine1;
    }
}
$testHash1{$currentKey1} = $seka1 if $currentKey1 ne ""; # keep the last record
#--------------------OUTPUT-HASH------------------------------------
%nhash = reverse %testHash;
print OUTPUT Dumper(\%nhash);
%nhash1 = reverse %testHash1;
print OUTPUT1 Dumper(\%nhash1);
#---------------------MATCHING---------------------------------------
my $klaidu_skaicius = 0; # mismatch count
my @sekos = keys %nhash;
my @duotos_sekos = keys %nhash1;
for (my $i = 0; $i < scalar @sekos; $i++) {
    for (my $j = 0; $j < scalar @duotos_sekos; $j++) {
        $klaidu_skaicius = ($sekos[$i] ^ $duotos_sekos[$j]) =~ tr/\0//c;
        if ($klaidu_skaicius <= length($sekos[$i]) / 10) {
            print OUTPUT2 substr($nhash{$sekos[$i]}, 0, 9), "\n";
        }
    }
}
From comments
pastebin.com/7QnBDTDY – povilito May 30 at 11:57
It's too big (15 MB) for pastebin.com – povilito May 30 at 12:01
filedropper.com/duomenys – povilito May 30 at 12:04
I think comparing a letter with " " (space) or "" should give us the number -4 – povilito May 30 at 12:28
It's enough to find one title for one unknown sequence. – povilito May 30 at 12:45
So if there are 50 unknown sequences, the output file should give us 50 titles; some titles can be the same :)
Here is a basic version of a solution to your problem, as I understood it. I treated both sequences as equal in length; you can easily make changes if you need to.
#!/usr/bin/perl
use strict;
use warnings;
# convert your matrix file to 2D hash
my %hash;
open my $fh, '<', 'matrix' or die "unable to open file: $!\n";
chomp(my $heading = <$fh>);
# strip leading whitespace
$heading =~ s/^\s+//;
my @headings = split(/\s+/, $heading);
while (<$fh>) {
    chomp;
    my @line = split;
    my $key1 = shift @line;
    foreach my $i (0 .. $#line) {
        $hash{$key1}{$headings[$i]} = $line[$i];
    }
}
close $fh;
# Took from your example for reference
my $duameny = "ABW";
my $duota = "ABF";
my @duamenys = split(//, $duameny);
my @duotas = split(//, $duota);

# calculate sum from hash
# considering both sequences are equal
my $sum = 0;
foreach my $i (0 .. $#duamenys) {
    $sum += $hash{$duamenys[$i]}{$duotas[$i]};
}

# calculate length from sequence
my $length = (length $duameny) * 3;

print "SUM: $sum, Length: $length\n";
if ($sum >= $length) {
    # presumably you know how to print the title
    # print the title from DUOMENYS.txt
}
Below is a summary of my approach:
First, convert the matrix into a 2D hash for lookups.
Calculate the sum from the hash for the sequences.
Calculate the length from the sequence.
Compare them and print the title if sum >= length.
output:
SUM: 9, Length: 9
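To extend that to whole files, here is a hedged sketch (assuming %hash is the matrix lookup built above, %testHash and %testHash1 are the title => sequence hashes from your script, the sequences are equal length as in your example, and pairs missing from the matrix score -4, per the comments):
foreach my $unknown_title (keys %testHash1) {
    my $unknown_seq = $testHash1{$unknown_title};
    foreach my $known_title (keys %testHash) {
        my $known_seq = $testHash{$known_title};
        my $sum = 0;
        foreach my $i (0 .. length($unknown_seq) - 1) {
            my $score = $hash{ substr($known_seq, $i, 1) }{ substr($unknown_seq, $i, 1) };
            $sum += defined $score ? $score : -4;
        }
        if ($sum >= 3 * length $unknown_seq) {
            print "$known_title\n";
            last; # one title per unknown sequence is enough, per the comments
        }
    }
}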
I'm trying to download all fasta files associated with one organism from ncbi.
I tried wget -r -l3 -A "*.fna.gz" ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Microcystis_aeruginosa/ to get all files ending in .fna.gz from the third level down, but then it just rejects everything with the following output:
Removed “ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Microcystis_aeruginosa/latest_assembly_versions/.listing”.
Rejecting “GCF_000010625.1_ASM1062v1”.
Rejecting “GCF_000307995.1_ASM30799v2”.
Rejecting “GCF_000312165.1_ASM31216v1”.
Rejecting “GCF_000312185.1_ASM31218v1”.
Rejecting “GCF_000312205.1_ASM31220v1”.
Rejecting “GCF_000312225.1_ASM31222v1”.
Rejecting “GCF_000312245.1_ASM31224v1”.
Rejecting “GCF_000312265.1_ASM31226v1”.
Rejecting “GCF_000312285.1_ASM31228v1”.
Rejecting “GCF_000312725.1_ASM31272v1”.
Rejecting “GCF_000330925.1_MicAerT1.0”.
Rejecting “GCF_000332585.1_MicAerD1.0”.
Rejecting “GCF_000412595.1_spc777-v1”.
Rejecting “GCF_000599945.1_Mic70051.0”.
Rejecting “GCF_000787675.1_ASM78767v1”.
Rejecting “GCF_000981785.1_ASM98178v1”.
Any ideas on why it's rejecting these directories? Thanks for your help.
Not exactly sure why it's rejecting your request, but when I was still doing this type of thing, I found that if I didn't download queries in smaller batches, the NCBI server timed me out and blocked my IP for a while before I could download again. This doesn't seem to be the same problem you're seeing, but maybe this script will get the same thing done. Let me know if this helps.
#!/usr/bin/env python
from Bio import Entrez

search_term = raw_input("Organism name: ")
Entrez.email = "your_email@isp.com"  # required by NCBI
search_handle = Entrez.esearch(db="nucleotide", term=search_term, usehistory="y")
search_results = Entrez.read(search_handle)
search_handle.close()

gi_list = search_results["IdList"]
count = int(search_results["Count"])
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]

batch_size = 5  # download sequences in batches so NCBI doesn't time you out
with open("ALL_SEQ.fasta", "w") as out_handle:
    for start in range(0, count, batch_size):
        end = min(count, start + batch_size)
        print "Going to download record %i to %i" % (start + 1, end)
        fetch_handle = Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text",
                                     retstart=start, retmax=batch_size,
                                     webenv=webenv, query_key=query_key)
        data = fetch_handle.read()
        fetch_handle.close()
        out_handle.write(data)

print("\nDownload completed")
I found a Perl script from here that gets me close to accomplishing this task. Unfortunately, the script just returns the IDs of the genomes, not the actual sequences.
For example, the head of my output is:
gi|425458296|ref|NZ_CAIN00000000.1|NZ_CAIN01000000 Microcystis aeruginosa PCC 9808, whole genome shotgun sequencing project
gi|425448636|ref|NZ_CAIK00000000.1|NZ_CAIK01000000 Microcystis aeruginosa PCC 7941, whole genome shotgun sequencing project
Any perl users know what's going on?
use strict;
use LWP::Simple;

my ($name, $outname, $url, $xml, $out, $count, $query_key, $webenv, $ids);
my @genomeId;
my $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $limit = 'wgs[prop]+AND+srcdb+refseq[prop])';
my @species = ('Microcystis aeruginosa');

foreach my $s (@species) {
    undef @genomeId;
    $query_key = $webenv = '';
    $s =~ s/ /+/g;

    # ESearch
    $url = $base . "esearch.fcgi?db=genome&term=$s";
    $xml = get($url);
    $count = $1 if ($xml =~ /<Count>(\d+)<\/Count>/);
    if ($count > 30) {
        $url = $base . "esearch.fcgi?db=genome&term=$s&retmax=$count";
        $xml = get($url);
    }
    while ($xml =~ /<Id>(\d+?)<\/Id>/gs) {
        push(@genomeId, $1);
    }
    $ids = join(',', @genomeId);

    # ELink
    $url = $base . "elink.fcgi?dbfrom=genome&db=nuccore&cmd=neighbor_history&id=$ids&term=$limit";
    $xml = get($url);
    $query_key = $1 if ($xml =~ /<QueryKey>(\d+)<\/QueryKey>/);
    $webenv = $1 if ($xml =~ /<WebEnv>(\S+)<\/WebEnv>/);

    # EFetch
    $url = $base . "efetch.fcgi?db=nuccore&query_key=$query_key&WebEnv=$webenv&rettype=fasta&retmode=text";
    $out = get($url);
    open (OUT, ">$s.fna");
    print OUT $out;
    close OUT;
}
I'm looking for a way to monitor a directory for a new file creation or a drop.
So if I have a folder c:\temp and an abc.txt is copied/created in it, I want an event or something so that I can pick up that file and then process it.
Also, I want continuous monitoring of this folder. How can I do that? I am writing a service which does all this; I want to incorporate the monitoring and processing in one script.
Thanks in advance.
The answer is here: In Perl, how can I watch a directory for changes?
For Linux:
use File::ChangeNotify;
my $watcher = File::ChangeNotify->instantiate_watcher(
    directories => [ 'archive/oswiostat' ],
    filter      => qr/\Aoracleapps[.].*dat\z/,
);
while (my @events = $watcher->wait_for_events) {
    # ...
}
I think you are using Windows, so you have to use Win32::ChangeNotify.
Example from: http://www.perlmonks.org/?node_id=306175
use strict;
use Win32::ChangeNotify;

our $PATH ||= '.';
our $S = defined $S ? 1 : 0;
my $notify = Win32::ChangeNotify->new( $PATH, $S, 'FILE_NAME' );

my %last; @last{ glob $PATH . '/*' } = ();

while( 1 ) {
    print('Nothing changed'), next
        unless $notify->wait( 10_000 ); # Check every 10 seconds
    $notify->reset;
    print 'Something changed';
    my @files = glob $PATH . '/*';

    if( @files < scalar keys %last ) {
        delete @last{ @files };
        print 'These files were deleted: ';
        print for keys %last;
    }
    elsif( @files > scalar keys %last ) {
        my %temp;
        @temp{ @files } = ();
        delete @temp{ keys %last };
        print 'These files were created: ';
        print for keys %temp;
    }
    else {
        print "A non-deletion or creation change occurred";
    }

    undef %last;
    @last{ @files } = ();
}
I have a few files similar to the one below, and I am trying to do numeric profiling as mentioned in the image.
>File Sample
attttttttttttttacgatgccgggggatgcggggaaatttccctctctctctcttcttctcgcgcgcg
aaaaaaaaaaaaaaagcgcggcggcgcggasasasasasasaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
I have to take each substring of size 2, map it to 33 values for different properties, and then add them up over a window of size 5.
my %temp = (
    aCount => {
        aa => 2,
    },
    cCount => {
        aa => 0,
    },
);
My current implementation is as follows:
while (<FILE>) {
    my $line = $_;
    chomp $line;
    while ($line =~ /(.{2})/og) {
        $subStr = $1;
        if (exists $temp{aCount}{$subStr}) {
            push @{$temp{aCount_array}}, $temp{aCount}{$subStr};
            if (scalar(@{$temp{aCount_array}}) == $WINDOW_SIZE) {
                my $sum = eval (join('+', @{$temp{aCount_array}}));
                shift @{$temp{aCount_array}};
                # a similar approach is taken for the other 33 rules
            }
        }
        if (exists $temp{cCount}{$subStr}) {
            # similar approach
        }
        $line =~ s/.//o; # drop one character so the next pair overlaps the previous one
    }
}
Is there any other approach to increase the speed of the overall process?
Regular expressions are awesome, but they can be overkill when all you need are fixed-width substrings. Alternatives are substr:
my $len = length($line);
for (my $i = 0; $i < $len; $i += 2) {
    my $subStr = substr($line, $i, 2);
    ...
}
or unpack:
foreach my $subStr (unpack "(A2)*", $line) {
    ...
}
I don't know how much faster either of these will be than regular expressions, but I know how I would find out.
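For instance, a quick comparison with the core Benchmark module; a minimal sketch (the sample line is a placeholder, and each sub just collects the pairs):
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line = 'acgt' x 250; # placeholder input; use a real line from your data

cmpthese(-3, {
    regex => sub {
        my @pairs;
        while ($line =~ /(.{2})/g) { push @pairs, $1 }
    },
    substr => sub {
        my @pairs;
        for (my $i = 0; $i < length($line); $i += 2) {
            push @pairs, substr($line, $i, 2);
        }
    },
    unpack => sub {
        my @pairs = unpack "(A2)*", $line;
    },
});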
I have 2 XML files, one 115 MB in size and another 34 MB.
While reading file A, there is a field called desc that relates it to file B, where I retrieve the field id whenever desc in file A is equal to name in file B.
File A is already too big, and for each of its records I have to search inside file B, so it takes a very long time to complete.
How could I speed up this process, or what would be a better approach?
Current code I am using:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple qw(:strict XMLin);
my $npcs = XMLin('Client/client_npcs.xml', KeyAttr => { }, ForceArray => [ 'npc_client' ]);
my $strings = XMLin('Client/client_strings.xml', KeyAttr => { }, ForceArray => [ 'string' ]);
my ($nameid,$rank);
open (my $fh, '>>', 'Output/npc_templates.xml');
print $fh "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<npc_templates xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:noNamespaceSchemaLocation=\"npcs.xsd\">\n";
foreach my $npc ( @{ $npcs->{npc_client} } ) {
    if (defined $npc->{desc}) {
        foreach my $string (@{ $strings->{string} }) {
            if (defined $string->{name} && $string->{name} =~ /$npc->{desc}/i) {
                $nameid = $string->{id};
                last;
            }
        }
    } else {
        $nameid = "";
    }

    if (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 25 && $npc->{hpgauge_level} < 28) {
        $rank = 'LEGENDARY';
    } elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 21 && $npc->{hpgauge_level} < 23) {
        $rank = 'HERO';
    } elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 10 && $npc->{hpgauge_level} < 15) {
        $rank = 'ELITE';
    } elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 0 && $npc->{hpgauge_level} < 11) {
        $rank = 'NORMAL';
    } else {
        $rank = $npc->{hpgauge_level};
    }

    print $fh qq|\t<npc_template npc_id="$npc->{id}" name="$npc->{name}" name_id="$nameid" height="$npc->{scale}" rank="$rank" tribe="$npc->{tribe}" race="$npc->{race_type}" hp_gauge="$npc->{hpgauge_level}"/>\n|;
}
print $fh "</npc_templates>\n";
close($fh);
example of file A.xml:
<?xml version="1.0" encoding="utf-16"?>
<npc_clients>
<npc_client>
<id>200000</id>
<name>SkillZone</name>
<desc>STR_NPC_NO_NAME</desc>
<dir>Monster/Worm</dir>
<mesh>Worm</mesh>
<material>mat_mob_reptile</material>
<show_dmg_decal>0</show_dmg_decal>
<ui_type>general</ui_type>
<cursor_type>none</cursor_type>
<hide_path>0</hide_path>
<erect>1</erect>
<bound_radius>
<front>1.200000</front>
<side>3.456000</side>
<upper>3.000000</upper>
</bound_radius>
<scale>10</scale>
<weapon_scale>100</weapon_scale>
<altitude>0.000000</altitude>
<stare_angle>75.000000</stare_angle>
<stare_distance>20.000000</stare_distance>
<move_speed_normal_walk>0.000000</move_speed_normal_walk>
<art_org_move_speed_normal_walk>0.000000</art_org_move_speed_normal_walk>
<move_speed_normal_run>0.000000</move_speed_normal_run>
<move_speed_combat_run>0.000000</move_speed_combat_run>
<art_org_speed_combat_run>0.000000</art_org_speed_combat_run>
<in_time>0.100000</in_time>
<out_time>0.500000</out_time>
<neck_angle>90.000000</neck_angle>
<spine_angle>10.000000</spine_angle>
<ammo_bone>Bip01 Head</ammo_bone>
<ammo_fx>skill_stoneshard.stoneshard.ammo</ammo_fx>
<ammo_speed>50</ammo_speed>
<pushed_range>0.000000</pushed_range>
<hpgauge_level>3</hpgauge_level>
<magical_skill_boost>0</magical_skill_boost>
<attack_delay>2000</attack_delay>
<ai_name>SummonSkillArea</ai_name>
<tribe>General</tribe>
<pet_ai_name>Pet</pet_ai_name>
<sensory_range>15.000000</sensory_range>
</npc_client>
</npc_clients>
example of file B.xml:
<?xml version="1.0" encoding="utf-16"?>
<strings>
<string>
<id>350000</id>
<name>STR_NPC_NO_NAME</name>
<body> </body>
</string>
</strings>
Here is an example of XML::Twig usage. The main advantage is that it does not hold the whole file in memory, so processing is much faster. The code below tries to emulate the operation of the script from the question.
use strict;
use warnings;
use XML::Twig;

my %strings = ();
XML::Twig->new(
    twig_handlers => {
        'strings/string' => sub {
            $strings{ lc $_->first_child('name')->text }
                = $_->first_child('id')->text;
        },
    }
)->parsefile('B.xml');

print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<npc_templates xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:noNamespaceSchemaLocation=\"npcs.xsd\">\n";
XML::Twig->new(
    twig_handlers => {
        'npc_client' => sub {
            my $nameid = eval { $strings{ lc $_->first_child('desc')->text } };
            # calculate rank as needed
            my $hpgauge_level = eval { $_->first_child('hpgauge_level')->text };
            my $rank = $hpgauge_level >= 28 ? 'ERROR'
                     : $hpgauge_level > 25  ? 'LEGENDARY'
                     : $hpgauge_level > 21  ? 'HERO'
                     : $hpgauge_level > 10  ? 'ELITE'
                     : $hpgauge_level > 0   ? 'NORMAL'
                     :                        $hpgauge_level;
            my $npc_id    = eval { $_->first_child('id')->text };
            my $name      = eval { $_->first_child('name')->text };
            my $tribe     = eval { $_->first_child('tribe')->text };
            my $scale     = eval { $_->first_child('scale')->text };
            my $race_type = eval { $_->first_child('race_type')->text };
            print qq|\t<npc_template npc_id="$npc_id" name="$name" name_id="$nameid" height="$scale" rank="$rank" tribe="$tribe" race="$race_type" hp_gauge="$hpgauge_level"/>\n|;
            $_->purge;
        }
    }
)->parsefile('A.xml');
print "</npc_templates>\n";
Grab all the interesting desc fields from file A and put them in a hash. You only have to parse it once, but if that still takes too long, have a look at XML::Twig.
Then parse file B once and extract the stuff you need, using the hash for the lookups.
Looks like you only need parts of the XML files. XML::Twig can parse only the elements you are interested in and throw away the rest using the twig_roots parameter. XML::Simple is easier to get started with, though.
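For illustration, a minimal sketch of the twig_roots idea against file B above (the hash name and handler body are mine, not from the question):
use strict;
use warnings;
use XML::Twig;

my %id_by_name;
XML::Twig->new(
    # only 'string' elements are built; everything else is discarded as it is parsed
    twig_roots => {
        'strings/string' => sub {
            my ($twig, $string) = @_;
            $id_by_name{ $string->first_child('name')->text }
                = $string->first_child('id')->text;
            $twig->purge; # free the chunk we just handled
        },
    },
)->parsefile('B.xml');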
Although I can't help you with the specifics of your Perl code, there are some general guidelines for dealing with large volumes of XML data. Broadly speaking, there are two kinds of XML APIs: DOM-based and stream-based. DOM-based APIs (like XML DOM) parse the entire XML document into memory before the user-level API becomes available, whereas a stream-based API (like SAX) doesn't need to parse the whole document at once. One benefit of stream-based parsers is that they typically use much less memory, since they don't hold the entire document in memory; that is obviously a good thing when dealing with large XML documents. Looking at the XML::Simple docs, it seems there may be SAX support available; have you tried this?
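For a rough illustration of the stream-based style in Perl, here is a hedged sketch using XML::SAX (the handler class and its bookkeeping are mine, and the element names follow file B above; a real handler would need more care with nesting and whitespace):
use strict;
use warnings;
use XML::SAX::ParserFactory;

package StringHandler;
use base 'XML::SAX::Base';

sub start_element {
    my ($self, $el) = @_;
    $self->{in} = $el->{Name}; # remember which element we are inside
}

sub characters {
    my ($self, $data) = @_;
    $self->{buf}{ $self->{in} } .= $data->{Data} if defined $self->{in};
}

sub end_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'string') {
        # one <string> record is complete; store name => id and reset
        $self->{id_by_name}{ $self->{buf}{name} } = $self->{buf}{id};
        $self->{buf} = {};
    }
    $self->{in} = undef;
}

package main;

my $handler = StringHandler->new;
XML::SAX::ParserFactory->parser(Handler => $handler)->parse_uri('B.xml');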
I'm not a Perl guy, so take this with a grain of salt, but I see two problems:
Iterating over all of the values in file B until you find the correct value, for each element in file A, is inefficient. Instead, you should use some sort of map/dictionary for the values in file B, as sketched below.
It looks like you are parsing both files into memory before you even start processing. File A would be better processed as a stream instead of loading the entire document into memory.
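A tiny sketch of the dictionary idea (toy data stands in for parsed file B; the name/id pair comes from the sample above):
use strict;
use warnings;

# build the lookup once from file B (toy data standing in for the parsed XML)
my %id_by_name = (
    'STR_NPC_NO_NAME' => 350000,
);

# each lookup while streaming file A is then a single hash access,
# instead of a scan over every <string> record
my $desc   = 'STR_NPC_NO_NAME';
my $nameid = exists $id_by_name{$desc} ? $id_by_name{$desc} : '';
print "$nameid\n"; # prints 350000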