Perl Out Of Memory - windows

I have a script that reads two csv files and compares them to find out if an ID that appears in one also appears in the other. The error I am receiving is as follows:
Out of memory during "large" request for 67112960 bytes, total sbrk() is 348203008 bytes
And now for the code:
use strict;
use File::Basename;
my $DAT = $ARGV[0];
my $OPT = $ARGV[1];
my $beg_doc = $ARGV[2];
my $end_doc = $ARGV[3];
my $doc_counter = 0;
my $page_counter = 0;
my %opt_beg_docs;
my %beg_docs;
my ($fname, $dir, $suffix) = fileparse($DAT, qr/\.[^.]*/);
my $outfile = $dir . $fname . "._IMGLOG";
open(OPT, "<$OPT");
while(<OPT>){
    my @OPT_Line = split(/,/, $_);
    $beg_docs{@OPT_Line[0]} = "Y" if(@OPT_Line[3] eq "Y");
    $opt_beg_docs{@OPT_Line[0]} = "Y";
}
close(OPT);
open(OUT, ">$outfile");
while((my $key, my $value) = each %opt_beg_docs){
    print OUT "$key\n";
}
close(OUT);
open(DAT, "<$DAT");
readline(DAT); #skips header line
while(<DAT>){
    $_ =~ s/\xFE//g;
    my @DAT_Line = split(/\x14/, $_);
    #gets the prefix and the range of the beg and end docs
    (my $pre = @DAT_Line[$beg_doc]) =~ s/[0-9]//g;
    (my $beg = @DAT_Line[$beg_doc]) =~ s/\D//g;
    (my $end = @DAT_Line[$end_doc]) =~ s/\D//g;
    #print OUT "BEGDOC: $beg ENDDOC: $end\n";
    foreach($beg .. $end){
        my $doc_id = $pre . $_;
        if($opt_beg_docs{$doc_id} ne "Y"){
            if($beg_docs{$doc_id} ne "Y"){
                print OUT "$doc_id,DOCUMENT NOT FOUND IN OPT FILE\n";
                $doc_counter++;
            } else {
                print OUT "$doc_id,PAGE NOT FOUND IN OPT FILE\n";
                $page_counter++;
            }
        }
    }
}
close(DAT);
close(OUT);
print "Found $page_counter missing pages and $doc_counter missing document(s)";
Basically, I get all the IDs from the file I am checking against. Then I loop over the other file and generate its IDs, because they are presented as ranges, and look each generated ID up in the hash of IDs.
Also forgot to note I am using Windows

You're not using use warnings;, you're not checking for errors on opening files, and you're not printing out debugging statements showing the lines that you are reading in.
Do you know what the input file looks like? If it has no line breaks, you are reading the entire file in all at once, which will be disastrous if it is large. Pay attention to how you are parsing the file.
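To make that advice concrete, here is a minimal sketch of the recommended boilerplate (strict/warnings, three-arg open with a lexical handle, and a die that reports the OS error). The file name and CSV layout are invented for the demo; the snippet writes its own tiny input so it is runnable as-is:

```perl
use strict;
use warnings;

# Scaffolding so this snippet runs on its own: write a tiny demo CSV.
my $opt_file = 'options_demo.csv';
open my $out, '>', $opt_file or die "Can't write $opt_file: $!";
print $out "DOC001,x,y,Y\nDOC002,x,y,N\n";
close $out;

# The pattern being recommended: three-arg open, lexical filehandle,
# and die with the OS error ($!) if the open fails.
open my $opt_fh, '<', $opt_file
    or die "Can't open $opt_file: $!";

my %opt_beg_docs;
while ( my $line = <$opt_fh> ) {
    chomp $line;
    # A debug print here (e.g. warn "line $.: [$line]\n";) shows exactly
    # what each record looks like, including unexpected long lines.
    my @fields = split /,/, $line;
    $opt_beg_docs{ $fields[0] } = 1;
}
close $opt_fh;
unlink $opt_file;
print scalar(keys %opt_beg_docs), " ids loaded\n";
```

The same pattern applies to the DAT file: checking every open means a bad path fails loudly instead of silently producing empty hashes.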

I'm not sure if it's the cause of your error, but inside your loop where you're reading DAT, you probably want to replace this:
(my $pre = @DAT_Line[$beg_doc]) =~ s/[0-9]//g;
with this:
(my $pre = $DAT_Line[$beg_doc]) =~ s/[0-9]//g;
and same for the other two lines there.

You're closing your OUT filehandle and then trying to print to it inside the DAT loop. Printing to a closed filehandle silently fails (with use warnings enabled it would at least complain), so I'm surprised you didn't notice an error.
Remove the first close(OUT); and see if that improves.
I still don't know what your question is. If it's about the error message: it means you've run out of memory, i.e. you're trying to consume too much of it. If it's why you're consuming too much memory, I'd first ask whether you read my message above, then I'd ask how much memory your system has, and then I'd see whether it improves if you take the regex away.


How do I add new lines after deleting a large amount of text in Perl windows?

I'm trying to remove a large amount of text from a file before inserting a few new lines. I can delete everything after the word 'CParticleSystemDefinition' with a single line of code like this
perl -0777 -pi -we "s/CParticleSystemDefinition\x22\K.*/\n}/s" "D:\Steam\steamapps\common\dota 2 beta\content\dota_addons\custom\particles\generic_gameplay\winter_effects_creep.vpcf"
But when I try to change the code slightly so that it adds a few new lines like this, it doesn't work
perl -0777 -pi -we "s/CParticleSystemDefinition\x22\K.*/\n m_Children = \n [\n {\n m_ChildRef = resource:\x22particles/generic_gameplay/winter_effects_breath.vpcf\x22\n },\n ]\n}/s" "D:\Steam\steamapps\common\dota 2 beta\content\dota_addons\custom\particles\generic_gameplay\winter_effects_creep.vpcf"
So, basically, what I want to do is make this file
{
_class = "CParticleSystemDefinition"
m_bShouldHitboxesFallbackToRenderBounds = false
m_nMaxParticles = 24
m_flConstantRadius = 15.000000
m_flConstantLifespan = 0.500000
m_ConstantColor =
[
212,
170,
145,
255,
]
m_bShouldSort = false
m_Renderers =
[
{
_class = "C_OP_RenderSprites"
m_nSequenceCombineMode = "SEQUENCE_COMBINE_MODE_USE_SEQUENCE_0"
m_bMod2X = true
m_nOrientationType = 3
m_hTexture = resource:"materials/particle/footprints/footprints_generic.vtex"
m_flAnimationRate = 1.000000
},
]
m_Emitters =
[
{
_class = "C_OP_ContinuousEmitter"
m_flEmitRate = 10.000000
m_flStartTime = 0.500000
m_nScaleControlPoint = 5
},
]
}
look like this
{
_class = "CParticleSystemDefinition"
m_Children =
[
{
m_ChildRef = resource:"particles/generic_gameplay/winter_effects_breath.vpcf"
},
]
}
Do it in two steps -- clear the rest of the file after that phrase, then add the desired text
perl -0777 -i.bak -wpe"s{Definition\x22\K.*}{}s; $_ .= qq(\n\tm_Children...)" file
where I've used ellipses to indicate the rest, for clarity. I added .bak to keep a backup file, until this is tested well enough.
Adding a string in the replacement part is fine as well, of course -- I don't readily see what fails in your code (or how). Breaking it up into two steps simply makes it easier to review and organize, but one can also run code in the replacement part, using the /e modifier
perl -0777 -i.bak -wpe"
s{Definition\x22\K.*}{
# any valid Perl code, what it evaluates to is used as replacement
qq(\n\tm_Children...)
}es;
" file
If you don't want tabs -- which may or may not get expanded, depending on various settings and on what's done with the file -- you can prepare and use a string of spaces instead. Then we might as well build the replacement more systematically
perl -0777 -i.bak -wpe"
s{Definition\x22\K.*}{}s;
$s4 = q( ) x 4; # four spaces
$_ .= qq(\n${s4}m_Children =\n$s4) . join qq(\n$s4),
q([),
q({),
qq($s4).q(m_ChildRef = ...), # etc
qq(\n)
" file
Now one can either make this into a better system (adding a suitable programming construct for each new level of indentation for example, like map over such lines so to add indentation to all in one statement), if there is a lot -- or condense it if there's really just a few lines.
Again, this can run inside the regex's replacement side, with the additional /e modifier.
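A tiny self-contained illustration of the /e idea (the file content here is an invented stand-in, and the replacement text is shortened to a bare m_Children block):

```perl
use strict;
use warnings;

# Invented stand-in for the .vpcf contents:
my $text = qq({\n\t_class = "CParticleSystemDefinition"\n\tm_nMaxParticles = 24\n});

# With /e the replacement side is Perl code; whatever the code
# evaluates to becomes the replacement string.
my $tail = "\n\tm_Children =\n\t[\n\t]\n}";
$text =~ s{Definition\x22\K.*}{ $tail }es;

print $text;
```

Everything after the closing quote of "CParticleSystemDefinition" is replaced in one pass, exactly as in the two-step version above.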
This can be done line-by-line in one pass as well, using the read-write (+<) mode for open
perl -MPath::Tiny -wE"
$f = shift // die qq(Need a filename);
open $fh, qq(+<), $f or die qq(Cant open $f: $!);
while (<$fh>) { last if /Definition\x22$/ }; # past the spot to keep
truncate $fh, tell($fh); # remove the rest
say qq(File now:\n), path($f)->slurp; # (just to see it now)
say $fh $_ for # add new content
qq(new line 1),
qq(new line 2)
" filename
(Carefully with read-write modes. Please read the docs with care first.)

Downloading multiple fasta files from ncbi

I'm trying to download all fasta files associated with one organism from ncbi.
I tried wget -r -l3 -A "*.fna.gz" ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Microcystis_aeruginosa/ to get all files ending in .fna.gz from the third level down, but then it just rejects everything with the following output:
Removed “ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Microcystis_aeruginosa/latest_assembly_versions/.listing”.
Rejecting “GCF_000010625.1_ASM1062v1”.
Rejecting “GCF_000307995.1_ASM30799v2”.
Rejecting “GCF_000312165.1_ASM31216v1”.
Rejecting “GCF_000312185.1_ASM31218v1”.
Rejecting “GCF_000312205.1_ASM31220v1”.
Rejecting “GCF_000312225.1_ASM31222v1”.
Rejecting “GCF_000312245.1_ASM31224v1”.
Rejecting “GCF_000312265.1_ASM31226v1”.
Rejecting “GCF_000312285.1_ASM31228v1”.
Rejecting “GCF_000312725.1_ASM31272v1”.
Rejecting “GCF_000330925.1_MicAerT1.0”.
Rejecting “GCF_000332585.1_MicAerD1.0”.
Rejecting “GCF_000412595.1_spc777-v1”.
Rejecting “GCF_000599945.1_Mic70051.0”.
Rejecting “GCF_000787675.1_ASM78767v1”.
Rejecting “GCF_000981785.1_ASM98178v1”.
Any ideas on why it's rejecting these directories? Thanks for your help.
Not exactly sure why it's rejecting your request, but when I was still doing this type of thing, I found that if I didn't download queries in smaller batches, the NCBI server timed me out and blocked my IP for a while before I could download again. This doesn't seem to be the same problem that you're seeing, but maybe this script might get the same thing done. Let me know if this helps.
#!/usr/bin/env python
from Bio import Entrez
search_term = raw_input("Organism name: ")
Entrez.email = "your_email@isp.com" # required by NCBI
search_handle = Entrez.esearch(db="nucleotide", term=search_term, usehistory="y")
search_results = Entrez.read(search_handle)
search_handle.close()
gi_list = search_results["IdList"]
count = int(search_results["Count"])
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
batch_size = 5 # download sequences in batches so NCBI doesn't time you out
with open("ALL_SEQ.fasta", "w") as out_handle:
    for start in range(0, count, batch_size):
        end = min(count, start+batch_size)
        print "Going to download record %i to %i" % (start+1, end)
        fetch_handle = Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text", retstart=start, retmax=batch_size, webenv=webenv, query_key=query_key)
        data = fetch_handle.read()
        fetch_handle.close()
        out_handle.write(data)
print ("\nDownload completed")
I found a perl script that gets me close to accomplishing this task from here. Unfortunately, this script is just returning the IDs of the genomes, and not the actual sequences.
For example, the head of my output is:
gi|425458296|ref|NZ_CAIN00000000.1|NZ_CAIN01000000 Microcystis aeruginosa PCC 9808, whole genome shotgun sequencing project
gi|425448636|ref|NZ_CAIK00000000.1|NZ_CAIK01000000 Microcystis aeruginosa PCC 7941, whole genome shotgun sequencing project
Any perl users know what's going on?
use strict;
use LWP::Simple;
my ($name, $outname, $url, $xml, $out, $count, $query_key, $webenv, $ids);
my @genomeId;
my $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $limit = 'wgs[prop]+AND+srcdb+refseq[prop])';
my @species = ('Microcystis aeruginosa');
foreach my $s (@species) {
    undef @genomeId;
    $query_key = $webenv = '';
    $s =~ s/ /+/g;
    # ESearch
    $url = $base . "esearch.fcgi?db=genome&term=$s";
    $xml = get($url);
    $count = $1 if ($xml =~ /<Count>(\d+)<\/Count>/);
    if ($count > 30) {
        $url = $base . "esearch.fcgi?db=genome&term=$s&retmax=$count";
        $xml = get($url);
    }
    while ($xml =~ /<Id>(\d+?)<\/Id>/gs) {
        push(@genomeId, $1);
    }
    $ids = join(',', @genomeId);
    # ELink
    $url = $base . "elink.fcgi?dbfrom=genome&db=nuccore&cmd=neighbor_history&id=$ids&term=$limit";
    $xml = get($url);
    $query_key = $1 if ($xml =~ /<QueryKey>(\d+)<\/QueryKey>/);
    $webenv = $1 if ($xml =~ /<WebEnv>(\S+)<\/WebEnv>/);
    # EFetch
    $url = $base . "efetch.fcgi?db=nuccore&query_key=$query_key&WebEnv=$webenv&rettype=fasta&retmode=text";
    $out = get($url);
    open (OUT, ">$s.fna");
    close OUT;
}

Parsing string into ARGV equivalent (Windows and Perl)

Edit - Answer posted below
I have a script that usually uses #ARGV arguments but in some cases it is invoked by another script (which I cannot modify) that instead only passes a config filename which among other things has the command line options that should have been passed directly.
Example:
Args=--test --pdf "C:\testing\my pdf files\test.pdf"
If possible I'd like a way to parse this string into an array that would be identical to @ARGV.
I have a workaround where I set up an external perl script that just echoes @ARGV, and I invoke that script like below (standard boilerplate removed).
echo-args.pl
print join ("\n", @ARGV);
test-echo-args.pl
$my_args = '--test --pdf "C:\testing\my pdf files\test.pdf"';
@args = map { chomp ; $_ } `perl echo-args.pl $my_args`;
This seems inelegant but it works. Is there a better way without invoking a new process? I did try splitting and processing but there are some oddities on the command line e.g. -a"b c" becomes '-ab c' and -a"b"" becomes -ab" and I'd rather not worry about edge cases but I know that'll bite me one day if I don't.
Answer - thanks ikegami!
I've posted a working program below that uses Win32::API and CommandLineToArgvW from shell32.dll based on ikegami's advice. It is intentionally verbose in the hopes that it'll be more easy to follow for anyone like myself who is extremely rusty with C and pointer arithmetic.
Any tips are welcome, apart from the obvious simplifications :)
use strict;
use warnings;
use Encode qw( encode decode );
use Win32::API qw( );
use Data::Dumper;
# create a test argument string, with some variations, and pack it
# apparently an empty string returns $^X which is documented so check before calling
my $arg_string = '--test 33 -3-t" "es 33\t2 ';
my $packed_arg_string = encode('UTF-16le', $arg_string."\0");
# create a packed integer buffer for output
my $packed_argc_buf_ptr = pack('L', 0);
# create then call the function and get the result
my $func = Win32::API->new('shell32.dll', 'CommandLineToArgvW', 'PP', 'N')
or die $^E;
my $ret = $func->Call($packed_arg_string, $packed_argc_buf_ptr);
# unpack to get the number of parsed arguments
my $argc = unpack('L', $packed_argc_buf_ptr);
print "We parsed $argc arguments\n";
# parse the return value to get the actual strings
my @argv = decode_LPWSTR_array($ret, $argc);
print Dumper \@argv;
# try not to leak memory
my $local_free = Win32::API->new('kernel32.dll', 'LocalFree', 'N', '')
or die $^E;
$local_free->Call($ret);
exit;
sub decode_LPWSTR_array {
    my ($ptr, $num) = @_;
    return undef if !$ptr;
    # $ptr is the memory location of the array of strings (i.e. more pointers)
    # $num is how many we need to get
    my @strings = ();
    for (1 .. $num) {
        # convert $ptr to a long, using that location read 4 bytes - this is the pointer to the next string
        my $string_location = unpack('P4', pack('L', $ptr));
        # make it human readable
        my $readable_string_location = unpack('L', $string_location);
        # decode the string and save it for later
        push(@strings, decode_LPCWSTR($readable_string_location));
        # our pointers are 32-bit
        $ptr += 4;
    }
    return @strings;
}
# Copied from http://stackoverflow.com/questions/5529928/perl-win32api-and-pointers
sub decode_LPCWSTR {
    my ($ptr) = @_;
    return undef if !$ptr;
    my $sW = '';
    for (;;) {
        my $chW = unpack('P2', pack('L', $ptr));
        last if $chW eq "\0\0";
        $sW .= $chW;
        $ptr += 2;
    }
    return decode('UTF-16le', $sW);
}
On unix systems, it's the shell that parses the command line into strings. But on Windows, it's up to each application. I think this is normally done using the CommandLineToArgvW system call (which you could call with the help of Win32::API), but the spec is documented here if you want to reimplement it yourself.
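If you do want a pure-Perl fallback rather than calling the API, here is a deliberately simplified tokenizer: it handles whitespace-separated tokens and double-quoted spans (including the a"b c" -> 'ab c' case noted above), but it does NOT implement the full backslash-before-quote rules of CommandLineToArgvW, so treat it as a sketch only:

```perl
use strict;
use warnings;

# Simplified splitter: bare tokens and "double quoted" spans only.
# Does NOT implement CommandLineToArgvW's backslash-run rules.
sub parse_cmdline {
    my ($line) = @_;
    my @args;
    while ( $line =~ /\G\s*(?=\S)/gc ) {     # skip whitespace before each argument
        my $arg = '';
        # One argument may mix quoted and unquoted pieces: a"b c" -> 'ab c'
        while ( $line =~ /\G(?:([^\s"]+)|"([^"]*)"?)/gc ) {
            $arg .= defined $1 ? $1 : $2;
        }
        push @args, $arg;
    }
    return @args;
}

my @argv = parse_cmdline(q{--test --pdf "C:\testing\my pdf files\test.pdf"});
print join("|", @argv), "\n";
```

For anything beyond simple cases, the Win32::API approach above (or reimplementing the documented spec exactly) is the safer route.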

What's the ideal order of checks for early bailout when processing files with Perl?

Parsing a directory tree with hundreds of thousands of files, looking for valid (non-empty, readable) log files. What is the most efficient order of tests for an early bail?
Here's an example I use as a File::Find preprocessor stage and, being new to Perl, I wonder which tests are slowest / redundant / inefficiently ordered?
sub filter {
    my $nicename = substr( $File::Find::dir, $_pathLength );
    my @clean;
    my $filecount = my $dircount = 0;
    foreach (@_) {
        next unless -R $_;          # readable
        next unless -f _ || -d _;   # file or dir.
        next if ( $_ =~ m/^\./ );   # ignore files/folders starting with a period
        if ( -f _ ) {               # regular file
            next unless ( my $size = -s _ );   # does it have a size?
            next unless ( $_ =~ m/([^.]+)$/ )[0] eq $_log_file_ext;   # correct file extension?
            next if exists( $_previousRun{ $_ . " ($size)" } );   # don't add files we've already processed
            $filecount++;
        } elsif ( -d _ ) {          # dir
            $dircount++;
        }
        push( @clean, $_ );
    }
    $_fileCount += $filecount;
    $_dirCount += $dircount;
    Utils::logit("'$nicename' contains $filecount new files and $dircount folders to explore.");
    return @clean;
}
Any info you can provide on Perls internals and behaviours would be useful to me.
At the very end I run some specific checks for "regular file" and "directory". Are there other things I should check for and avoid adding to my clean list?
As a rough rule of thumb, 'going to disk' is the most expensive thing you'll be doing.
So when trying to optimise IO based:
First, discard anything you can based on name/location. (e.g. 'does filename contain a .')
Then discard based on file attributes - coalesce if you can into a single stat call, because then you're making a single IO.
And then do anything else.
I'm at least fairly sure that your -s, -d, -f, etc. will each trigger a stat() operation (which will probably get cached, so it doesn't hurt that much). But you do also test -f and -d twice -- once for the next unless and again for the if.
But you might find you can do a single stat and get most of the metadata you're interested in:
http://perldoc.perl.org/functions/stat.html
In the grand scheme of things though - I wouldn't worry about it too much. Your limiting factor will be disk IO, and the odd additional stat or regular expressions won't make much difference to the overall speed.
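A small sketch of the "single stat" idea: after one explicit stat, the bare `_` filehandle reuses the cached buffer for later filetests, so -f/-d/-s cost no further system calls (the demo file name is invented, and the snippet creates it so it is runnable):

```perl
use strict;
use warnings;

# Make a small file so the demo is self-contained:
my $path = 'stat_demo.txt';
open my $fh, '>', $path or die "write $path: $!";
print $fh "hello\n";
close $fh;

# One stat() hits the disk; -f/-d/-s on `_` then reuse its buffer.
stat $path or die "stat $path: $!";
my $is_file = -f _;   # from the cached stat buffer
my $is_dir  = -d _;   # likewise
my $size    = -s _;   # likewise
printf "file=%d dir=%d size=%d\n", $is_file ? 1 : 0, $is_dir ? 1 : 0, $size;
unlink $path;
```

In the filter above, grouping all the `_` tests after a single -R (or an explicit stat) keeps it to one disk hit per entry.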

Perl, cmd, $ARGV[0], slow

[Strawberry Perl v5.16.3, Windows 7 x64, executing via cmd, eg c:\strawberry> perl test.pl 100000]
SYMPTOM: The following code: foreach (1..$ARGV[0]) { foo($_); }, executes roughly 20% slower than if I had included this extra line, before it: my $num = $ARGV[0];
QUESTION: Can anyone help me understand why?
Notice, in the second case, that after I initialize and set $num, I do not then use $num in the loop parameters. Were that the case, I could probably be convinced that repeatedly testing against $ARGV[0] in a for loop is somehow slower than against a variable I define myself... but this is not the case.
To track time, I use: use Time::HiRes; my $time = [Time::HiRes::gettimeofday()]; at the top of my script, and: print "\n1: ", Time::HiRes::tv_interval($time); at the bottom.
Confused!
Thanks,
Michael
EDIT
I am including the entire script, with a comment preceding the offending line... Interestingly, it looks like the time discrepancy is at least partially dependent on my redundant initialization of %h, as well as @chain... This is getting weird.
use Time::HiRes; my $time = [Time::HiRes::gettimeofday()];
#my $max=$ARGV[0];
my %h = (1=>1,89=>89);
$h{1}=1;
$h{89}=89;
my @chain=();
my $ans=0;
sub sum{my $o=0; foreach (@_){$o+=$_}; return $o;}
foreach (1..$ARGV[0]-1){
    my $x=$_;
    my @chain = ();
    while(!exists($h{$x})){
        push(@chain,$x);
        $x = sum(map {$_**2} split('',$x));
    }
    foreach (@chain){$h{$_}=$h{$x} if !exists($h{$_});}
}
print "\n1: ", Time::HiRes::tv_interval($time);
foreach (1..$ARGV[0]){$ans++ if ($h{$_}==89);}
print "\n2: ", Time::HiRes::tv_interval($time);
On my system (perl 5.16.3 on GNU/Linux) there is no measurable difference. The standard deviation of the timings is larger than the difference between measurements of different versions.
For each variant of the script, 10 executions were performed. The $ARGV[0] was 3.5E5 in all cases (350000).
Without my $num = $ARGV[0]:
$ perl measure.pl
2.369921 2.38991 2.380969 4.419895 2.398861 2.420928 2.388721 2.368144 2.387212 2.386347
mean: 2.5910908
sigma: 0.609763793801797
With my $num = $ARGV[0]:
$ perl measure.pl
4.435764 2.419485 2.403696 2.401771 2.411345 2.466776 4.408127 2.416889 2.389191 2.397409
mean: 2.8150453
sigma: 0.803721101668365
The measure.pl script:
use strict; use warnings; use 5.016;
use List::Util 'sum';
my @times = map qx/perl your-algorithm.pl 3.5E5/, 1..10;
chomp @times;
say "@times";
say "mean: ", mean(@times);
say "sigma: ", sigma(@times);
sub mean { sum(@_)/@_ }
sub sigma {
    my $mean = mean(@_);
    my $variance = sum(map { ($_-$mean)**2 } @_) / @_;
    sqrt $variance;
}
With your-algorithm.pl being reduced so that only one timing is printed:
foreach (1..$ARGV[0]){$ans++ if ($h{$_}==89);}
print Time::HiRes::tv_interval($time), "\n";
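If you want to quantify the gap on your own machine, the core Benchmark module can compare the two variants directly. This sketch is self-contained (the iteration counts are invented; on Windows, run it the same way you run your script); it only measures the loop itself, not your whole algorithm:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

@ARGV = (100_000) unless @ARGV;   # stand-in for the command-line argument

# Compare looping against $ARGV[0] directly vs. copying it to a lexical.
my $results = cmpthese(20, {
    direct => sub {
        my $n = 0;
        $n++ for 1 .. $ARGV[0];
        $n;
    },
    copied => sub {
        my $num = $ARGV[0];       # copy once, as in the faster variant
        my $n = 0;
        $n++ for 1 .. $num;
        $n;
    },
});
```

cmpthese prints a rate table; if the two rows are within noise of each other, the $ARGV[0] access is not your bottleneck.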
