What's the ideal order of checks for early bailout when processing files with Perl? - performance

Parsing a directory tree with hundreds of thousands of files, looking for valid (non-empty, readable) log files. What is the most efficient order of tests for an early bail?
Here's an example I use as a File::Find preprocess stage and, being new to Perl, I wonder which tests are slowest, redundant, or inefficiently ordered?
sub filter {
    my $nicename = substr( $File::Find::dir, $_pathLength );
    my @clean;
    my $filecount = my $dircount = 0;
    foreach (@_) {
        next unless -R $_;        # readable
        next unless -f _ || -d _; # file or dir.
        next if ( $_ =~ m/^\./ ); # ignore files/folders starting with a period
        if ( -f _ ) {             # regular file
            next unless ( my $size = -s _ );                         # does it have a size?
            next unless ( $_ =~ m/([^.]+)$/ )[0] eq $_log_file_ext;  # correct file extension?
            next if exists( $_previousRun{ $_ . " ($size)" } );      # don't add files we've already processed
            $filecount++;
        } elsif ( -d _ ) {        # dir
            $dircount++;
        }
        push( @clean, $_ );
    }
    $_fileCount += $filecount;
    $_dirCount += $dircount;
    Utils::logit("'$nicename' contains $filecount new files and $dircount folders to explore.");
    return @clean;
}
Any info you can provide on Perl's internals and behaviours would be useful to me.
At the very end I run some specific checks for "regular file" and "directory". Are there other things I should check for and avoid adding to my clean list?

As a rough rule of thumb, 'going to disk' is the most expensive thing you'll be doing.
So when trying to optimise IO-bound work:
First, discard anything you can based on name/location. (e.g. 'does filename contain a .')
Then discard based on file attributes - coalesce if you can into a single stat call, because then you're making a single IO.
And then do anything else.
I'm at least fairly sure that your -R $_ test will be triggering a stat() operation each time; the -s, -d and -f checks against the bare _ filehandle reuse the buffer from that stat, so they shouldn't cost extra IO. But you do also test -f and -d twice - once to do the next unless and again to do the if/elsif.
But you might find you can do a single stat and get most of the metadata you're interested in:
http://perldoc.perl.org/functions/stat.html
In the grand scheme of things though, I wouldn't worry about it too much. Your limiting factor will be disk IO, and the odd additional stat or regular expression won't make much difference to the overall speed.
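As a rough illustration of that ordering only (a sketch, not the poster's final code): name-only checks first, then a single explicit stat whose cached result feeds the remaining -X tests via the special _ filehandle. The extension test and the %_previousRun key format are carried over from the question; everything else here is an assumption.
sub filter {
    my @clean;
    foreach my $name (@_) {
        # 1. Cheapest first: reject on the name alone, no IO at all.
        next if $name =~ m/^\./;
        # 2. One stat() per entry; the -X tests below reuse its cached buffer.
        stat($name) or next;
        next unless -R _;                                               # readable
        if ( -f _ ) {                                                   # regular file
            my $size = -s _ or next;                                    # non-empty
            next unless ( $name =~ m/([^.]+)$/ )[0] eq $_log_file_ext;  # extension
            next if exists $_previousRun{ "$name ($size)" };            # already processed
        }
        elsif ( !-d _ ) {
            next;                                                       # neither file nor dir
        }
        push @clean, $name;
    }
    return @clean;
}
The counting and logging from the original would slot back in around the push.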

Related

How do I add new lines after deleting a large amount of text in Perl on Windows?

I'm trying to remove a large amount of text from a file before inserting a few new lines. I can delete everything after the word 'CParticleSystemDefinition' with a single line of code like this
perl -0777 -pi -we "s/CParticleSystemDefinition\x22\K.*/\n}/s" "D:\Steam\steamapps\common\dota 2 beta\content\dota_addons\custom\particles\generic_gameplay\winter_effects_creep.vpcf"
But when I try to change the code slightly so that it adds a few new lines like this, it doesn't work
perl -0777 -pi -we "s/CParticleSystemDefinition\x22\K.*/\n m_Children = \n [\n {\n m_ChildRef = resource:\x22particles/generic_gameplay/winter_effects_breath.vpcf\x22\n },\n ]\n}/s" "D:\Steam\steamapps\common\dota 2 beta\content\dota_addons\custom\particles\generic_gameplay\winter_effects_creep.vpcf"
So, basically, what I want to do is make this file
{
_class = "CParticleSystemDefinition"
m_bShouldHitboxesFallbackToRenderBounds = false
m_nMaxParticles = 24
m_flConstantRadius = 15.000000
m_flConstantLifespan = 0.500000
m_ConstantColor =
[
212,
170,
145,
255,
]
m_bShouldSort = false
m_Renderers =
[
{
_class = "C_OP_RenderSprites"
m_nSequenceCombineMode = "SEQUENCE_COMBINE_MODE_USE_SEQUENCE_0"
m_bMod2X = true
m_nOrientationType = 3
m_hTexture = resource:"materials/particle/footprints/footprints_generic.vtex"
m_flAnimationRate = 1.000000
},
]
m_Emitters =
[
{
_class = "C_OP_ContinuousEmitter"
m_flEmitRate = 10.000000
m_flStartTime = 0.500000
m_nScaleControlPoint = 5
},
]
}
look like this
{
_class = "CParticleSystemDefinition"
m_Children =
[
{
m_ChildRef = resource:"particles/generic_gameplay/winter_effects_breath.vpcf"
},
]
}
Do it in two steps -- clear the rest of the file after that phrase, then add the desired text
perl -0777 -i.bak -wpe"s{Definition\x22\K.*}{}s; $_ .= qq(\n\tm_Children...)" file
where I've used ellipses to indicate the rest, for clarity. I added .bak to keep a backup file, until this is tested well enough.
Adding a string in the replacement part is fine as well, of course -- I don't readily see what fails (and how) in your code. Breaking it up into two steps simply makes it easier to review and organize, but one can also run code in the replacement part, using the /e modifier
perl -0777 -i.bak -wpe"
    s{Definition\x22\K.*}{
        # any valid Perl code; what it evaluates to is used as the replacement
        qq(\n\tm_Children...)
    }es;
" file
If you don't want tabs, which may or may not get expanded depending on various settings and on what's done with this, you can prepare and use a string of spaces instead. Then we might as well build the replacement more systematically
perl -0777 -i.bak -wpe"
    s{Definition\x22\K.*}{}s;
    $s4 = q( ) x 4;   # four spaces
    $_ .= qq(\n${s4}m_Children =\n$s4) . join qq(\n$s4),
        q([),
        q({),
        qq($s4).q(m_ChildRef = ...),   # etc
        qq(\n)
" file
Now one can either make this into a better system if there is a lot of content - adding a suitable construct for each new level of indentation, for example a map over the lines so that indentation is added to all of them in one statement (see the sketch below) - or condense it if there are really just a few lines.
Again, this can run inside the regex's replacement side, with the additional /e modifier.
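For instance, a minimal sketch of that map idea, with placeholder content in @body (the real child lines and closing braces would go there):
perl -0777 -i.bak -wpe"
    s{Definition\x22\K.*}{}s;
    $s4 = q( ) x 4;
    @body = ( q([), q({), $s4 . q(m_ChildRef = ...), q(}), q(]) );
    $_ .= qq(\n${s4}m_Children =\n)
        . join( qq(\n), map { $s4 . $_ } @body )   # indent every line in one statement
        . qq(\n}\n);
" file
Each element of @body gets its indentation from the map, so adding another level is another map (or a pre-indented element) rather than hand-pasted spaces.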
This can be done line-by-line in one pass as well, using the read-write (+<) mode for open
perl -MPath::Tiny -wE"
$f = shift // die qq(Need a filename);
open $fh, qq(+<), $f or die qq(Cant open $f: $!);
while (<$fh>) { last if /Definition\x22$/ }; # past the spot to keep
truncate $fh, tell($fh); # remove the rest
say qq(File now:\n), path($f)->slurp; # (just to see it now)
say $fh $_ for # add new content
qq(new line 1),
qq(new line 2)
" filename
(Carefully with read-write modes. Please read the docs with care first.)

Implemented merge sort for large CSV files in Python3, slow performance

So I decided to implement merge sort in Python 3 to handle large CSV files (working with 5 GB files >.<). I think I have the logic down correctly; the problem is that it's quite slow. I'm just wondering if you have any suggestions on how to alter my code for faster performance?
Thanks and please bear with my code, I'm still new to Python ^^
Here's the main piece of the merge sort code, note that this is after breaking the file into chunks and sorting each chunk:
def merge_sort():
    files_to_merge = os.listdir(temp_folder)
    files_left = len(files_to_merge)
    print("Merging {} files...".format(files_left))
    temp_file_count = files_left + 1
    while files_left != 1:
        first_file = temp_folder + files_to_merge[0]
        print(first_file)
        second_file = temp_folder + files_to_merge[1]
        print(second_file)
        # Process both files.
        with open(first_file, 'r', encoding='utf-8') as file_1:
            with open(second_file, 'r', encoding='utf-8') as file_2:
                # Setup
                temp_file = temp_folder + "tempFile - {:03}.csv".format(temp_file_count)
                file1_line, file2_line = file_1.readline(), file_2.readline()
                compare_values_list = [file1_line.split(','), file2_line.split(',')]
                print("Writing to >> {}...".format(temp_file))
                # Keep going until all values have been read from both files.
                with open(temp_file, 'a', encoding='utf-8') as m_file:
                    while len(compare_values_list) != 0 or (file1_line != '' or file2_line != ''):
                        # Grab the highest value from the list, write to a file, and delete it.
                        compare_values_list.sort(key=sorter)  # sorter = operator.itemgetter(sort_key)
                        line_to_write = ','.join(compare_values_list[0])
                        del compare_values_list[0]
                        m_file.write(line_to_write)
                        # Get the next values from the file and check whether to add to the list.
                        file1_line, file2_line = file_1.readline(), file_2.readline()
                        if file1_line != '' and file2_line != '':
                            compare_values_list.append(file1_line.split(','))
                            compare_values_list.append(file2_line.split(','))
                        elif file1_line != '' and file2_line == '':
                            compare_values_list.append(file1_line.split(','))
                        elif file1_line == '' and file2_line != '':
                            compare_values_list.append(file2_line.split(','))
        # Clean up files and update values.
        os.remove(first_file)
        os.remove(second_file)
        temp_file_count += 1
        files_to_merge = os.listdir(temp_folder)
        files_left = len(files_to_merge)
    print("Finish merging files.")
There are two slow parts that jump out.
The first is that your script opens the temp file every time it writes something. Move these lines outside the nested while loop:
with open(temp_file, 'a', encoding='utf-8') as m_file:
m_file.write(line_to_write)
You might also consider saving the data to a variable in memory, but I'm not sure how good an idea that is if the file is large.
Second is your use of compare_values_list. You are frequently appending and deleting, which requires a lot of work to reallocate space in memory. You're also recreating the list from scratch very often. First, try avoiding copying the list on each iteration and sort it in place:
compare_values_list.sort(key=sorter)
should help you avoid that. If you want to try to make it faster, preallocate the list and manage its size. Something like:
compare_values_list_capacity = 1000
compare_values_list_size = 0
compare_values_list = [None]*compare_values_list_capacity
though I am hazy on the details of mixing these two solutions - I'm not sure this will work with sorting in place, so it's worth trying both and seeing which works.

Parsing string into ARGV equivalent (Windows and Perl)

Edit - Answer posted below
I have a script that usually takes @ARGV arguments, but in some cases it is invoked by another script (which I cannot modify) that instead passes only a config filename which, among other things, contains the command-line options that should have been passed directly.
Example:
Args=--test --pdf "C:\testing\my pdf files\test.pdf"
If possible I'd like a way to parse this string into an array that would be identical to @ARGV.
I have a workaround where I set up an external Perl script that just echoes @ARGV, and I invoke this script like below (standard boilerplate removed).
echo-args.pl
print join ("\n", @ARGV);
test-echo-args.pl
$my_args = '--test --pdf "C:\testing\my pdf files\test.pdf"';
@args = map { chomp ; $_ } `perl echo-args.pl $my_args`;
This seems inelegant but it works. Is there a better way without invoking a new process? I did try splitting and processing, but there are some oddities on the command line, e.g. -a"b c" becomes '-ab c' and -a"b"" becomes -ab", and I'd rather not worry about edge cases, though I know that'll bite me one day if I don't.
Answer - thanks ikegami!
I've posted a working program below that uses Win32::API and CommandLineToArgvW from shell32.dll, based on ikegami's advice. It is intentionally verbose in the hope that it'll be easier to follow for anyone like myself who is extremely rusty with C and pointer arithmetic.
Any tips are welcome, apart from the obvious simplifications :)
use strict;
use warnings;
use Encode qw( encode decode );
use Win32::API qw( );
use Data::Dumper;
# create a test argument string, with some variations, and pack it
# apparently an empty string returns $^X which is documented so check before calling
my $arg_string = '--test 33 -3-t" "es 33\t2 ';
my $packed_arg_string = encode('UTF-16le', $arg_string."\0");
# create a packed integer buffer for output
my $packed_argc_buf_ptr = pack('L', 0);
# create then call the function and get the result
my $func = Win32::API->new('shell32.dll', 'CommandLineToArgvW', 'PP', 'N')
or die $^E;
my $ret = $func->Call($packed_arg_string, $packed_argc_buf_ptr);
# unpack to get the number of parsed arguments
my $argc = unpack('L', $packed_argc_buf_ptr);
print "We parsed $argc arguments\n";
# parse the return value to get the actual strings
my @argv = decode_LPWSTR_array($ret, $argc);
print Dumper \@argv;
# try not to leak memory
my $local_free = Win32::API->new('kernel32.dll', 'LocalFree', 'N', '')
or die $^E;
$local_free->Call($ret);
exit;
sub decode_LPWSTR_array {
my ($ptr, $num) = @_;
return undef if !$ptr;
# $ptr is the memory location of the array of strings (i.e. more pointers)
# $num is how many we need to get
my @strings = ();
for (1 .. $num) {
# convert $ptr to a long, using that location read 4 bytes - this is the pointer to the next string
my $string_location = unpack('P4', pack('L', $ptr));
# make it human readable
my $readable_string_location = unpack('L', $string_location);
# decode the string and save it for later
push(@strings, decode_LPCWSTR($readable_string_location));
# our pointers are 32-bit
$ptr += 4;
}
return @strings;
}
# Copied from http://stackoverflow.com/questions/5529928/perl-win32api-and-pointers
sub decode_LPCWSTR {
my ($ptr) = @_;
return undef if !$ptr;
my $sW = '';
for (;;) {
my $chW = unpack('P2', pack('L', $ptr));
last if $chW eq "\0\0";
$sW .= $chW;
$ptr += 2;
}
return decode('UTF-16le', $sW);
}
On Unix systems, it's the shell that parses the command line into strings. But on Windows, it's up to each application. I think this is normally done using the CommandLineToArgvW system call (which you can call with the help of Win32::API), but the parsing rules are documented if you want to reimplement them yourself.

Perl Out Of Memory

I have a script that reads two CSV files and compares them to find out whether an ID that appears in one also appears in the other. The error I am receiving is as follows:
Out of memory during "large" request for 67112960 bytes, total sbrk() is 348203008 bytes
And now for the code:
use strict;
use File::Basename;
my $DAT = $ARGV[0];
my $OPT = $ARGV[1];
my $beg_doc = $ARGV[2];
my $end_doc = $ARGV[3];
my $doc_counter = 0;
my $page_counter = 0;
my %opt_beg_docs;
my %beg_docs;
my ($fname, $dir, $suffix) = fileparse($DAT, qr/\.[^.]*/);
my $outfile = $dir . $fname . "._IMGLOG";
open(OPT, "<$OPT");
while(<OPT>){
    my @OPT_Line = split(/,/, $_);
    $beg_docs{@OPT_Line[0]} = "Y" if(@OPT_Line[3] eq "Y");
    $opt_beg_docs{@OPT_Line[0]} = "Y";
}
close(OPT);
open(OUT, ">$outfile");
while((my $key, my $value) = each %opt_beg_docs){
    print OUT "$key\n";
}
close(OUT);
open(DAT, "<$DAT");
readline(DAT); #skips header line
while(<DAT>){
    $_ =~ s/\xFE//g;
    my @DAT_Line = split(/\x14/, $_);
    #gets the prefix and the range of the beg and end docs
    (my $pre = @DAT_Line[$beg_doc]) =~ s/[0-9]//g;
    (my $beg = @DAT_Line[$beg_doc]) =~ s/\D//g;
    (my $end = @DAT_Line[$end_doc]) =~ s/\D//g;
    #print OUT "BEGDOC: $beg ENDDOC: $end\n";
    foreach($beg .. $end){
        my $doc_id = $pre . $_;
        if($opt_beg_docs{$doc_id} ne "Y"){
            if($beg_docs{$doc_id} ne "Y"){
                print OUT "$doc_id,DOCUMENT NOT FOUND IN OPT FILE\n";
                $doc_counter++;
            } else {
                print OUT "$doc_id,PAGE NOT FOUND IN OPT FILE\n";
                $page_counter++;
            }
        }
    }
}
close(DAT);
close(OUT);
print "Found $page_counter missing pages and $doc_counter missing document(s)";
Basically, I get all the IDs from the file I am checking against. Then I loop over the other file and generate the IDs it describes, because they are presented as ranges. Then I take each generated ID and check for it in the hash of IDs.
Also, I forgot to note that I am using Windows.
You're not using use warnings;, you're not checking for errors on opening files, and you're not printing out debugging statements showing the lines that you are reading in.
Do you know what the input file looks like? If it has no line breaks, you are reading the entire file all at once, which will be disastrous if it is large. Pay attention to how you are parsing the file.
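A minimal sketch of those first points - warnings switched on, a lexical filehandle, and a three-argument open that is actually checked - using the question's $OPT argument (the $ENV{DEBUG} trace is just a hypothetical convenience, not part of the original script):
use strict;
use warnings;

my $OPT = $ARGV[1];   # as in the question
open(my $opt_fh, '<', $OPT)
    or die "Can't open '$OPT': $!";                 # fail loudly instead of silently reading nothing
while (my $line = <$opt_fh>) {
    print STDERR "OPT line: $line" if $ENV{DEBUG};  # show what is actually being read in
    my @OPT_Line = split /,/, $line;
    # ... populate %beg_docs / %opt_beg_docs as before
}
close($opt_fh) or warn "Problem closing '$OPT': $!";
If the open ever fails here, the script stops with the OS error instead of quietly building empty hashes and then reporting every document as missing.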
I'm not sure if it's the cause of your error, but inside your loop where you're reading DAT, you probably want to replace this:
(my $pre = @DAT_Line[$beg_doc]) =~ s/[0-9]//g;
with this:
(my $pre = $DAT_Line[$beg_doc]) =~ s/[0-9]//g;
and same for the other two lines there.
You're closing your OUT file handle and then trying to print to it inside the DAT loop, which, I think, might be outputting to random memory since you closed the filehandle - I'm surprised this didn't produce an error.
Remove the first close(OUT); and see if that improves.
I still don't know what your question is. If it's about the error message, it means you're trying to consume too much memory. If it's about why you're consuming too much memory, I'd first ask whether you read my comment above, then ask how much memory your system has, then follow up by seeing if it improves when you take the regexes away.

Could I do this blind relative to absolute path conversion (for perforce depot paths) better?

I need to "blindly" (i.e. without access to the filesystem, in this case the source control server) convert some relative paths to absolute paths. So I'm playing with dotdots and indices. For those that are curious I have a log file produced by someone else's tool that sometimes outputs relative paths, and for performance reasons I don't want to access the source control server where the paths are located to check if they're valid and more easily convert them to their absolute path equivalents.
I've gone through a number of (probably foolish) iterations trying to get it to work - mostly a few variations of iterating over the array of folders and trying delete_at(index) and delete_at(index-1), but my index kept incrementing while I was deleting elements of the array out from under myself, which didn't work for cases with multiple dotdots. Any tips on improving it in general, or specifically on the lack of non-consecutive dotdot support, would be welcome.
Currently this is working with my limited examples, but I think it could be improved. It can't handle non-consecutive '..' directories, and I am probably doing a lot of wasteful (and error-prone) things that I don't need to do because I'm a bit of a hack.
I've found a lot of examples of converting other types of relative paths using other languages, but none of them seemed to fit my situation.
These are my example paths that I need to convert, from:
//depot/foo/../bar/single.c
//depot/foo/docs/../../other/double.c
//depot/foo/usr/bin/../../../else/more/triple.c
to:
//depot/bar/single.c
//depot/other/double.c
//depot/else/more/triple.c
And my script:
begin
  paths = File.open(ARGV[0]).readlines
  puts(paths)
  new_paths = Array.new
  paths.each { |path|
    folders = path.split('/')
    if ( folders.include?('..') )
      num_dotdots = 0
      first_dotdot = folders.index('..')
      last_dotdot = folders.rindex('..')
      folders.each { |item|
        if ( item == '..' )
          num_dotdots += 1
        end
      }
      if ( first_dotdot and ( num_dotdots > 0 ) ) # this might be redundant?
        folders.slice!(first_dotdot - num_dotdots..last_dotdot) # dependent on consecutive dotdots only
      end
    end
    folders.map! { |elem|
      if ( elem !~ /\n/ )
        elem = elem + '/'
      else
        elem = elem
      end
    }
    new_paths << folders.to_s
  }
  puts(new_paths)
end
Let's not reinvent the wheel... File.expand_path does that for you:
[
'//depot/foo/../bar/single.c',
'//depot/foo/docs/../../other/double.c',
'//depot/foo/usr/bin/../../../else/more/triple.c'
].map {|p| File.expand_path(p) }
# ==> ["//depot/bar/single.c", "//depot/other/double.c", "//depot/else/more/triple.c"]
Why not just use File.expand_path:
irb(main):001:0> File.expand_path("//depot/foo/../bar/single.c")
=> "//depot/bar/single.c"
irb(main):002:0> File.expand_path("//depot/foo/docs/../../other/double.c")
=> "//depot/other/double.c"
irb(main):003:0> File.expand_path("//depot/foo/usr/bin/../../../else/more/triple.c")
=> "//depot/else/more/triple.c"
For a DIY solution using Arrays, this comes to mind (also works for your examples):
absolute = []
relative = "//depot/foo/usr/bin/../../../else/more/triple.c".split('/')
relative.each { |d| if d == '..' then absolute.pop else absolute.push(d) end }
puts absolute.join('/')
Python code:
paths = ['//depot/foo/../bar/single.c',
         '//depot/foo/docs/../../other/double.c',
         '//depot/foo/usr/bin/../../../else/more/triple.c']

def convert_path(path):
    result = []
    for item in path.split('/'):
        if item == '..':
            result.pop()
        else:
            result.append(item)
    return '/'.join(result)

for path in paths:
    print convert_path(path)
prints:
//depot/bar/single.c
//depot/other/double.c
//depot/else/more/triple.c
You can use the same algorithm in Ruby.
