Perl Substring extraction using regex: Split vs Substitution - performance

I have written the following snippets of code to split a properties string:
Input
$line='VarBinds=var0\=DU_/data02;var1\=GE;var2\=95;var3\=LT;var4\=95;var5\=95';
Code 1:
my ($field,@v)=split /=/, $line;
my $value=join '=', @v;
Code 2:
my $field=$line;
my $value = $field;
$field =~ s#^([[:alnum:]]+)=.*#$1#;
$value =~ s#^[[:alnum:]]+=##;
Which is the better piece of code, and why? This article on PerlMonks leads me to believe that Code 1 is better than Code 2, but I am not sure.
Please note that I've used the code below (which is better than both Code 1 and Code 2).
My Code:
my ($field,$value)=split /=/, $line, 2;
Any improvements to My Code are also welcome.
EDIT:
Clarified the question a bit.

Using a limit on split is good, if you can rely on the order of your fields. Code #2 is somewhat crude, but does the same basic job (assuming that no string contains newlines). Your method, and code #1 and #2 all ignore the escaped equal signs, though. You can use Text::ParseWords to overcome that:
use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;
my $line = 'VarBinds=var0\=DU_/data02;var1\=GE;var2\=95;var3\=LT;var4\=95;';
my @f = quotewords('=', 1, $line);
print Dumper \@f;
Output
$VAR1 = [
'VarBinds',
'var0\\=DU_/data02;var1\\=GE;var2\\=95;var3\\=LT;var4\\=95;'
];
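If you also need to break that value apart into the individual varN pairs, one possible continuation (just a sketch; the %pairs hash and the split on the escaped '=' are my own choices, not part of the answer above) is:
use strict;
use warnings;
use Text::ParseWords;

my $line = 'VarBinds=var0\=DU_/data02;var1\=GE;var2\=95;var3\=LT;var4\=95;';
my ($field, $value) = quotewords('=', 1, $line);  # keep == 1 preserves the backslashes

my %pairs;
for my $pair (grep { length } split /;/, $value) {
    my ($k, $v) = split /\\=/, $pair, 2;          # split each pair on its escaped '='
    $pairs{$k} = $v;
}
print "$_ => $pairs{$_}\n" for sort keys %pairs;  # var0 => DU_/data02, var1 => GE, ...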

Related

Checking really fast if one of many strings exists in one of many other strings, in Perl

Let's say I have a set of 100,000 strings, each one around 20-50 bytes on average.
Then let's say I have another set of 100,000,000 strings, each one also around 20-50 bytes on average.
I'd like to go through all 100,000,000 strings from the second set and check whether any of the strings from the first set exists in any string from the second set.
Example: string from first set: "abc", string from second set: "123abc123" = match!
I've tried using Perl's index(), but it's not fast enough. Is there a better way to do this type of matching?
I found Algorithm::AhoCorasick::XS on CPAN, which implements the classic, very efficient multiple-string search Aho-Corasick algorithm (the same one used by grep -F), and it seems to be reasonably fast (about half the speed of an equivalent grep invocation):
Example script:
#!/usr/bin/env perl
use warnings;
use strict;
use autodie;
use feature qw/say/;
use Algorithm::AhoCorasick::XS;
open my $set1, "<", "set1.txt";
my @needles = <$set1>;
chomp @needles;
my $search = Algorithm::AhoCorasick::XS->new(\@needles);
open my $set2, "<", "set2.txt";
while (<$set2>) {
chomp;
say if defined $search->first_match($_);
}
and using it (With randomly-generated test files):
$ wc -l set1.txt set2.txt
10000 set1.txt
500000 set2.txt
510000 total
$ time perl test.pl | wc -l
458414
real 0m0.403s
user 0m0.359s
sys 0m0.031s
$ time grep -Ff set1.txt set2.txt | wc -l
458414
real 0m0.199s
user 0m0.188s
sys 0m0.031s
You should use a regex alternation, like:
my @string = qw/abc def ghi/;
my $matcher = qr/@{[join '|', map quotemeta, sort @string]}/;
This should be faster than using index. But it can be made faster yet:
Up to a certain limit, depending on both the length and number of the strings, perl will build a trie for efficient matching; see e.g. https://perlmonks.org/?node_id=670558. You will want to experiment with how many strings you can include in a single regex to generate an array of regexes. Then combine those separate regexes into a single one (untested):
my @search_strings = ...;
my @matchers;
my $string_limit = 3000; # a guess on my part
my @strings = sort @search_strings;
while (my @subset = splice @strings, 0, $string_limit) {
push @matchers, qr/^.*?@{[join '|', map quotemeta, sort @subset]}/s;
}
my $matcher = '(?:' . join('|', map "(??{\$matchers[$_]})", 0..$#matchers) . ')';
$matcher = do { use re 'eval'; qr/$matcher/ };
/$matcher/ and print "line $. matched: $_" while <>;
The (??{...}) construct is needed to join the separate regexes; without it, the subregexes are all just interpolated and the joined regex compiled all together, removing the trie optimization. Each subregex starts with ^.*? so it searches the entire string; without that, the joined regex would have to invoke each subregex separately for each position in the string.
Using contrived data, I'm seeing about 3000 strings searched per second with this approach in a not very fast vm. Using the naive regex is less than 50 strings per second. Using grep, as suggested in a comment by Shawn, is faster (about 4200 strings per second for me) but gives you less control if you want to do things like identify which strings matched or at what positions.
You may want to have a look at https://github.com/leendo/hello-world .
Its parallel processing makes it really fast. Either just type in all search terms individually or as || conjunctions, or (better) adapt it to run the second set programmatically in one go.
Here is an idea: you could partition the dictionary into lists of words that share the same 2- or 3-letter prefix. You would then iterate over the large set and, for each position in each string, extract the prefix and try to match the strings that have this prefix.
You would use a hash table to store the lists, with O(1) lookup time.
If some words in the dictionary are shorter than the prefix length, you would have to special-case short words.
Making the prefixes longer will make the hash table larger but the lists shorter, improving the test time.
I have no clue if this can be implemented efficiently in Perl.
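For what it's worth, here is a minimal Perl sketch of that idea (untested at scale; the 3-character prefix length, the index() fallback for short words, and all names are my own illustrative choices):
#!/usr/bin/env perl
use strict;
use warnings;

my $PREFIX_LEN = 3;
my @dictionary = qw(abc defg xy);   # the "first set"

# Bucket the dictionary by prefix; words shorter than the prefix go in their own list.
my (%buckets, @short);
for my $word (@dictionary) {
    if (length($word) < $PREFIX_LEN) { push @short, $word }
    else { push @{ $buckets{ substr($word, 0, $PREFIX_LEN) } }, $word }
}

sub matches {
    my ($haystack) = @_;
    # Few short words, so plain index() is fine for them.
    for my $w (@short) {
        return 1 if index($haystack, $w) >= 0;
    }
    # At each position, look up the prefix bucket and test only those candidates.
    for my $pos (0 .. length($haystack) - $PREFIX_LEN) {
        my $candidates = $buckets{ substr($haystack, $pos, $PREFIX_LEN) } or next;
        for my $w (@$candidates) {
            return 1 if substr($haystack, $pos, length $w) eq $w;
        }
    }
    return 0;
}

print "$_\n" for grep { matches($_) } ('123abc123', 'zzxyzz', 'no match here');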
The task is quite simple and is probably performed on a daily basis around the globe.
Please see the following code snippet for one of many possible implementations:
use strict;
use warnings;
use feature 'say';
use Getopt::Long qw(GetOptions);
use Pod::Usage;
my %opt;
my $re;
GetOptions(
'sample|s=s' => \$opt{sample},
'debug|d' => \$opt{debug},
'help|h' => \$opt{help},
'man|m' => \$opt{man}
) or pod2usage(2);
pod2usage(1) if $opt{help};
pod2usage(-verbose => 2) if $opt{man};
pod2usage("$0: No files given.") if ((#ARGV == 0) && (-t STDIN));
$re = read_re($opt{sample});
say "DEBUG: search pattern is ($re)" if $opt{debug};
find_in_file($re);
sub find_in_file {
my $search = shift;
while( <> ) {
chomp;
next unless /$search/;
say;
}
}
sub read_re {
my $filename = shift;
open my $fh, '<', $filename
or die "Couldn't open $filename";
my @data = <$fh>;
close $fh;
chomp @data;
my $re = join('|', @data);
return $re;
}
__END__
=head1 NAME
file_in_file.pl - search strings of one file in other
=head1 SYNOPSIS
file_in_file.pl [options] -s sample.txt dbfile.txt
Options:
-s,--sample search pattern file
-d,--debug debug flag
-h,--help brief help message
-m,--man full documentation
=head1 OPTIONS
=over 4
=item B<-s,--sample>
Search pattern file
=item B<-d,--debug>
Print debug information.
=item B<-h,--help>
Print a brief help message and exits.
=item B<-m,--man>
Prints the manual page and exits.
=back
B<This program> searches for patterns from the B<sample> file in B<dbfile>
=cut

decode_utf8 consumes time in execution

The following is the code snippet that I have been working on to improve its speed.
use strict;
use warnings;
use Encode;
open(IN,"<utf8",$ARGV[0]) or die "Cannot open $ARGV[0]:$!\n"; ##treat it as a huge data of 35,000 lines in devnagari script.
my #in = <IN>;
close(IN);
my $key = "अच्छा"; #key to be matched contains devanagari script as a string
foreach my $in (@in) {
chomp($in);
$key = decode_utf8($key);
$in = decode_utf8($in);
if($key eq $in) {
print "$key: matched\n";
}
else {
print "Not matched\n";
}
}
I am trying to match the lines in the file against the key.
By profiling my code following are the results I get.
The results are such that decode_utf8 consumes 34% of time.
Since my data is in UTF-8, I used decode_utf8.
What can I do to improve the speed here? Is there any other way to match the Unicode data without decode_utf8 in the code?
The results are such that decode_utf8 consumes 34% of time.
Well yeah, that's basically all your program does.
More importantly, your code is buggy. You're decoding previously decoded strings!
You decode the contents of the file when you read them (via :utf8), and then you decode the already-decoded content in the loop.
You decode the contents of $key every pass through the loop so that by the fourth pass, you are using decode_utf8(decode_utf8(decode_utf8(decode_utf8($key)))).
Fix:
use utf8; # Source code encoded using UTF-8.
use open ':std', ':encoding(UTF-8)'; # Term provides and expects UTF-8. Default for files.
use strict;
use warnings;
my $key = "अच्छा";
my $found = 0;
while (my $line = <>) {
chomp($line);
if ($line eq $key) {
$found = 1;
last;
}
}
if ($found) {
print "Match found\n";
else {
print "No match\n";
}
This fixes other issues too:
Encodes outputs (using use open ':std').
Doesn't needlessly use global variables. (Use open my $IN instead of open IN.)
Doesn't needlessly load the entire file into memory.
Doesn't needlessly read the entire file.
Doesn't print Not matched 34,999 times when the key is found.
Avoids :utf8 in favour of :encoding(UTF-8).
Doesn't reinvent <>.
Doesn't hide a die in the middle of a line. (Put a line break before or die.)
Doesn't use "cannot". (Use "can't"!)

More Efficient Way to Find/Replace Non-Escaped Characters

I'm trying to find the best way to find and replace (in Ruby 1.9.2) all instances of a special code (%x) preceded by zero, or an even number of backslashes.
In other words, :
%x --> FOO
\%x --> \%x
\\%x --> \\FOO
\\\%x --> \\\%x
\\\\%x --> \\\\FOO
etc.
There may be multiple instances in a string: "This is my %x string with two %x codes."
With help from the questions asked here and here I got the following code to do what I want:
str.gsub(/
(?<!\\) # Not preceded by a single backslash
((?:\\\\)*) # Eat up any sets of double backslashes - match group 1
(%x) # Match the code itself - match group 2
/x,
# Keep the double backslashes (match group 1) then put in the sub
"\\1foo")
That regex seems kind of heavyweight, though. Since this code will be called with reasonable frequency in my application, I want to make sure I'm not missing a better (cleaner/more efficient) way to do this.
I can imagine two alternative regular expressions:
Using a look-behind assertion, as in your code. (look-behind-2)
Matching one more character, before the back-slashes. (alternative)
Other than that, I see only one minor optimization for your regular expression: the "%x" is constant, so you do not have to capture it. (look-behind-1)
I am not sure which of these is actually more efficient. Therefore, I created a small benchmark:
$ perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
my $test = '%x \%x \\%x \\\%x \\\\%x \\\\\%x \\\\%x \\\%x \\%x \%x %x';
cmpthese 1_000_000, {
'look-behind-1' => sub { (my $t = $test) =~ s/(?<!\\)((?:\\\\)*)\%x/${1}foo/g },
'look-behind-2' => sub { (my $t = $test) =~ s/(?<!\\)((?:\\\\)*)(\%x)/${1}foo/g },
'alternative' => sub { (my $t = $test) =~ s/((?:^|[^\\])(?:\\\\)*)\%x/${1}foo/g },
};
Results:
                  Rate   alternative look-behind-2 look-behind-1
alternative   145349/s            --          -23%          -26%
look-behind-2 188324/s           30%            --           -5%
look-behind-1 197239/s           36%            5%            --
As you can clearly see, the alternative regular expression is far behind the look-behind approach and capturing the "%x" is slightly slower than not capturing it.

aggregate totals when key changes in Perl

I have an input file with the following format
ant,1
bat,1
bat,2
cat,4
cat,1
cat,2
dog,4
I need to aggregate the col2 for each key (column1) so the result is:
ant,1
bat,3
cat,7
dog,4
Other considerations:
Assume that the input file is sorted
The input file is pretty large (about 1M rows), so I don't want to use an array and take up memory
Each input line should be processed as we read it, and move to the next line
I need to write the results to an outFile
I need to do this in Perl, but a pseudo-code or algorithm would help just as fine
Thanks!
This is what I came up with... I want to see if this can be written better/more elegantly.
open infile, outFile
prev_line = <infile>;
print_line = $prev_line;
while(<>){
curr_line = $_;
@prev_cols=split(',', $prev_line);
@curr_cols=split(',', $curr_line);
if ( $prev_cols[0] eq $curr_cols[0] ){
$prev_cols[1] += $curr_cols[1];
$print_line = "$prev_cols[0],$prev_cols[1]\n";
$print_flag = 0;
}
else{
print outFile "$print_line";
$print_flag = 1;
$print_line = $curr_line;
}
$prev_line = $curr_line;
}
if($print_flag == 1){
print outFile "$curr_line";
}
else{
print outFile "$print_line";
}
#!/usr/bin/perl
use warnings;
use strict;
use integer;
my %a;
while (<>) {
my ($animal, $n) = /^\s*(\S+)\s*,\s*(\S+)/;
$a{$animal} += $n if defined $n;
}
print "$_,${a{$_}}\n" for sort keys %a;
This short code affords you the chance to learn Perl's excellent hash facility, as %a. Hashes are central to Perl. One really cannot write fluent Perl without them.
Observe incidentally that the code exercises Perl's interesting autovivification feature. The first time a particular animal is encountered in the input stream, no count exists, so Perl implicitly assumes a pre-existing count of zero. Thus, the += operator does not fail, even though it seems that it should. It just adds to zero in the first instance.
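A tiny example (mine, not taken from the code above) makes that behaviour concrete:
use strict;
use warnings;

my %count;                # empty hash: no 'cat' entry exists yet
$count{cat} += 4;         # undef is treated as 0, so this stores 4 without complaint
$count{cat} += 3;         # an ordinary addition now: 7
print "$count{cat}\n";    # prints 7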
On the other hand, it may happen that not only the number of data but the number of animals is so large that one would not like to store the hash %a. In this case, one can still calculate totals, provided only that the data are sorted by animal in the input, as they are in your example. In this case, something like the following might suit (though regrettably it is not nearly so neat as the above).
#!/usr/bin/perl
use warnings;
use strict;
use integer;
my $last_animal = undef;
my $total_for_the_last_animal = 0;
sub start_new_animal ($$) {
my $next_animal = shift;
my $n = shift;
print "$last_animal,$total_for_the_last_animal\n"
if defined $last_animal;
$last_animal = $next_animal;
$total_for_the_last_animal = $n;
}
while (<>) {
my ($animal, $n) = /^\s*(\S+)\s*,\s*(\S+)/;
if (
defined($n) && defined($animal) && defined($last_animal)
&& $animal eq $last_animal
) { $total_for_the_last_animal += $n; }
else { start_new_animal $animal, $n; }
}
start_new_animal undef, 0;
Use Perl’s awk mode.
-a
turns on autosplit mode when used with a -n or -p. An implicit split command to the @F array is done as the first thing inside the implicit while loop produced by the -n or -p.
perl -ane 'print pop(@F), "\n";'
is equivalent to
while (<>) {
@F = split(' ');
print pop(@F), "\n";
}
An alternate delimiter may be specified using -F.
All that’s left for you is to accumulate the sums in a hash and print them.
$ perl -F, -lane '$s{$F[0]} += $F[1];
END { print "$_,$s{$_}" for sort keys %s }' input
Output:
ant,1
bat,3
cat,7
dog,4
It's trivial in perl. Loop on the file input. Split the input line on comma. For each key in column one keep a hash to which you add the value in column two. At the end of the file print the list of hash keys and their values. It can be done in one line but that would obfuscate the algorithm.
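Spelled out rather than golfed into one line, that description could look roughly like this (a sketch; column names assumed from the sample input):
#!/usr/bin/perl
use strict;
use warnings;

my %total;
while (my $line = <>) {
    chomp $line;
    my ($key, $value) = split /,/, $line, 2;   # column one is the key, column two the number
    next unless defined $value;
    $total{$key} += $value;                    # accumulate per key
}
print "$_,$total{$_}\n" for sort keys %total;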

How do I parse YAML with nil values?

I apologize for the very specific issue I'm posting here but I hope it will help others that may also run across this issue. I have a string that is being formatted to the following:
[[,action1,,],[action2],[]]
I would like to translate this to valid YAML so that it can be parsed which would look like this:
[['','acton1','',''],['action2'],['']]
I've tried a bunch of regular expressions to accomplish this but I'm afraid that I'm at a complete loss. I'm ok with running multiple expressions if needed. For example (ruby):
puts s.gsub!(/,/,"','") # => [[','action1','',']','[action2]','[]]
puts s.gsub!(/\[',/, "['',") # => [['','action1','',']','[action2]','[]]
That's getting there, but I have a feeling I'm starting to go down a rat-hole with this approach. Is there a better way to accomplish this?
Thanks for the help!
This does the job for the empty fields (ruby1.9):
s.gsub(/(?<=[\[,])(?=[,\]])/, "''")
Or for ruby1.8, which doesn't support zero-width look-behind:
s.gsub(/([\[,])(?=[,\]])/, "\\1''")
Quoting non-empty fields can be done with one of these:
s.gsub(/(?<=[\[,])\b|\b(?=[,\]])/, "'")
s.gsub(/(\w+)/, "'\\1'")
In the above I'm making use of zero-width positive look behind and zero-width positive look ahead assertions (the '(?<=' and '(?=').
I've looked for some ruby specific documentation but could not find anything that explains these features in particular. Instead, please let me refer you to perlre.
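For what it's worth, the same two substitutions can be written in Perl, where perlre documents these assertions; this is only a transliteration of the Ruby above, not a new approach:
use strict;
use warnings;

my $s = '[[,action1,,],[action2],[]]';

# Quote the empty fields: a zero-width spot between '[' or ',' and ',' or ']'.
(my $quoted = $s) =~ s/(?<=[\[,])(?=[,\]])/''/g;

# Quote the non-empty fields using word boundaries next to the delimiters.
$quoted =~ s/(?<=[\[,])\b|\b(?=[,\]])/'/g;

print "$quoted\n";   # [['','action1','',''],['action2'],['']]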
It would be easier to just parse it, then output valid YAML.
Since I don't know Ruby, here is an example in Perl.
Since you only want a subset of YAML that appears to be similar to JSON, I used the JSON module.
I've been wanting an excuse to use Regexp::Grammars, so I used it to parse the data.
I guarantee it will work, no matter how deep the arrays are.
#! /usr/bin/env perl
use strict;
#use warnings;
use 5.010;
#use YAML;
use JSON;
use Regexp::Grammars;
my $str = '[[,action1,,],[action2],[],[,],[,[],]]';
my $parser = qr{
<match=Array>
<token: Text>
[^,\[\]]*
<token: Element>
(?:
<.Text>
|
<MATCH=Array>
)
<token: Array>
\[
(?:
(?{ $MATCH = [qw'']; })
|
<[MATCH=Element]> ** (,)
)
\]
}x;
if( $str =~ $parser ){
say to_json $/{match};
}else{
die $@ if $@;
}
Which outputs.
[["","action1","",""],["action2"],[],["",""],["",[],""]]
If you really want YAML, just uncomment "use YAML;" and replace to_json() with Dump():
---
-
- ''
- action1
- ''
- ''
-
- action2
- []
-
- ''
- ''
-
- ''
- []
- ''
Try this:
s.gsub(/([\[,])(?=[,\]])/, "\\1''")
.gsub(/([\[,])(?=[^'\[])|([^\]'])(?=[,\]])/, "\\+'");
EDIT: I'm not sure about the replacement syntax. That's supposed to be group #1 in the first gsub, and the highest-numbered participating group -- $+ -- in the second.

Resources