More Efficient Way to Find/Replace Non-Escaped Characters - ruby

I'm trying to find the best way to find and replace (in Ruby 1.9.2) all instances of a special code (%x) preceded by zero, or an even number of backslashes.
In other words, :
%x --> FOO
\%x --> \%x
\\%x --> \\FOO
\\\%x --> \\\%x
\\\\%x --> \\\\FOO
etc.
There may be multiple instances in a string: "This is my %x string with two %x codes."
With help from the questions asked here and here I got the following code to do what I want:
str.gsub(/
(?<!\\) # Not preceded by a single backslash
((?:\\\\)*) # Eat up any sets of double backslashes - match group 1
(%x) # Match the code itself - match group 2
/x,
# Keep the double backslashes (match group 1) then put in the sub
"\\1foo")
That regex seems kind of heavyweight, though. Since this code will be called with reasonable frequency in my application, I want to make sure I'm not missing a better (cleaner/more efficient) way to do this.

I can imagine two alternative regular expressions:
Using a look-behind assertion, as in your code. (look-behind-2)
Matching one more character, before the back-slashes. (alternative)
Other than that, I do only see a minor optimization for your regular expression. The "%x" is constant, so you do not have to capture it. (look-behind-1)
I am not sure which of these is actually more efficient. Therefore, I created a small benchmark:
$ perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
my $test = '%x \%x \\%x \\\%x \\\\%x \\\\\%x \\\\%x \\\%x \\%x \%x %x';
cmpthese 1_000_000, {
'look-behind-1' => sub { (my $t = $test) =~ s/(?<!\\)((?:\\\\)*)\%x/${1}foo/g },
'look-behind-2' => sub { (my $t = $test) =~ s/(?<!\\)((?:\\\\)*)(\%x)/${1}foo/g },
'alternative' => sub { (my $t = $test) =~ s/((?:^|[^\\])(?:\\\\)*)\%x/${1}foo/g },
};
Results:
Rate alternative look-behind-2 look-behind-1
alternative 145349/s -- -23% -26%
look-behind-2 188324/s 30% -- -5%
look-behind-1 197239/s 36% 5% --
As you can clearly see, the alternative regular expression is far behind the look-behind approach and capturing the "%x" is slightly slower than not capturing it.
regards, Matthias

Related

How do I remove a common substring using Ruby?

I have read How do I remove substring after a certain character in a string using Ruby?. This is close, but different.
I have these emails with a mask:
email1 = 'giovanna.macedo#lojas100.com.br-215000695716b.ct.domain.com.br'
email2 = 'alvaro-neves#stockshop.com-215000695716b.ct.domain.com.br'
email3 = 'filiallojas123#filiallojas.net-215000695716b.ct.domain.com.br'
I want to remove the substrings that are after .br, .com and .net. The return must be:
email1 = 'giovanna.macedo#lojas100.com.br'
email2 = 'alvaro-neves#stockshop.com'
email3 = 'filiallojas123#filiallojas.net'
You can do that with the method String#[] with an argument that is a regular expression.
r = /.*?\.(?:rb|com|net|br)(?!\.br)/
'giovanna.macedo#lojas100.com.br-215000695716b.ct.domain.com.br'[r]
#=> "giovanna.macedo#lojas100.com.br"
'alvaro-neves#stockshop.com-215000695716b.ct.domain.com.br'[r]
#=> "alvaro-neves#stockshop.com"
'filiallojas123#filiallojas.net-215000695716b.ct.domain.com.br'[r]
#=> "filiallojas123#filiallojas.net"
The regular expression reads as follows: "Match zero or more characters non-greedily (?), follow by a period, followed by 'rb' or 'com' or 'net' or 'br', which is not followed by .br. (?!\.br) is a negative lookahead.
Alternatively the regular expression can be written in free-spacing mode to make it self-documenting:
r = /
.*? # match zero or more characters non-greedily
\. # match '.'
(?: # begin a non-capture group
rb # match 'rb'
| # or
com # match 'com'
| # or
net # match 'net'
| # or
br # match 'br'
) # end non-capture group
(?! # begin a negative lookahead
\.br # match '.br'
) # end negative lookahead
/x # invoke free-spacing regex definition mode
This should work for your scenario:
expr = /^(.+\.(?:br|com|net))-[^']+(')$/
str = "email = 'giovanna.macedo#lojas100.com.br-215000695716b.ct.domain.com.br'"
str.gsub(expr, '\1\2')
Use the String#delete_suffix Method
This was tested with Ruby 3.0.2. Your mileage may vary with other versions that don't support String#delete_suffix or its related bang method. Since you're trying to remove the exact same suffix from all your emails, you can simply invoke #delete_suffix! on each of your strings. For example:
common_suffix = "-215000695716b.ct.domain.com.br".freeze
emails = [email1, email2, email3]
emails.each { _1.delete_suffix! common_suffix }
You can then validate your results with:
emails
#=> ["giovanna.macedo#lojas100.com.br", "alvaro-neves#stockshop.com", "filiallojas123#filiallojas.net"]
email1
#=> "giovanna.macedo#lojas100.com.br"
email2
#=> "alvaro-neves#stockshop.com"
email3
#=> "filiallojas123#filiallojas.net"
You can see that the array has replaced each value, or you can call each of the array's variables individually if you want to check that the strings have actually been modified in place.
String Methods are Usually Faster, But Your Mileage May Vary
Since you're dealing with String objects instead of regular expressions, this solution is likely to be faster at scale, although I didn't bother to benchmark all solutions to compare. If you care about performance, you can measure larger samples using IRB's new measure command, it took only 0.000062s to process the strings this way on my system, and String methods generally work faster than regular expressions at large scales. You'll need to do more extensive benchmarking if performance is a core concern, though.
Making the Call Shorter
You can even make the call shorter if you want. I left it a bit verbose above so you could see what the intent was at each step, but you can trim this to a single one-liner with the following block:
# one method chain, just wrapped to prevent scrolling
[email1, email2, email3].
map { _1.delete_suffix! "-215000695716b.ct.domain.com.br" }
Caveats
You Need Fixed-String Suffixes
The main caveat here is that this solution will only work when you know the suffix (or set of suffixes) you want to remove. If you can't rely on the suffixes to be fixed, then you'll likely need to pursue a regex solution in one way or another, even if it's just to collect a set of suffixes.
Dealing with Frozen Strings
Another caveat is that if you've created your code with frozen string literals, you'll need to adjust your code to avoid attempting in-place changes to frozen strings. There's more than one way to do this, but a simple destructuring assignment is probably the easiest to follow given your small code sample. Consider the following:
# assume that the strings in email1 etc. are frozen, but the array
# itself is not; you can't change the strings in-place, but you can
# re-assign new strings to the same variables or the same array
emails = [email1, email2, email3]
email1, email2, email3 =
emails.map { _1.delete_suffix "-215000695716b.ct.domain.com.br" }
There are certainly other ways to work around frozen strings, but the point is that while the now-common use of the # frozen_string_literal: true magic comment can improve VM performance or memory usage in large programs, it isn't always the best option for string-mangling code. Just keep that in mind, as tools like RuboCop love to enforce frozen strings, and not everyone stops to consider the consequences of such generic advice to the given problem domain.
I would just use the chomp(string) method like so:
mask = "-215000695716b.ct.domain.com.br"
email1.chomp(mask)
#=> "giovanna.macedo#lojas100.com.br"
email2.chomp(mask)
#=> "alvaro-neves#stockshop.com"
email3.chomp(mask)
#=> "filiallojas123#filiallojas.net"

Checking really fast if one of many strings exists in one of many other strings, in Perl

Let's say I have a set of 100,000 strings, each one around 20-50 bytes on average.
Then let's say I have another set of 100,000,000 strings, each one also around 20-50 bytes on average.
I'd like to go through all 100,000,000 strings from the second set and check if any one of the strings from the first set exist in any one string from the second set.
Example: string from first set: "abc", string from second set: "123abc123" = match!
I've tried using Perl's index(), but it's not fast enough. Is there a better way to do this type of matching?
I found Algorithm::AhoCorasik::XS on CPAN, which implements the classic, very efficient multiple string search Aho-Corasick algorithm (Same one used by grep -F), and it seems to be reasonably fast (About half the speed of an equivalent grep invocation):
Example script:
#!/usr/bin/env perl
use warnings;
use strict;
use autodie;
use feature qw/say/;
use Algorithm::AhoCorasick::XS;
open my $set1, "<", "set1.txt";
my #needles = <$set1>;
chomp #needles;
my $search = Algorithm::AhoCorasick::XS->new(\#needles);
open my $set2, "<", "set2.txt";
while (<$set2>) {
chomp;
say if defined $search->first_match($_);
}
and using it (With randomly-generated test files):
$ wc -l set1.txt set2.txt
10000 set1.txt
500000 set2.txt
510000 total
$ time perl test.pl | wc -l
458414
real 0m0.403s
user 0m0.359s
sys 0m0.031s
$ time grep -Ff set1.txt set2.txt | wc -l
458414
real 0m0.199s
user 0m0.188s
sys 0m0.031s
You should use a regex alternation, like:
my #string = qw/abc def ghi/;
my $matcher = qr/#{[join '|', map quotemeta, sort #string]}/;
This should be faster than using index. But it can be made faster yet:
Up to a certain limit, depending on both the length and number of the strings, perl will build a trie for efficient matching; see e.g. https://perlmonks.org/?node_id=670558. You will want to experiment with how many strings you can include in a single regex to generate an array of regexes. Then combine those separate regexes into a single one (untested):
my #search_strings = ...;
my #matchers;
my $string_limit = 3000; # a guess on my part
my #strings = sort #search_strings;
while (my #subset = splice #strings, 0, $string_limit) {
push #matchers, qr/^.*?#{[join '|', map quotemeta, sort #subset]}/s;
}
my $matcher = '(?:' . join('|', map "(??{\$matchers[$_]})", 0..$#matchers) . ')';
$matcher = do { use re 'eval'; qr/$matcher/ };
/$matcher/ and print "line $. matched: $_" while <>;
The (??{...}) construct is needed to join the separate regexes; without it, the subregexes are all just interpolated and the joined regex compiled all together, removing the trie optimization. Each subregex starts with ^.*? so it searches the entire string; without that, the joined regex would have to invoke each subregex separately for each position in the string.
Using contrived data, I'm seeing about 3000 strings searched per second with this approach in a not very fast vm. Using the naive regex is less than 50 strings per second. Using grep, as suggested in a comment by Shawn, is faster (about 4200 strings per second for me) but gives you less control if you want to do things like identify which strings matched or at what positions.
You may want to have a look at https://github.com/leendo/hello-world .
Its parallel processing makes it really fast. Either just type in all search terms individually or as || conjunctions, or (better) adapt it to run the second set programatically in one go.
Here is an idea: you could partition the dictionary into lists of words that have the same 2 or 3 letter prefix. You would then iterate on the large set and for each position in each string, extract the prefix and try and match the strings that have this prefix.
You would use hashtable to store the lists with O(1) lookup time.
If some words in the dictionary are shorter than the prefix length, you would have to special case short words.
Making prefixes longer will make the hashtable larger but the lists shorter, improving the test time.
I have no clue if this can be implemented efficiently in Perl.
The task is quite simple and perhaps used on everyday basis around the globe.
Please see following code snippet for one of many possible implementations
use strict;
use warnings;
use feature 'say';
use Getopt::Long qw(GetOptions);
use Pod::Usage;
my %opt;
my $re;
GetOptions(
'sample|s=s' => \$opt{sample},
'debug|d' => \$opt{debug},
'help|h' => \$opt{help},
'man|m' => \$opt{man}
) or pod2usage(2);
pod2usage(1) if $opt{help};
pod2usage(-verbose => 2) if $opt{man};
pod2usage("$0: No files given.") if ((#ARGV == 0) && (-t STDIN));
$re = read_re($opt{sample});
say "DEBUG: search pattern is ($re)" if $opt{debug};
find_in_file($re);
sub find_in_file {
my $search = shift;
while( <> ) {
chomp;
next unless /$search/;
say;
}
}
sub read_re {
my $filename = shift;
open my $fh, '<', $filename
or die "Couldn't open $filename";
my #data = <$fh>;
close $fh;
chomp #data;
my $re = join('|', #data);
return $re;
}
__END__
=head1 NAME
file_in_file.pl - search strings of one file in other
=head1 SYNOPSIS
yt_video_list.pl [options] -s sample.txt dbfile.txt
Options:
-s,--sample search pattern file
-d,--debug debug flag
-h,--help brief help message
-m,--man full documentation
=head1 OPTIONS
=over 4
=item B<-s,--sample>
Search pattern file
=item B<-d,--debug>
Print debug information.
=item B<-h,--help>
Print a brief help message and exits.
=item B<-m,--man>
Prints the manual page and exits.
=back
B<This program> seaches patterns from B<sample> file in B<dbfile>
=cut

Regular Expression replacement to convert Less mixins to Scss

I'm looking to convert Less mixin calls to their equivalents in Scss:
.mixin(); should become #mixin();
.mixin(0); should become #mixin(0);
.mixin(0; 1; 2); should become #mixin(0, 1, 2);
I'm having the most difficulty with the third example, as I essentially need to match n groups separated by semicolons, and replace those with the same groups separated by commas. I suppose this relies on some sort of repeating groups functionality in regexes that I'm not familiar with.
It's not simply enough to simply replace semicolons within paren - I need a regex that will only match the \.[\w\-]+\(.*\) format of mixins, but obviously with some magic in the second match group to handle the 3rd example above.
I'm doing this in Ruby, so if you're able to provide replacement syntax that's compatible with gsub, that would be awesome. I would like a single regex replacement, something that doesn't require multiple passes to clean up the semicolons.
I suggest adding two capturing groups round the subvalues you need and using an additional gsub in the first gsub block to replace the ; with , only in the 2nd group.
See
s = ".mixin(0; 1; 2);"
puts s.gsub(/\.([\w\-]+)(\(.*\))/) { "##{$1}#{$2.gsub(/;/, ',')}" }
# => #mixin(0, 1, 2);
The pattern details:
\. - a literal dot
([\w\-]+) - Group 1 capturing 1 or more word chars ([a-zA-Z0-9_]) or -
(\(.*\)) - Group 2 capturing a (, then any 0+ chars other than linebreak symbols as many as possible up to the last ) and the last ). NOTE: if there are multiple values, use lazy matching - (\(.*?\)) - here.
Here you go:
less_style = ".mixin(0; 1; 2);"
# convert the first period to #
less_style.gsub! /^\./, '#'
# convert the inner semicolons to commas
scss_style = less_style.gsub /(?<=[\(\d]);/, ','
scss_style
# => "#mixin(0, 1, 2);"
The second regex is using positive lookbehinds. You can read about those here: http://www.regular-expressions.info/lookaround.html
I also use this neat web app to play around with regexes: http://rubular.com/
This will get you a single pass through gsub:
".mixin(0; 1; 2);".gsub(/(?<!\));|\./, ";" => ",", "." => "#")
=> "#mixin(0, 1, 2);"
It's an OR regex with a hash for the replacement parameters.
Assuming from your example that you just want to replace semicolons not following close parens(negative lookbehind): (?<!\));
You can modify/build on this with other expressions. Even add more OR conditions to the regex.
Also, you can use the block version of gsub if you need more options.

How exactly does this work string.split(/\?|\.|!/).size?

I know, or at least I think I know, what this does (string.split(/\?|\.|!/).size); splits the string at every ending punctuation into an array and then gets the size of the array.
The part I am confused with is (/\?|\.|!/).
Thank you for your explanation.
Regular expressions are surrounded by slashes / /
The backslash before the question mark and dot means use those characters literally (don't interpret them as special instructions)
The vertical pipes are "or"
So you have / then question mark \? then "or" | then period \. then "or" | then exclamation point ! then / to end the expression.
/\?|\.|!/
It's a Regular Expression. That particular one matches any '?', '.' or '!' in the target string.
You can learn more about them here: http://regexr.com/
A regular expression splitting on the char "a" would look like this: /a/. A regular expression splitting on "a" or "b" is like this: /a|b/. So splitting on "?", "!" and "." would look like /?|!|./ - but it does not. Unfortunately, "?", and "." have special meaning in regexps which we do not want in this case, so they must be escaped, using "\".
A way to avoid this is to use Regexp.union("?","!",".") which results in /\?|!|\./
(/\?|\.|!/)
Working outside in:
The parentheses () captures everything enclosed.
The // tell Ruby you're using a Regular Expression.
\? Matches any ?
\. Matches any .
! Matches any !
The preceding \ tells Ruby we want to find these specific characters in the string, rather than using them as special characters.
Special characters (that need to be escaped to be matched) are:
. | ( ) [ ] { } + \ ^ $ * ?.
There is a nice guide to Ruby RegEx at:
http://rubular.com/ & http://www.tutorialspoint.com/ruby/ruby_regular_expressions.htm
For SO answers that involve regular expressions, I often use the "extended" mode, which makes them self-documenting. This one would be:
r = /
\? # match a question mark
| # or
\. # match a period
| # or
! # match an explamation mark
/x # extended mode
str = "Out, damn'd spot! out, I say!—One; two: why, then 'tis time to " +
"do't.—Hell is murky.—Fie, my lord, fie, a soldier, and afeard?"
str.split(r)
#=> ["Out, damn'd spot",
# " out, I say",
# "—One; two: why, then 'tis time to do't",
# "—Hell is murky",
# "—Fie, my lord, fie, a soldier, and afeard"]
str.split(r).size #=> 5
#steenslag mentioned Regexp::union. You could also use Regexp::new to write (with single quotes):
r = Regexp.new('\?|\.|!')
#=> /\?|\.|!/
but it really doesn't buy you anything here. You might find it useful in other situations, however.

Regex to capture string into ruby method params

I Looking for an Regex to capture this examples of strings:
first_paramenter, first_hash_key: 'class1 class2', second_hash_key: true
first_argument, single_hash_key: 'class1 class2'
first_argument_without_second_argument
The pattern rules are:
The string must start some word (the first parameter) /^(\w+)/
The second parameter is optional
If second parameter provided, must have one comma after fisrt parameter
The second argument is an hash, with keys and values. Values can be true, false or an string enclosed by quotes
The hash keys must start with letter
I'm using this regex, but it matches with the only second example:
^(\w+),(\s[a-z]{1}[a-z_]+:\s'?[\w\s]+'?,?)$
I'd go with something like:
^(\w+)(?:, ([a-z]\w+): ('[^']*')(?:, ([a-z]\w+): (\w+))?)?
Here's a Rubular example of it.
(?:...) create non-capturing groups which we can easily test for existence using ?. That makes it easy to test for optional chunks.
([a-z]\w+) is an easy way to say "it must start with a letter" while allowing normal alpha, digits and "_".
As far as testing for "Values can be true, false or an string enclosed by quotes", I'd do that in code after capturing. It's way too easy to create a complex pattern, and then be unable to maintain it later. It's better to use simple ones, then look to see whether you got what you expected, than to try to enforce it inside the regex.
in the third example, your regex return 5 matches. It would be better if return only one. It's possible?
I'm not sure what you're asking. This will return a single capture for each, but why you'd want that makes no sense to me if you're capturing parameters to send to a method:
/^(\w+(?:, [a-z]\w+: '[^']*'(?:, [a-z]\w+: \w+)?)?)/
http://rubular.com/r/GLVuSOieI6
There is frequently a choice to be made between attacking an entire string with a single regex or breaking the string up with one or more String methods, and then going after each piece separately. The latter approach often makes debugging and testing easier, and may also make the code intelligible to mere mortals. It's always a judgement call, of course, but I think this problem lends itself well to the divide and conquer approach. This is how I'd do it.
Code
def match?(str)
a = str.split(',')
return false unless a.shift.strip =~ /^\w+$/
a.each do |s|
return false unless ((key_val = s.split(':')).size == 2) &&
key_val.first.strip =~ /^[a-z]\w*$/ &&
key_val.last.strip =~ /^(\'.*?\'|true|false)$/
end
true
end
Examples
match?("first_paramenter, first_hash_key: 'class1 class2',
second_hash_key: true")
#=>true
match?("first_argument, single_hash_key: 'class1 class2'")
#=>true
match?("first_argument_without_second_argument")
#=>true
match?("first_parameter, first_hash_key: 7")
#=>false
match?("dogs and cats, first_hash_key: 'class1 class2'")
#=>false
match?("first_paramenter, first_hash_key: 'class1 class2',
second_hash_key: :true")
#=>false
You've got the basic idea, you have a bunch of small mistakes in there
/^(\w+)(,\s[a-z][a-z_]+:\s('[^']*'|true|false))*$/
explained:
/^(\w+) # starts with a word
(
,\s # the comma goes _inside_ the parens since its optional
[a-z][a-z_]+:\s # {1} is completely redundant
( # use | in a capture group to allow different possible keys
'[^']*' | # note that '? doesn't make sure that the quotes always match
true |
false
)
)*$/x # can have 0 or more hash keys after the first word

Resources