Ruby Regex doesn't terminate [duplicate] - ruby

I'm trying to execute this code:
import re
pattern = r"(\w+)\*([\w\s]+)*/$"
re_compiled = re.compile(pattern)
results = re_compiled.search('COPRO*HORIZON 2000 HOR')
print(results.groups())
But Python does not respond. The process takes 100% of the CPU and does not stop. I've tried this both on Python 2.7.1 and Python 3.2 with identical results.

Your regex runs into catastrophic backtracking because you have nested quantifiers (([...]+)*). Since your regex requires the string to end in / (which your example does not), the regex engine tries every way of splitting the string between the repetitions in the vain hope of finding a matching combination. That's where it gets stuck.
To illustrate, let's assume "A*BCD" as the input to your regex and see what happens:
(\w+) matches A. Good.
\* matches *. Yay.
[\w\s]+ matches BCD. OK.
/ fails to match (no characters left to match). OK, let's back up one character.
/ fails to match D. Hum. Let's back up some more.
[\w\s]+ matches BC, and the repeated [\w\s]+ matches D.
/ fails to match. Back up.
/ fails to match D. Back up some more.
[\w\s]+ matches B, and the repeated [\w\s]+ matches CD.
/ fails to match. Back up again.
/ fails to match D. Back up some more, again.
How about [\w\s]+ matches B, repeated [\w\s]+ matches C, repeated [\w\s]+ matches D? No? Let's try something else.
[\w\s]+ matches BC. Let's stop here and see what happens.
Darn, / still doesn't match D.
[\w\s]+ matches B.
Still no luck. / doesn't match C.
Hey, the whole group is optional (...)*.
Nope, / still doesn't match B.
OK, I give up.
Now that was a string of just three letters. Yours had about 30, trying all permutations of which would keep your computer busy until the end of days.
I suppose what you're trying to do is to get the strings before/after *, in which case, use
pattern = r"(\w+)\*([\w\s]+)$"
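To get a feel for the numbers, one can count how many ways the nested group ([\w\s]+)* can cut a run of n characters into consecutive non-empty pieces. The sketch below is only a counting model of the walkthrough above, not a description of how the re module works internally; count_splits is a hypothetical helper name, and the 30-character figure is taken from the estimate above.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n):
    """Number of ways to cut n characters into consecutive non-empty chunks."""
    if n == 0:
        return 1  # nothing left to split: exactly one (empty) way
    # The first chunk takes k characters; the remainder is split recursively.
    return sum(count_splits(n - k) for k in range(1, n + 1))


# "BCD" can be split as BCD, BC|D, B|CD and B|C|D.
three = count_splits(3)
# A 30-character run already allows 2**29 different splits.
thirty = count_splits(30)

# With the nested quantifier removed, the search returns immediately.
fixed = re.compile(r"(\w+)\*([\w\s]+)$")
groups = fixed.search("COPRO*HORIZON 2000 HOR").groups()

print(three, thirty, groups)
```

The count doubles with every extra character, which is why a failing match on a longer string appears to hang.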

Try re2 or any other regular expression engine based on automata theory. The one in the current Python re module is a simple and slow backtracking engine (for now; things may change in the future). But automata-based engines have some restrictions; for example, they won't let you use backreferences. Check the re2 syntax page to find out whether it will satisfy your needs.

Interestingly, Perl runs it very quickly
-> perl -e 'print "Match\n" if "COPRO*HORIZON 2000 HOR" =~ m|(\w+)\*([\w\s]+)*/$|'
-> perl -e 'print "Match\n" if "COPRO*HORIZON 2000 HOR/" =~ m|(\w+)\*([\w\s]+)*/$|'
Match

Looks like it might be something in your pattern. I'm not sure what you are trying to do with the last '*' in your expression. The following code seems to work for me:
import re
pattern = r"(\w+)\*([\w\s]+)$"
re_compiled = re.compile(pattern)
results = re_compiled.search('COPRO*HORIZON 2000 HOR')
print(results.groups())

Related

Ruby Koans - Why are repetition operators called "greedy" [duplicate]

Can someone explain these two terms (greedy and lazy) in an understandable way?
Greedy will consume as much as possible. From http://www.regular-expressions.info/repeat.html we see the example of trying to match HTML tags with <.+>. Suppose you have the following:
<em>Hello World</em>
You may think that <.+> (. means any non newline character and + means one or more) would only match the <em> and the </em>, when in reality it will be very greedy, and go from the first < to the last >. This means it will match <em>Hello World</em> instead of what you wanted.
Making it lazy (<.+?>) will prevent this. By adding the ? after the +, we tell it to repeat as few times as possible, so the first > it comes across, is where we want to stop the matching.
I'd encourage you to download RegExr, a great tool that will help you explore Regular Expressions - I use it all the time.
'Greedy' means match longest possible string.
'Lazy' means match shortest possible string.
For example, the greedy h.+l matches 'hell' in 'hello' but the lazy h.+?l matches 'hel'.
Greedy quantifier    Lazy quantifier    Description
*                    *?                 Star quantifier: 0 or more
+                    +?                 Plus quantifier: 1 or more
?                    ??                 Optional quantifier: 0 or 1
{n}                  {n}?               Quantifier: exactly n
{n,}                 {n,}?              Quantifier: n or more
{n,m}                {n,m}?             Quantifier: between n and m
Add a ? to a quantifier to make it ungreedy, i.e. lazy.
Example:
test string: stackoverflow
greedy regular expression: s.*o output: stackoverflo
lazy regular expression: s.*?o output: stacko
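The same comparison can be checked quickly in Python. This is a small sketch; the "aaaaa" input for the bounded {n,m} form is an illustrative string that is not part of the answer above.

```python
import re

text = "stackoverflow"
greedy = re.search(r"s.*o", text).group()   # runs on to the last "o"
lazy = re.search(r"s.*?o", text).group()    # stops at the first "o"

# The same contrast for the bounded form {n,m} versus {n,m}?.
bounded_greedy = re.search(r"a{2,4}", "aaaaa").group()
bounded_lazy = re.search(r"a{2,4}?", "aaaaa").group()

print(greedy, lazy, bounded_greedy, bounded_lazy)
```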
Greedy means your expression will match as large a group as possible, lazy means it will match the smallest group possible. For this string:
abcdefghijklmc
and this expression:
a.*c
A greedy match will match the whole string, and a lazy match will match just the first abc.
As far as I know, most regex engines are greedy by default. Adding a question mark after a quantifier enables lazy matching.
As @Andre S mentioned in a comment:
Greedy: Keep searching until condition is not satisfied.
Lazy: Stop searching once condition is satisfied.
Refer to the example below for what is greedy and what is lazy.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String money = "100000000999";

        // Greedy: 0* takes as many zeros as it can.
        String greedyRegex = "100(0*)";
        Pattern pattern = Pattern.compile(greedyRegex);
        Matcher matcher = pattern.matcher(money);
        while (matcher.find()) {
            System.out.println("I'm greedy and I want " + matcher.group() + " dollars. This is the most I can get.");
        }

        // Lazy: 0*? takes as few zeros as it can (here, none).
        String lazyRegex = "100(0*?)";
        pattern = Pattern.compile(lazyRegex);
        matcher = pattern.matcher(money);
        while (matcher.find()) {
            System.out.println("I'm too lazy to get so much money, only " + matcher.group() + " dollars is enough for me");
        }
    }
}
The result is:
I'm greedy and I want 100000000 dollars. This is the most I can get.
I'm too lazy to get so much money, only 100 dollars is enough for me
Taken from www.regular-expressions.info:
Greediness: A greedy quantifier first tries to repeat the token as many times as possible, and gradually gives up matches as the engine backtracks to find an overall match.
Laziness: A lazy quantifier first repeats the token as few times as required, and gradually expands the match as the engine backtracks through the regex to find an overall match.
From Regular expression:
The standard quantifiers in regular expressions are greedy, meaning they match as much as they can, only giving back as necessary to match the remainder of the regex.
By using a lazy quantifier, the expression tries the minimal match first.
Greedy matching. The default behavior of regular expressions is to be greedy. That means it tries to extract as much as possible until it conforms to a pattern even when a smaller part would have been syntactically sufficient.
Example:
import re
text = "<body>Regex Greedy Matching Example </body>"
re.findall('<.*>', text)
#> ['<body>Regex Greedy Matching Example </body>']
Instead of matching till the first occurrence of ‘>’, it extracted the whole string. This is the default greedy or ‘take it all’ behavior of regex.
Lazy matching, on the other hand, ‘takes as little as possible’. This can be effected by adding a ? after the quantifier.
Example:
re.findall('<.*?>', text)
#> ['<body>', '</body>']
If you want only the first match to be retrieved, use the search method instead.
re.search('<.*?>', text).group()
#> '<body>'
Source: Python Regex Examples
Greedy Quantifiers are like the IRS
They’ll take as much as they can. e.g. matches with this regex: .*
$50,000
Bye-bye bank balance.
See here for an example: Greedy-example
Non-greedy quantifiers - they take as little as they can
Ask for a tax refund: the IRS suddenly becomes non-greedy and returns as little as possible, i.e. they use this quantifier:
(.{2,5}?)([0-9]*) against this input: $50,000
The first group is non-greedy and only matches $5 – so I get a $5 refund against the $50,000 input.
See here: Non-greedy-example.
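In case the linked examples are unavailable, the refund can be reproduced with a few lines of Python. This is a sketch only; the greedy variant (.{2,5}) is added here for comparison and is not part of the analogy above.

```python
import re

# Lazy first group: takes the minimum of two characters.
refund, rest = re.match(r"(.{2,5}?)([0-9]*)", "$50,000").groups()

# Greedy first group, for comparison: takes the maximum of five characters.
greedy_refund = re.match(r"(.{2,5})([0-9]*)", "$50,000").group(1)

print(refund, rest, greedy_refund)
```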
Why do we need greedy vs non-greedy?
It becomes important if you are trying to match certain parts of an expression. Sometimes you don't want to match everything, just as little as possible. Sometimes you want to match as much as possible. Nothing more to it.
You can play around with the examples in the links posted above.
(Analogy used to help you remember).
Greedy means it will consume your pattern until there are none of them left and it can look no further.
Lazy will stop as soon as it will encounter the first pattern you requested.
One common example that I often encounter is the \s*-\s*? part of the regex ([0-9]{2}\s*-\s*?[0-9]{7})
The first \s* is greedy because of the * and will match as many whitespace characters as possible after the digits before looking for the dash character "-", whereas the second \s*? is lazy because of the *?, which means it will match as few whitespace characters as it can, taking more only when the digits that follow cannot match otherwise.
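A short Python check makes the difference visible. In this sketch the two whitespace runs are wrapped in capture groups so they can be inspected; the capture groups and the sample inputs are assumptions added for illustration.

```python
import re

pattern = re.compile(r"[0-9]{2}(\s*)-(\s*?)[0-9]{7}")

# Two spaces before the dash, three after it.
before, after = pattern.search("12  -   3456789").groups()

# When nothing has to follow, the lazy form is satisfied with zero characters,
# while the greedy form still takes everything it can.
lazy_tail = re.search(r"-(\s*?)", "-   ").group(1)
greedy_tail = re.search(r"-(\s*)", "-   ").group(1)

print(repr(before), repr(after), repr(lazy_tail), repr(greedy_tail))
```

In the first search the lazy \s*? still ends up holding all three spaces, because the seven digits cannot match until the spaces have been consumed.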
Best shown by example: take the string 192.168.1.1 and the greedy regex \b.+\b
You might think this would give you the first octet, but it actually matches the whole string. Why? Because the .+ is greedy, and a greedy match consumes every character in 192.168.1.1 until it reaches the end of the string. This is the important bit! Now it starts to backtrack one character at a time until it finds a match for the third token (\b).
If the string were a 4GB text file and 192.168.1.1 was at the start, you could easily see how this backtracking would cause an issue.
To make a regex non-greedy (lazy), put a question mark after your greedy quantifier, e.g.
*?
??
+?
What happens now is token 2 (+?) finds a match, regex moves along a character and then tries the next token (\b) rather than token 2 (+?). So it creeps along gingerly.
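The two behaviours described above can be compared side by side in Python (a minimal sketch using the same string):

```python
import re

ip = "192.168.1.1"

greedy = re.search(r"\b.+\b", ip).group()   # .+ grabs everything first
lazy = re.search(r"\b.+?\b", ip).group()    # .+? creeps forward to the first boundary

print(greedy, lazy)
```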
To give extra clarification on laziness, here is one example which may not be intuitive at first look but illustrates the idea of "gradually expands the match" from Suganthan Madhavan Pillai's answer.
input -> some.email#domain.com#
regex -> ^.*?#$
This regex will match the input. At first glance, somebody could say the lazy match (".*?#") will stop at the first #, after which it will check that the input string ends ("$"). Following this logic, someone would conclude there is no match because the input string doesn't end after the first #.
But as you can see, this is not the case: even though we are using a non-greedy (lazy) search, the regex keeps going forward until it hits the second # and produces the MINIMAL match that satisfies the whole pattern.
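This is easy to confirm in Python. The sketch below uses the input exactly as written above and adds, for contrast, the same pattern without the $ anchor (that second pattern is an addition for illustration).

```python
import re

text = "some.email#domain.com#"

# With $, the lazy .*? has to keep expanding until the final "#".
anchored = re.search(r"^.*?#$", text).group()

# Without $, the lazy .*? is free to stop at the first "#".
unanchored = re.search(r"^.*?#", text).group()

print(anchored, unanchored)
```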
try to understand the following behavior:
var input = "0014.2";
Regex r1 = new Regex("\\d+.{0,1}\\d+");
Regex r2 = new Regex("\\d*.{0,1}\\d*");
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // "0014.2"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // " 0014"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // ""

meaning of a `+` following a `*`, when the latter is used as a quantifier in a regular expression

Today I came across the following regular expression and wanted to know what Ruby would do with it:
> "#a" =~ /^[\W].*+$/
=> 0
> "1a" =~ /^[\W].*+$/
=> nil
In this instance, Ruby seems to be ignoring the + character. If that is incorrect, I'm not sure what it is doing with it. I'm guessing it's not being interpreted as a quantifier, since the * is not escaped and is being used as a quantifier. In Perl/Ruby regexes, sometimes when a character (e.g., -) is used in a context in which it cannot be interpreted as a special character, it is treated as a literal. But if that was happening in this case, I would expect the first match to fail, since there is no + in the lvalue string.
Is this a subtly correct use of the + character? Is the above behavior a bug? Am I missing something obvious?
Well, you can certainly use a + after a *. You can read a bit about it on this site. The + after the * is called a possessive quantifier.
What it does? It prevents * from backtracking.
Ordinarily, when you have something like .*c and use it to match abcde, the .* will first match the whole string (abcde), and since the regex cannot match c after the .*, the engine will go back one character at a time to check if there is a match (this is backtracking).
Once it has backtracked to c, you will get the match abc from abcde.
Now, imagine that the engine has to backtrack a few hundred characters; if you have nested groups and multiple * (or + or the {m,n} form), you can quickly end up with thousands or millions of backtracking steps, which is called catastrophic backtracking.
This is where possessive quantifiers come in handy. They prevent any backtracking into the quantified token. With the regex I mentioned above, abcde will not be matched by .*+c: once .*+ has consumed the whole string, it cannot backtrack, and since there's no c after the end of the string, the match fails.
So, another possible use of possessive quantifiers is that they can improve the performance of some regexes, provided the engine can support it.
For your regex /^[\W].*+$/, I don't think that there's any improvement (maybe a tiny little improvement) that the possessive quantifier provides though. And last, it might easily be rewritten as /^\W.*+$/.
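For readers who want to try the .*c versus .*+c contrast in Python: the re module accepts possessive quantifiers only from Python 3.11 onward, so the sketch below emulates one with a lookahead plus a backreference. That emulation is a workaround chosen for this illustration, not something taken from the answer above.

```python
import re

text = "abcde"

# Ordinary greedy quantifier: .* backtracks until "c" can match.
backtracking = re.search(r".*c", text)

# Emulated possessive match: the lookahead captures the greedy run once, and
# the backreference \1 must then consume exactly that text, so the engine
# cannot hand characters back to let "c" match.
emulated_possessive = re.search(r"(?=(.*))\1c", text)

print(backtracking.group(), emulated_possessive)
```

The emulation works because a lookahead is not re-entered once it has succeeded, which is the same "no giving back" behaviour that .*+ provides directly.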

Understanding negative look aheads in regular expressions

I want to match urls that do NOT contain the string 'localhost' using Ruby regex
Based on answers and comments here, I put together two solutions, both of which seem to work:
Solution A:
(?!.*localhost)^.*$
Example: http://rubular.com/r/tQtbWacl3g
Solution B:
^((?!localhost).)*$
Example: http://rubular.com/r/2KKnQZUMwf
The problem is that I don't understand what they're doing. For example, according to the docs, ^ can be used in various ways:
[^abc] Any single character except: a, b, or c
^ Start of line
But I don't get how it's being applied here.
Can someone breakdown these expressions for me, and how they differ from one another?
In both of your cases, ^ is just the start of the line (since it's not used inside a character class). Since both ^ and the lookahead are zero-width assertions, we can switch them around in the first case - I think that makes it a bit easier to explain:
^(?!.*localhost).*$
The ^ anchors the expression to the beginning of the string. The lookahead then starts from that position and tries to find localhost anywhere in the string (the "anywhere" is taken care of by the .* in front of localhost). If localhost can be found, the subexpression of the lookahead matches and therefore the negative lookahead causes the pattern to fail. Since the lookahead is bound to start at the beginning of the string by the adjacent ^, this means the pattern overall cannot match. If, however, the .*localhost does not match (and hence localhost does not occur in the string), the lookahead succeeds, and the .*$ simply takes care of matching the rest of the string.
Now the other one
^((?!localhost).)*$
This time the lookahead only checks at the current position (there is no .* inside it). But the lookahead is repeated for every single character. This way it does check every single position again. Here is roughly what happens: the ^ makes sure that we're starting at the beginning of the string again. The lookahead checks whether the word localhost is found at that position. If not, all is well, and . consumes one character. The * then repeats both of those steps. We are now one character further in the string, and the lookahead checks whether the second character starts the word localhost - again, if not, all is well, and . consumes another character. This is done for every single character in the string, until we reach the end.
In this particular case both methods are equivalent, and you could select one based on performance (if it matters) or readability (if not; probably the first one). However, in other cases the second variant is preferable, because it allows you to do this repetition for a fixed part of the string, whereas the first variant will always check the entire string.
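Both patterns use syntax that Python's re module also accepts, so their equivalence on this task can be checked with a small sketch. The URLs are made-up sample values.

```python
import re

solution_a = re.compile(r"^(?!.*localhost).*$")   # one lookahead scanning the whole string
solution_b = re.compile(r"^((?!localhost).)*$")   # one lookahead per consumed character

urls = [
    "http://example.com/",
    "http://localhost:1234/toto",
    "http://my.localhost.test/",
]

results_a = [bool(solution_a.match(u)) for u in urls]
results_b = [bool(solution_b.match(u)) for u in urls]

print(results_a, results_b)
```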
You can get them easily explained online. The first:
NODE EXPLANATION
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
localhost 'localhost'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
--------------------------------------------------------------------------------
' '
And the second:
NODE EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
( group and capture to \1 (0 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
localhost 'localhost'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
. any character except \n
--------------------------------------------------------------------------------
)* end of \1 (NOTE: because you are using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
--------------------------------------------------------------------------------
As an aside comment, these two solutions are slow. A better way is to use:
^(?:[^l]+|l(?!ocalhost))+
In other words: all characters that are not an l, or an l not followed by ocalhost
This will give you a better result since you don't have to check each position. (For a URL like http://localhost:1234/toto this kind of pattern will fail in ~15 steps vs ~50 steps for the two other patterns)
You can improve this pattern using atomic groups and possessive quantifiers to forbid backtracks:
^(?>[^l]++|l(?!ocalhost))++
Note that in your particular case you can speed up your pattern considering that you only want to check the host part of the url. Example:
^http:\/\/(?>[^l\s\/]++|l(?!ocalhost))++(?>\/\S*+|$)
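A rough Python rendering of the "not an l, or an l not followed by ocalhost" idea follows. Two things in it are assumptions of this sketch rather than the patterns as posted: the inner + is dropped (Python's re only gained atomic groups and possessive quantifiers in 3.11, and without them a nested + could backtrack heavily), and a $ is added so that the whole URL has to pass the test. The URLs are made-up sample values.

```python
import re

# Each step consumes one character: either a non-"l", or an "l" that does not
# start the word "localhost".
no_localhost = re.compile(r"^(?:[^l]|l(?!ocalhost))+$")

urls = [
    "http://example.com/hello",
    "http://localhost:1234/toto",
]
results = [bool(no_localhost.match(u)) for u in urls]

print(results)
```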
according to the docs, ^ can be used in various ways:
[^abc] Any single character except: a, b, or c
^ Start of line
But I don't get how it's being applied here.
In the regex
(?!.*localhost)^.*$
The ^ is not inside any brackets, so the second one applies. Here is a trivial example:
/^x/
That regex says to match the start of the line, followed by the letter x. So it will match lines like this:
xcellent
x-ray
However, the regex will not match the lines:
axb
excellent
...because the x does not appear directly after the start of the line. You may wonder why 'axb' doesn't match. After all 'a' is the start of the line, and it is followed by an 'x'. However, 'start of the line' is just to the left of the first character, like this:
|
V
axb
^ is called a zero-width match because it matches the slim sliver just to the left of the 'a', e.g. between the starting quote mark and the 'a' in "axb". There's not really any space there, so ^ matches something that is 0 width.
Here is another example:
/x^/
That says to match the character x followed by the start of the line. Well, no line can have an x first and then the start of the line second, so that won't ever match anything.
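The same experiments can be run in Python, where ^ behaves the same way for single-line strings. A small sketch using the example lines from above:

```python
import re

lines = ["xcellent", "x-ray", "axb", "excellent"]

# ^x: an x directly after the start of the line.
starts_with_x = [bool(re.search(r"^x", line)) for line in lines]

# x^: an x followed by the start of the line, which can never happen here.
x_then_start = [bool(re.search(r"x^", line)) for line in lines]

# ^ on its own matches, but the match is empty: it has zero width.
start_span = re.search(r"^", "axb").span()

print(starts_with_x, x_then_start, start_span)
```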
Now your regex:
(?!.*localhost)^.*$
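Like the 'start of line' ^, a lookahead is zero-width. What that means is that the lookahead scans ahead in the string to test its subpattern (here, .*localhost), but it consumes no characters: whether or not it finds something, the engine comes back to where the lookahead started. Because this one is a negative lookahead, the match only continues if localhost was not found, and the engine then looks for the rest of the regex: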
^.*$
One word of advice: when a regex requires lookarounds (lookaheads or lookbehinds), 99% of the time there are easier ways to do what you want. For instance, you could write:
url = "....."
if url.index('http') == 0
  # then the line starts with 'http'
else
  # the line doesn't start with 'http'
end
That's much easier to read, and it doesn't require trying to decipher a complex regex.
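Applied to the original task of rejecting URLs that contain localhost, the same regex-free approach might look like this in Python. It is only a sketch with made-up URLs; str.startswith and the in operator stand in for the Ruby index check above.

```python
urls = [
    "http://example.com/",
    "http://localhost:1234/toto",
    "ftp://files.example.com/",
]

# Plain string tests instead of anchors and lookaheads.
starts_with_http = [u.startswith("http") for u in urls]
without_localhost = [u for u in urls if "localhost" not in u]

print(starts_with_http, without_localhost)
```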

Multi-Line Regex: Find A where B is absent

I have been looking through a lot on Regex lately and have seen a lot of answers involving the matching of one word, where a second word is absent. I have seen a lot of Regex Examples where I can have a Regex search for a given word (or any more complex regex in its place) and find where a word is missing.
It seems like this works very well on a line by line basis, but after including the multi-line mode it still doesn't seem to match properly.
Example: Match an entire file string where the word foo is included, but the word bar is absent from the file. What I have so far is (?m)^(?=.*?(foo))((?!bar).)*$ which is based off the example link. I have been testing with a Ruby Regex tester, but I think it is a open ended regex problem/question. It seems to match smaller pieces, I would like to have it either match/not match on the entire string as one big chunk.
In the provided example above, matches are found on a line by line basis it seems. What changes need to be made to the regex so it applies over the ENTIRE string?
EDIT: I know there are other, more efficient ways to solve this problem that don't involve using a regex. I am not looking for a solution to the problem using other means; I am asking from a theoretical regex point of view. It has a multi-line mode (which looks to "work"), and it has negative/positive searching which can be combined on a line by line basis, so how come combining these two principles doesn't yield the expected result?
Sawa's answer can be simplified, all that's needed is a positive lookahead, a negative lookahead, and since you're in multiline mode, .* takes care of the rest:
/(?=.*foo)(?!.*bar).*/m
Multiline means that . matches \n also, and matches are greedy. So the whole string will match without the need for anchors.
Update
@Sawa makes a good point about the \A being necessary but not the \Z.
Actually, looking at it again, the positive lookahead seems unnecessary:
/\A(?!.*bar).*foo.*/m
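Why the \A matters can be seen in a Python sketch. Ruby's /m option (dot matches newline) corresponds to re.DOTALL in Python; the sample strings are made up for the demonstration.

```python
import re

unanchored = re.compile(r"(?=.*foo)(?!.*bar).*", re.DOTALL)
anchored = re.compile(r"\A(?!.*bar).*foo.*", re.DOTALL)

good = "line one\nfoo here\nline three"   # contains foo, no bar
bad = "bar first\nfoo later"              # contains both

whole_match = anchored.search(good).group()
rejected = anchored.search(bad)

# Without \A the search simply restarts one character later, just past the
# "b" of "bar", and reports a partial match.
partial = unanchored.search(bad).group()

print(repr(whole_match), rejected, repr(partial))
```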
A regex that matches an entire string that does not include bar is:
/\A(?!.*bar.*).*\z/m
and a regex that matches from the beginning of an entire string that includes foo is:
/\A.*foo/m
Since you want to satisfy both of these, take a conjunction of these by putting one of them in a lookahead:
/\A(?=.*foo)(?!.*bar.*).*\z/m

Regular expression syntax

I have a problem similar to a previously asked question, but similar practices apparently do not produce similar results.
Previous Question
New question - I want to match the lines beginning in T as the first match, and the following lines beginning with X as the second match (as a whole string, to be later matched by another regex)
What I have so far is (^T(\d+)\n(.*?)(?:the_problem)/m) I don't know what to replace "the_problem" with, or even if that is the issue. I assumed some rendition (?:\n|\z), but apparently not. Everything I tried, would not count the next occurrence of ^T(\d+) as the start of a new group, and continue to capture all of the lines between each occurrence, at the same time.
Sample text:
T01C0.025
T02C0.035
T03C0.055
T04C0.150
T05C0.065
T06C0.075
%
G05
G90
T01
X011200Y004700
X011200Y009700
X018500Y011200
X013500Y-011200
X023800Y019500
T02
X034800Y017800
X-033800Y-017800
X032800Y017800
T03
X036730Y003000
X038700Y003000
X040668Y-003000
X059230Y003000
T04
X110580Y017800
X023800Y027300
X095500Y028500
X005500Y-006500
X021500Y-006500
T05
X003950Y002000
X003950Y004500
X003950Y007000
T06
X026300Y027300
M30
I only want to capture the shorter version of T01, T02,...T0n, not the longer version at the top, then the entire collection of ^X(-?\d+)Y(-?\d+) that follows it, as another match.
Result 1.
Match 1. T01
Match 2. X011200Y004700
X011200Y009700
X018500Y011200
X013500Y-011200
X023800Y019500
Result 2.
Match 1. T02
Match 2. X034800Y017800
X-033800Y-017800
X032800Y017800
Result 3.
Match 1. T03
Match 2. X036730Y003000
X038700Y003000
....etc....
Thanks in advance for any help ;-) Note: I prefer to use raw Ruby, without extensions or plugins. My version of ruby is 1.8.6.
Try this instead:
^(T[^\s]+)[\n\r\s]((?:(?:X\S+)[\n\r\s])+)
It makes the groups for the X lines into non-capturing groups, then puts all the repetitions of the final pattern into a single group. All the X lines will be in a single capture.
You can test this using Rubular (an indispensable tool for developing regular expressions) http://rubular.com/r/PRnurKy64Q
this seems to work...
^(T[^\s]+)[\n\r\s]((X[^\s]+)[\n\r\s]){1,}
I'm not totally sure I understand your problem, but I'll give this a shot. It looks like you want:
/^(T\d+)\n((?:X[-A-Z\d]+\n)+)/
In Ruby, ^ and $ always match after and before newlines, so no extra mode is needed, but the newline between lines has to be consumed explicitly. Ruby also has no /g flag, so use String#scan to collect every match. Word of caution: I don't have much practice with multiline regex, so you might want to sanity-check this against your data.
Also, I notice you didn't include the lines similar to T01C0.025 in your sample results, so I made the T\d+ assumption based on that.
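The two-stage approach from the question (first grab each T block with its X lines, then match the coordinates with a second regex) can be sketched in Python as follows. The sample is a shortened copy of the data above; converting the coordinates to integers and the names block_re, coord_re and tools are assumptions made for this sketch. The question itself asks for Ruby 1.8.6, so this only illustrates the regex logic.

```python
import re

sample = """T01C0.025
T02C0.035
%
G05
T01
X011200Y004700
X011200Y009700
T02
X034800Y017800
X-033800Y-017800
M30
"""

# A T line consisting only of digits, followed by one or more X lines.
block_re = re.compile(r"^(T\d+)\n((?:X\S+\n)+)", re.MULTILINE)
# The coordinate pattern from the question, applied to each captured block.
coord_re = re.compile(r"^X(-?\d+)Y(-?\d+)$", re.MULTILINE)

tools = {}
for tool, body in block_re.findall(sample):
    tools[tool] = [(int(x), int(y)) for x, y in coord_re.findall(body)]

print(tools)
```

The header lines such as T01C0.025 are skipped because the pattern requires a newline immediately after the digits.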
