ruby regex hangs - ruby

I wrote a ruby script to process a large amount of documents and use the following URI to extract URIs from a document's string representation:
#Taken from: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
URI_REGEX = /
( # Capture 1: entire matched URL
(?:
[a-z][\w-]+: # URL protocol and colon
(?:
\/{1,3} # 1-3 slashes
| # or
[a-z0-9%] # Single letter or digit or '%'
)
| # or
www\d{0,3}[.] # "www.", "www1.", "www2." … "www999."
| # or
[a-z0-9.\-]+[.][a-z]{2,4}\/ # looks like domain name followed by a slash
)
(?: # One or more:
[^\s()<>]+ # Run of non-space, non-()<>
| # or
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
)+
(?: # End with:
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
)
)/xi
It works pretty well for 99.9 percent of all documents but always hangs up my script when it encounters the following token in of the documents: token = "synsem:local:cat:(subcat:SubMot,adjuncts:Adjs,subj:Subj),"
I am using the standard ruby regexp oeprator: token =~ URI_REGEX and I don't get any exception or error message.
First I tried to solve the problem encapsulating the regex evaluation into a Timeout::timeoutblock, but this degrades performance to much.
Any other ideas on how to solve this problem?

Your problem is catastrophic backtracking. I just loaded your regex and your test string into RegexBuddy, and it gave up after 1.000.000 iterations of the regex engine (and from the looks of it, it would have gone on for many millions more had it not aborted).
The problem arises because some parts of your text can be matched by different parts of your regex (which is horribly complicated and painful to read); it seems that the "One or more:" part of your regex and the "End with:" part struggle over the match (when it's not working), trying out millions of permutations that all fail.
It's difficult to suggest a solution without knowing what the rules for matching a URI are (which I don't). All this balancing of parentheses suggests to me that regexes may not be the right tool for the job. Maybe you could break down the problem. First use a simple regex to find everything that looks remotely like a URI, then validate that in a second step (isn't there a URI parser for Ruby of some sort?).
Another thing you might be able to do is to prevent the regex engine from backtracking by using atomic groups. If you can change some (?:...) groups into (?>...) groups, that would allow the regex to fail faster by disallowing backtracking into those groups. However, that might change the match and make it fail on occasions where backtracking is necessary to achieve a match at all - so that's not always an option.

Why reinvent the wheel?
require 'uri'
uri_list = URI.extract("Text containing URIs.")

URI.extract("Text containing URIs.") is the best solution if you only need the URIs.
I finally used pat = URI::Parser.new.make_regexp('http')to get the built-in URI parsing regexp and use it in match = str.match(pat, start_pos) to iteratively parse the input text URI by URI. I am doing this because I also need the URI positions in the text and the returned match object gives me this information match.begin(0).

Related

Regex for matching everything before trailing slash, or first question mark?

I'm trying to come up with a regex that will elegantly match everything in an URL AFTER the domain name, and before the first ?, the last slash, or the end of the URL, if neither of the 2 exist.
This is what I came up with but it seems to be failing in some cases:
regex = /[http|https]:\/\/.+?\/(.+)[?|\/|]$/
In summary:
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price/ should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price?id=2 should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
Please don't use Regex for this. Use the URI library:
require 'uri'
str_you_want = URI("http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price").path
Why?
See everything about this famous question for a good discussion of why these kinds of things are a bad idea.
Also, this XKCD really says why:
In short, Regexes are an incredibly powerful tools, but when you're dealing with things that are made from hundred page convoluted standards when there is already a library for doing it faster, easier, and more correctly, why reinvent this wheel?
If lookaheads are allowed
((2[0-9][0-9][0-9].*)(?=\?\w+)|(2[0-9][0-9][0-9].*)(?=/\s+)|(2[0-9][0-9][0-9].*).*\w)
Copy + Paste this in http://regexpal.com/
See here with ruby regex tester: http://rubular.com/r/uoLLvTwkaz
Image using javascript regex, but it works out the same
(?=) is just a a lookahead
I basically set up three matches from 2XXX up to (in this order):
(?=\?\w+) # lookahead for a question mark followed by one or more word characters
(?=/\s+) # lookahead for a slash followed by one or more whitespace characters
.*\w # match up to the last word character
I'm pretty sure that some parentheses were not needed but I just copy pasted.
There are essentially two OR | expressions in the (A|B|C) expression. The order matters since it's like a (ifthen|elseif|else) type deal.
You can probably fix out the prefix, I just assumed that you wanted 2XXX where X is a digit to match.
Also, save the pitchforks everyone, regular expressions are not always the best but it's there for you when you need it.
Also, there is xkcd (https://xkcd.com/208/) for everything:

Regex - Matching text AFTER certain characters

I want to scrape data from some text and dump it into an array. Consider the following text as example data:
| Example Data
| Title: This is a sample title
| Content: This is sample content
| Date: 12/21/2012
I am currently using the following regex to scrape the data that is specified after the 'colon' character:
/((?=:).+)/
Unfortunately this regex also grabs the colon and the space after the colon. How do I only grab the data?
Also, I'm not sure if I'm doing this right.. but it appears as though the outside parens causes a match to return an array. Is this the function of the parens?
EDIT: I'm using Rubular to test out my regex expressions
You could change it to:
/: (.+)/
and grab the contents of group 1. A lookbehind works too, though, and does just what you're asking:
/(?<=: ).+/
In addition to #minitech's answer, you can also make a 3rd variation:
/(?<=: ?)(.+)/
The difference here being, you create/grab the group using a look-behind.
If you still prefer the look-ahead rather than look-behind concept. . .
/(?=: ?(.+))/
This will place a grouping around your existing regex where it will catch it within a group.
And yes, the outside parenthesis in your code will make a match. Compare that to the latter example I gave where the entire look-ahead is 'grouped' rather than needlessly using a /( ... )/ without the /(?= ... )/, since the first result in most regular expression engines return the entire matched string.
I know you are asking for regex but I just saw the regex solution and found that it is rather hard to read for those unfamiliar with regex.
I'm also using Ruby and I decided to do it with:
line_as_string.split(": ")[-1]
This does what you require and IMHO it's far more readable.
For a very long string it might be inefficient. But not for this purpose.
In Ruby, as in PCRE and Boost, you may make use of the \K match reset operator:
\K keeps the text matched so far out of the overall regex match. h\Kd matches only the second d in adhd.
So, you may use
/:[[:blank:]]*\K.+/ # To only match horizontal whitespaces with `[[:blank:]]`
/:\s*\K.+/ # To match any whitespace with `\s`
Seee the Rubular demo #1 and the Rubular demo #2 and
Details
: - a colon
[[:blank:]]* - 0 or more horizontal whitespace chars
\K - match reset operator discarding the text matched so far from the overall match memory buffer
.+ - matches and consumes any 1 or more chars other than line break chars (use /m modifier to match any chars including line break chars).

Regex Tag-Within-Tag

I have a fairly simple regex problem for a little personal experiment that I haven't quite figured out.
In a string, I might have several <tag>[some characters here] that I need to match. The obvious way to do it would be with a /<tag>\[.*?\]/ regex, to match any characters after the <tag>[ and before the ].
I'd like to be able to have <tag>s within <tag>s, however. This causes a problem. If I had the following:
<tag>[some characters <tag>[in here] to match]
the regex would stop matching as soon as it reached the first closing-bracket, and completely fail to match the last part of the statement. I've tried to solve the problem by telling the regex to ignore any internal <tag>s, so I can do a match on the stripped contents later. I haven't quite gotten it working. The closest I've come is:
/<tag>\[(.*?(?:<tag>\[.*?\])*?.*?)\]/
which doesn't quite work. I would hope that it would match any number of characters, and any inner tags if they exist. It still has trouble with that first closing bracket, however.
Maybe somebody who's better at regular expressions knows a good solution to this.
Though you should probably drop regex and do this manually if the mini-language becomes more complex, you can use recursive regex.
Your regex would look something like this:
/(?<reg>(\w+\[([^\]\[]|\g<reg>)*\]))/
You can see it in action here: http://rubular.com/r/9F7isgZpj9
Here is the regex broken down to its parts:
(?<reg>( # start a regex named "reg"
\w+ # the tag name
\[ # open bracket
( # which can contain
[^\]\[] # non-bracket characters
| # or
\g<reg> # sub-tags (this is where the magic happens)
)* # zero or more times
\] # close the tag
)
)

Ruby RegEx issue

I'm having a problem getting my RegEx to work with my Ruby script.
Here is what I'm trying to match:
http://my.test.website.com/{GUID}/{GUID}/
Here is the RegEx that I've tested and should be matching the string as shown above:
/([-a-zA-Z0-9#:%_\+.~#?&\/\/=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)([\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/])*?\/)/
3 capturing groups:
group 1: ([-a-zA-Z0-9#:%_\+.~#?&\/\/=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)([\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/])*?\/)
group 2: (\/[-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)
group 3: ([\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/])
Ruby is giving me an error when trying to validate a match against this regex:
empty range in char class: (My RegEx goes here) (SyntaxError)
I appreciate any thoughts or suggestions on this.
You could simplify things a bit by using URI to deal parsing the URL, \h in the regex, and scan to pull out the GUIDs:
uri = URI.parse(your_url)
path = uri.path
guids = path.scan(/\h{8}-\h{4}-\h{4}-\h{4}-\h{12}/)
If you need any of the non-path components of the URL the you can easily pull them out of uri.
You might need to tighten things up a bit depending on your data or it might be sufficient to check that guids has two elements.
You have several errors in your RegEx. I am very sleepy now, so I'll just give you a hint instead of a solution:
...[\/\/[0-9a-fA-F]....
the first [ does not belong there. Also, having \/\/ inside [] is unnecessary - you only need each character once inside []. Also,
...[-a-zA-Z0-9#:%_\+.~#?&\/\/=]{2,256}...
is greedy, and includes a period - indeed, includes all chars (AFAICS) that can come after it, effectively swallowing the whole string (when you get rid of other bugs). Consider {2,256}? instead.

ruby regex need to match only if a URL is found in a string once

I'm trying to add conditional logic to determine if there's one regex match for a URL in a string. Here's an example of the string:
string_to_match = "http://www.twitpic.com/23456 ran to catch the bus, http://www.twitpic.com/3456 dodged a bullet at work."
I only want to match if I determine there's one URL in the string, so the above string wouldn't be a match in the case I'm trying to solve. I thought something like this would work:
if string_to_match =~ /[http\:\/\/]?/
puts "you're matching more then once. bad man!"
end
But it doesn't! How do I determine that there's only one match in a string?
The answer from Mladen is fine (counting the return from scan), but regular expressions already include the idea of matching the same thing multiple times or a particular number of times. In your case, you want to print the warning if your text occurs 2 or more times:
/(http:\/\/.+?){2,}/
Use .+ or .*, depending on whether you want to require the URL to have some content or not. As it stands, the .+? will match 1 or more characters in a non-greedy fashion, which is what you want. A greedy quantifier would gobble up the entire string on the first try and then have to do a bunch of backtracking before ultimately finding multiple URLs.
Take a look at String#scan, you can use it this way:
if string_to_match.scan(/[http\:\/\/]/).count > 1
puts "you're matching more then once. bad man!"
end
you could do it like this:
if string_to_match =~ /((http:\/\/.*?)http:\/\/)+/
this would match only if you have 2 (or more) occurrences of http://

Resources