I'm trying to come up with a regex that will elegantly match everything in a URL AFTER the domain name, and before the first ?, the last slash, or the end of the URL, if neither of the two exists.
This is what I came up with but it seems to be failing in some cases:
regex = /[http|https]:\/\/.+?\/(.+)[?|\/|]$/
In summary:
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price/ should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price?id=2 should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
Please don't use Regex for this. Use the URI library:
require 'uri'
str_you_want = URI("http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price").path
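Note that path keeps the leading slash (and a trailing one, if present), and it already leaves out the query string. To get exactly the strings listed in the question, the slashes still have to be trimmed. A minimal sketch, assuming a hypothetical helper name path_without_slashes:

```ruby
require 'uri'

# Hypothetical helper (not part of the URI library): returns the path
# of a URL with leading and trailing slashes removed.
def path_without_slashes(url)
  # URI#path excludes the query string; gsub trims slashes at both ends.
  URI(url).path.gsub(%r{\A/+|/+\z}, '')
end

base = "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price"
puts path_without_slashes(base + "/")
puts path_without_slashes(base + "?id=2")
puts path_without_slashes(base)
```

All three calls print the same path, which is the result the question asks for.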
Why?
See everything about this famous question for a good discussion of why these kinds of things are a bad idea.
Also, this XKCD really says why:
In short, regexes are incredibly powerful tools, but when you're dealing with things that are defined by hundred-page convoluted standards, and there is already a library that does the job faster, easier, and more correctly, why reinvent the wheel?
If lookaheads are allowed
((2[0-9][0-9][0-9].*)(?=\?\w+)|(2[0-9][0-9][0-9].*)(?=/\s+)|(2[0-9][0-9][0-9].*).*\w)
Copy + Paste this in http://regexpal.com/
See here with ruby regex tester: http://rubular.com/r/uoLLvTwkaz
Image using javascript regex, but it works out the same
(?=) is just a lookahead
I basically set up three matches from 2XXX up to (in this order):
(?=\?\w+) # lookahead for a question mark followed by one or more word characters
(?=/\s+) # lookahead for a slash followed by one or more whitespace characters
.*\w # match up to the last word character
I'm pretty sure that some parentheses were not needed but I just copy pasted.
There are essentially two OR | expressions in the (A|B|C) expression. The order matters since it's like a (ifthen|elseif|else) type deal.
You can probably fix up the prefix; I just assumed that you wanted to match 2XXX, where X is a digit.
Also, save the pitchforks, everyone: regular expressions are not always the best tool, but they're there for you when you need them.
Also, there is xkcd (https://xkcd.com/208/) for everything:
I want to be able to match all the following cases below using Ruby 1.8.7.
/pages/multiedit/16801,16809,16817,16825,16833
/pages/multiedit/16801,16809,16817
/pages/multiedit/16801
/pages/multiedit/1,3,5,7,8,9,10,46
I currently have:
\/pages\/multiedit\/\d*
This matches up to the first set of numbers. So for example:
"/pages/multiedit/16801,16809,16817,16825,16833"[/\/pages\/multiedit\/\d*/]
# => "/pages/multiedit/16801"
See http://rubular.com/r/ruFPx5yIAF for example.
Thanks for the help, regex gods.
\/pages\/multiedit\/\d+(?:,\d+)*
Example: http://rubular.com/r/0nhpgki6Gy
Edit: Updated to not capture anything... Although the performance hit would be negligible. (Thanks Tin Man)
The currently accepted answer of
\/pages\/multiedit\/[\d,]+
may not be a good idea because that will also match the following strings
.../pages/multiedit/,,,
.../pages/multiedit/,1,
My answer requires there be at least one digit before the first comma, and at least one digit between commas, and it must end with a digit.
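To make the difference concrete, here is a small sketch comparing the two patterns on the strings mentioned above. The constant names STRICT and LOOSE and the helper matched are made up for illustration:

```ruby
# STRICT is the pattern from this answer, LOOSE the accepted one.
STRICT = /\/pages\/multiedit\/\d+(?:,\d+)*/
LOOSE  = /\/pages\/multiedit\/[\d,]+/

# Hypothetical helper: returns the matched substring, or nil if no match.
def matched(pattern, path)
  path[pattern]
end

["/pages/multiedit/16801,16809,16817",
 "/pages/multiedit/,,,",
 "/pages/multiedit/,1,",
 "/pages/multiedit/1,3,"].each do |path|
  puts "#{path}: strict=#{matched(STRICT, path).inspect} loose=#{matched(LOOSE, path).inspect}"
end
```

STRICT returns nil for the comma-only inputs and leaves a trailing comma out of the match, while LOOSE accepts all of them.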
I'd use:
/\/pages\/multiedit\/[\d,]+/
Here's a demonstration of the pattern at http://rubular.com/r/h7VLZS1W1q
[\d,]+ means "find one or more numbers or commas"
The reason \d* doesn't work is it means "find zero or more numbers". As soon as the pattern search runs into a comma it stops. You have to tell the engine that it's OK to find numbers and commas.
I have just gone through the concept of Zero-Width Assertions in the documentation, and some quick questions come to mind:
Why the name Zero-Width Assertions?
How do the look-ahead and look-behind concepts support the Zero-Width Assertions concept?
What do the four symbols ?<=s, ?<!s, ?=s, and ?!s instruct inside the pattern? Can you help me understand what is actually going on?
I also tried some small snippets to understand the logic, but I'm not that confident about their output:
irb(main):001:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"
irb(main):002:0> "foresight".sub(/(?=s)ight/, 'ee')
=> "foresight"
irb(main):003:0> "foresight".sub(/(?<=s)ight/, 'ee')
=> "foresee"
irb(main):004:0> "foresight".sub(/(?<!s)ight/, 'ee')
=> "foresight"
Can anyone help me here to understand?
EDIT
Here I have tried two snippets, one using the "Zero-Width Assertions" concept, as below:
irb(main):002:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"
and the other is without "Zero-Width Assertions" concepts as below:
irb(main):003:0> "foresight".sub(/ight/, 'ee')
=> "foresee"
Both of the above produce the same output. How does each regexp proceed internally to produce it? Could you help me visualize that?
Thanks
Regular expressions match from left to right, and move a sort of "cursor" along the string as they go. If your regex contains a regular character like a, this means: "if there's a letter a in front of the cursor, move the cursor ahead one character, and keep going. Otherwise, something's wrong; back up and try something else." So you might say that a has a "width" of one character.
A "zero-width assertion" is just that: it asserts something about the string (i.e., doesn't match if some condition doesn't hold), but it doesn't move the cursor forwards, because its "width" is zero.
You're probably already familiar with some simpler zero-width assertions, like ^ and $. These match the start and end of a line (in Ruby, \A and \z are the ones that match the start and end of the whole string). If the cursor isn't at the start or end when it sees those symbols, the regex engine will fail, back up, and try something else. But they don't actually move the cursor forwards, because they don't match characters; they only check where the cursor is.
Lookahead and lookbehind work the same way. When the regex engine tries to match them, it checks around the cursor to see if the right pattern is ahead of or behind it, but in case of a match, it doesn't move the cursor.
Consider:
/(?=foo)foo/.match 'foo'
This will match! The regex engine goes like this:
Start at the beginning of the string: |foo.
The first part of the regex is (?=foo). This means: only match if foo appears after the cursor. Does it? Well, yes, so we can proceed. But the cursor doesn't move, because this is zero-width. We still have |foo.
Next is f. Is there an f in front of the cursor? Yes, so proceed, and move the cursor past the f: f|oo.
Next is o. Is there an o in front of the cursor? Yes, so proceed, and move the cursor past the o: fo|o.
Same thing again, bringing us to foo|.
We reached the end of the regex, and nothing failed, so the pattern matches.
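The walkthrough above can be checked by looking at where each match begins and ends. This is only a sketch; the constant names and the helper span are made up for illustration:

```ruby
LOOKAHEAD_ONLY     = /(?=foo)/
LOOKAHEAD_THEN_FOO = /(?=foo)foo/

# Hypothetical helper: returns [start, end, matched text], or nil if no match.
def span(pattern, text)
  m = pattern.match(text)
  m && [m.begin(0), m.end(0), m[0]]
end

p span(LOOKAHEAD_ONLY, 'foo')      # the lookahead alone consumes nothing
p span(LOOKAHEAD_THEN_FOO, 'foo')  # the literal foo moves the cursor
```

The lookahead by itself matches an empty string (start and end are the same position), while the full pattern spans all three characters.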
On your four assertions in particular:
(?=...) is "lookahead"; it asserts that ... does appear after the cursor.
1.9.3p125 :002 > 'jump june'.gsub(/ju(?=m)/, 'slu')
=> "slump june"
The "ju" in "jump" matches because an "m" comes next. But the "ju" in "june" doesn't have an "m" next, so it's left alone.
Since it doesn't move the cursor, you have to be careful when putting anything after it. (?=a)b will never match anything, because it checks that the next character is a, then also checks that the same character is b, which is impossible.
(?<=...) is "lookbehind"; it asserts that ... does appear before the cursor.
1.9.3p125 :002 > 'four flour'.gsub(/(?<=f)our/, 'ive')
=> "five flour"
The "our" in "four" matches because there's an "f" immediately before it, but the "our" in "flour" has an "l" immediately before it so it doesn't match.
Like above, you have to be careful with what you put before it. a(?<=b) will never match, because it checks that the next character is a, moves the cursor, then checks that the previous character was b.
(?!...) is "negative lookahead"; it asserts that ... does not appear after the cursor.
1.9.3p125 :003 > 'child children'.gsub(/child(?!ren)/, 'kid')
=> "kid children"
"child" matches, because what comes next is a space, not "ren". "children" doesn't.
This is probably the one I get the most use out of; finely controlling what can't come next comes in handy.
(?<!...) is "negative lookbehind"; it asserts that ... does not appear before the cursor.
1.9.3p125 :004 > 'foot root'.gsub(/(?<!r)oot/, 'eet')
=> "feet root"
The "oot" in "foot" is fine, since there's no "r" before it. The "oot" in "root" clearly has an "r".
As an additional restriction, most regex engines require that ... has a fixed length in this case. So you can't use ?, +, *, or {n,m}.
You can also nest these and otherwise do all kinds of crazy things. I use them mainly for one-offs I know I'll never have to maintain, so I don't have any great examples of real-world applications handy; honestly, they're weird enough that you should try to do what you want some other way first. :)
Afterthought: The syntax comes from Perl regular expressions, which used (? followed by various symbols for a lot of extended syntax because ? on its own is invalid. So <= doesn't mean anything by itself; (?<= is one entire token, meaning "this is the start of a lookbehind". It's like how += and ++ are separate operators, even though they both start with +.
They're easy to remember, though: = indicates looking forwards (or, really, "here"), < indicates looking backwards, and ! has its traditional meaning of "not".
Regarding your later examples:
irb(main):002:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"
irb(main):003:0> "foresight".sub(/ight/, 'ee')
=> "foresee"
Yes, these produce the same output. This is that tricky bit with using lookahead:
The regex engine has tried some things, but they haven't worked, and now it's at fores|ight.
It checks (?!s). Is the character after the cursor s? No, it's i! So that part matches and the matching continues, but the cursor doesn't move, and we still have fores|ight.
It checks ight. Does ight come after the cursor? Well, yes, it does, so move the cursor: foresight|.
We're done!
The cursor moved over the substring ight, so that's the full match, and that's what gets replaced.
Doing (?!a)b is useless, since you're saying: the next character must not be a, and it must be b. But that's the same as just matching b!
This can be useful sometimes, but you need a more complex pattern: for example, (?!3)\d will match any digit that isn't a 3.
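A quick sketch of both points, using a hypothetical helper name digits_except_three:

```ruby
# Hypothetical helper: collects every digit that is not a 3.
def digits_except_three(text)
  # (?!3) rejects positions where a 3 comes next; \d then consumes one digit.
  text.scan(/(?!3)\d/)
end

p digits_except_three("a1b3c5 33 7")

# (?!a)b behaves exactly like a plain b on the same input.
p "abab".scan(/(?!a)b/) == "abab".scan(/b/)
```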
This is what you want:
1.9.3p125 :001 > "foresight".sub(/(?<!s)ight/, 'ee')
=> "foresight"
This asserts that s doesn't come before ight.
Zero-width assertions are difficult to understand until you realize that regex matches positions as well as characters.
When you see the string "foo" you naturally read three characters. But there are also four positions, marked here by pipes: "|f|o|o|". A lookahead or lookbehind (collectively, lookarounds) matches a position where the characters before or after it match the expression.
The difference between a zero-width expression and other expressions is that the zero-width expression only matches (or "consumes") the position. So, for example:
/(app)apple/
will fail to match "apple" because it's trying to match "app" twice. But
/(?=app)apple/
will succeed because the lookahead is only matching the position where "app" follows. It doesn't actually match the "app" characters, allowing the next expression to consume them.
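The two patterns can be compared directly. A minimal sketch, where match_text is a made-up helper name:

```ruby
# Hypothetical helper: returns the matched text, or nil if there is no match.
def match_text(pattern, text)
  m = pattern.match(text)
  m && m[0]
end

p match_text(/(app)apple/, "apple")     # the group consumes "app", then needs "apple" again
p match_text(/(app)apple/, "appapple")  # succeeds only when "app" really occurs twice
p match_text(/(?=app)apple/, "apple")   # the lookahead consumes nothing
```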
LOOKAROUND DESCRIPTIONS
Positive Lookahead: (?=s)
Imagine you are a drill sergeant and you are performing an inspection. You begin at the front of the line with the intention of walking past each private and ensuring they meet expectations. But, before doing so, you look ahead one by one to make sure they have lined up in the proper order. The privates' names are "A", "B", "C", "D" and "E". /(?=ABCDE)...../.match('ABCDE'). Yep, they are all present and accounted for.
Negative Lookahead: (?!s)
You perform the inspection down the line and are finally standing past private E. Now you are going to look ahead to make sure that "F" from the other company has not, yet again, accidentally slipped into the wrong formation. /.....(?!F)/.match('ABCDE'). Nope, he hasn't slipped in this time, so all is well.
Positive Lookbehind: (?<=s)
After completing the inspection, the sergeant is at the end of the formation. He turns and scans back to make sure no one has snuck away. /.....(?<=ABCDE)/.match('ABCDE'). Yep, everyone is present and accounted for.
Negative Lookbehind: (?<!s)
Finally, the drill sergeant takes one last look to make sure that privates A and B have not, once again, switched places (because they like KP). /.....(?<!BACDE)/.match('ABCDE'). Nope, they haven't, so all is well.
A zero-width assertion is an expression that consumes zero characters while matching. For example, in
"foresight".sub(/sight/, 'ee')
what is matched is
foresight
    ^^^^^
and thus the result would be
foreee
However, in this example,
"foresight".sub(/(?<=s)ight/, 'ee')
what is matched is
foresight
     ^^^^
and therefore the result would be
foresee
Another example of a zero-width assertion is the word-boundary character, \b. For example, to match a complete word, you might try surrounding the word with spaces, e.g.
"flight light plight".sub(/\slight\s/, 'dark')
to get
flightdarkplight
But you see how matching the spaces removes them during substitution? Using a word boundary gets around this problem:
"flight light plight".sub(/\blight\b/, 'dark')
The \b matches the beginning or end of a word, but does not actually match a character: it's zero-width.
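Running both substitutions side by side shows the difference. A small sketch, with swap_light as a made-up helper name:

```ruby
# Hypothetical helper: replaces the first match of pattern with "dark".
def swap_light(text, pattern)
  text.sub(pattern, 'dark')
end

sentence = "flight light plight"
puts swap_light(sentence, /\slight\s/)  # the spaces are part of the match and get replaced
puts swap_light(sentence, /\blight\b/)  # the boundaries are zero-width, so the spaces survive
```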
Maybe the most succinct answer to your question is this: Lookahead and lookbehind assertions are one kind of zero-width assertion. All lookahead and lookbehind assertions are zero-width assertions.
Here are explanations of your examples:
irb(main):001:0> "foresight".sub(/(?!s)ight/, 'ee')
=> "foresee"
Above, you're saying, "Match where the next character is not an s, and then an i." This is always true for an i, since an i is never an s, so the substitution succeeds.
irb(main):002:0> "foresight".sub(/(?=s)ight/, 'ee')
=> "foresight"
Above, you're saying, "Match where the next character is an s, and then an i." This is never true, since an i is never an s, so the substitution fails.
irb(main):003:0> "foresight".sub(/(?<=s)ight/, 'ee')
=> "foresee"
Above, already explained. (This is the correct one.)
irb(main):004:0> "foresight".sub(/(?<!s)ight/, 'ee')
=> "foresight"
Above, should be clear by now. In this case, "firefight" would substitute to "firefee", but not "foresight" to "foresee".
I want to scrape data from some text and dump it into an array. Consider the following text as example data:
| Example Data
| Title: This is a sample title
| Content: This is sample content
| Date: 12/21/2012
I am currently using the following regex to scrape the data that is specified after the 'colon' character:
/((?=:).+)/
Unfortunately this regex also grabs the colon and the space after the colon. How do I only grab the data?
Also, I'm not sure if I'm doing this right, but it appears as though the outer parens cause a match to return an array. Is this the function of the parens?
EDIT: I'm using Rubular to test out my regex expressions
You could change it to:
/: (.+)/
and grab the contents of group 1. A lookbehind works too, though, and does just what you're asking:
/(?<=: ).+/
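Both variants can be tried on the sample data from the question. This is a sketch; the constant SAMPLE and the two helper names are made up for illustration. It also shows the effect of the parentheses: String#scan returns an array of arrays when the pattern contains a group, hence the flatten.

```ruby
SAMPLE = [
  "| Example Data",
  "| Title: This is a sample title",
  "| Content: This is sample content",
  "| Date: 12/21/2012"
].join("\n")

# Group variant: scan yields one array per match, holding group 1.
def values_by_group(text)
  text.scan(/: (.+)/).flatten
end

# Lookbehind variant: no group, so scan yields the matched strings directly.
def values_by_lookbehind(text)
  text.scan(/(?<=: ).+/)
end

p values_by_group(SAMPLE)
p values_by_lookbehind(SAMPLE)
```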
In addition to @minitech's answer, you can also make a 3rd variation:
/(?<=: ?)(.+)/
The difference here being, you create/grab the group using a look-behind.
If you still prefer the look-ahead rather than look-behind concept. . .
/(?=: ?(.+))/
This keeps your look-ahead and adds a group inside it, so the data is captured in group 1.
And yes, the outer parentheses in your code create a capture group. Compare that to the latter example I gave, where the group sits inside the look-ahead, rather than needlessly wrapping the whole pattern in /( ... )/: in most regular expression engines the first result is already the entire matched string.
I know you are asking for regex but I just saw the regex solution and found that it is rather hard to read for those unfamiliar with regex.
I'm also using Ruby and I decided to do it with:
line_as_string.split(": ")[-1]
This does what you require and IMHO it's far more readable.
For a very long string it might be inefficient. But not for this purpose.
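One thing to watch with split is a value that itself contains ": ", since [-1] then returns only the last piece. A possible variation, assuming a hypothetical helper name value_after_colon, passes a limit of 2 so the string is split at the first ": " only:

```ruby
# Hypothetical helper: everything after the first ": " on the line.
# If the line has no ": ", split returns a one-element array and the
# whole line comes back unchanged.
def value_after_colon(line)
  line.split(": ", 2).last
end

puts value_after_colon("| Title: This is a sample title")
puts value_after_colon("| Title: Ruby: a tour")
puts "| Title: Ruby: a tour".split(": ")[-1]  # without the limit, only the last piece
```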
In Ruby, as in PCRE and Boost, you may make use of the \K match reset operator:
\K keeps the text matched so far out of the overall regex match. h\Kd matches only the second d in adhd.
So, you may use
/:[[:blank:]]*\K.+/ # To only match horizontal whitespaces with `[[:blank:]]`
/:\s*\K.+/ # To match any whitespace with `\s`
See the Rubular demo #1 and the Rubular demo #2.
Details
: - a colon
[[:blank:]]* - 0 or more horizontal whitespace chars
\K - match reset operator discarding the text matched so far from the overall match memory buffer
.+ - matches and consumes any 1 or more chars other than line break chars (use /m modifier to match any chars including line break chars).
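Applied to the lines from the question, the first pattern can be used like this. A minimal sketch, where after_colon is a made-up helper name:

```ruby
# Hypothetical helper: the text after the colon and any blanks, or nil.
def after_colon(line)
  # Everything before \K is matched but dropped from the returned match.
  line[/:[[:blank:]]*\K.+/]
end

p after_colon("| Title: This is a sample title")
p after_colon("| Date:\t12/21/2012")
p after_colon("| Example Data")  # no colon, so no match
```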
I'm trying to match some text if it does not have another block of text in its vicinity. For example, I would like to match "bar" if "foo" does not precede it. I can match "bar" if "foo" does not immediately precede it using negative look behind in this regex:
/(?<!foo)bar/
but I would also like not to match "foo 12345 bar". I tried:
/(?<!foo.{1,10})bar/
but using a wildcard + a range appears to be an invalid regex in Ruby. Am I thinking about the problem wrong?
You are thinking about it the right way. Unfortunately, lookbehinds usually have to be of fixed length. The only major exception to that is .NET's regex engine, which allows repetition quantifiers inside lookbehinds. But since you only need a negative lookbehind and not a lookahead as well, there is a hack for you: reverse the string, then try to match:
/rab(?!.{0,10}oof)/
Then reverse the result of the match or subtract the matching position from the string's length, if that's what you are after.
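The reversal trick, including the conversion back to a position in the original string, might look like the sketch below. The helper name bar_index_without_foo is made up, and because the search runs over the reversed string it reports the last qualifying "bar" of the original:

```ruby
# Hypothetical helper: index in the original string of a "bar" that has
# no "foo" ending within the 10 characters before it, or nil if none.
def bar_index_without_foo(text)
  # In the reversed string the lookbehind becomes a lookahead.
  rev_index = text.reverse =~ /rab(?!.{0,10}oof)/
  return nil unless rev_index
  # Subtract the reversed position and the length of "bar" from the length.
  text.length - rev_index - 3
end

p bar_index_without_foo("foo 12345 bar")
p bar_index_without_foo("baz bar")
p bar_index_without_foo("foo" + "x" * 11 + "bar")
```

The first call returns nil because "foo" is within range; the other two return the index at which "bar" starts.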
Now from the regex you have given, I suppose that this was only a simplified version of what you actually need. Of course, if bar is a complex pattern itself, some more thought needs to go into how to reverse it correctly.
Note that if your pattern required both variable-length lookbehinds and lookaheads, you would have a harder time solving this. Also, in your case, it would be possible to deconstruct your lookbehind into multiple fixed-length ones (because you use neither + nor *):
/(?<!foo)(?<!foo.)(?<!foo.{2})(?<!foo.{3})(?<!foo.{4})(?<!foo.{5})(?<!foo.{6})(?<!foo.{7})(?<!foo.{8})(?<!foo.{9})(?<!foo.{10})bar/
But that's not all that nice, is it?
As m.buettner already mentions, lookbehind in Ruby regex has to be of fixed length, and is described so in the documentation. So, you cannot put a variable-length quantifier within a lookbehind.
You don't need to check all in one step. Try doing multiple steps of regex matches to get what you want. Assuming that existence of foo in front of a single instance of bar breaks the condition regardless of whether there is another bar, then
string.match(/bar/) and !string.match(/foo.*bar/)
will give you what you want for the example.
If you would rather have the match succeed with bar foo bar, then you can do this:
string.scan(/foo|bar/).first == "bar"
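Wrapped in two predicate methods (the names are made up for illustration), the two checks behave as follows on a few inputs:

```ruby
# Hypothetical helper: true if there is a bar and no foo anywhere before a bar.
def bar_with_no_foo_before?(string)
  !!(string.match(/bar/) and !string.match(/foo.*bar/))
end

# Hypothetical helper: true if the first of foo/bar to appear is bar.
def bar_comes_first?(string)
  string.scan(/foo|bar/).first == "bar"
end

["foo 12345 bar", "bar", "bar foo bar", "baz"].each do |s|
  puts "#{s.inspect}: #{bar_with_no_foo_before?(s)} / #{bar_comes_first?(s)}"
end
```

The two checks differ only on "bar foo bar", where the first rejects the string and the second accepts it.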