regex multiple matches with OR look behind - ruby

I have the following string:
'/photos/full/1/454/6454.jpg?20140521103415','/photos/full/2/452/54_2.jpg?20140521104743','/photos/full/3/254/C2454_3.jpg?20140521104744'
What I want to parse is the address from / to the ? but I can't seem to figure it out.
So far I have /(?<=')[^?]*/ which will properly get the first link, but the second and third link will start with ,'/photos/full/... <--notice that it starts with a ,'
If I then try /(?<=',')[^?]*/ I get the second and third link but miss the first link.
Rather than do 2 regexes, is there a way I can combine them to do 1? I've tried using `/((?<=')|(?<=',')[^?]*/ to no avail.
My code is of the form matches = string.scan(regex) and then I run a match.each block...

In Ruby 2, which has \K, you can use this simple regex (see demo):
'\K/[^?]+
To see all the matches:
regex = /'\K\/[^?]+/
subject.scan(regex) {|result|
# inspect result
}
Explain Regex
' # '\''
\K # 'Keep Out!' abandons what we have matched so far
\/ # '/'
[^?]+ # any character except: '?' (1 or more times
# (matching the most amount possible))

You can use this:
(?<=,|^)'\K[^?]+
Where (?<=,|^) checks that the quote is preceded with a comma or the start of the string/line. And where \K removes all on the left (the comma here) from the match result.
or more simple:
[^?']+(?=\?)
all that is not a quote or a question mark followed by a question mark.

One can simply use a positive lookahead and non-greedy operator, and this of course is not limited to v2.0:
str.scan(/(?<=')\/.*?(?=\?)/)
#=> ["/photos/full/1/454/6454.jpg",
# "/photos/full/2/452/54_2.jpg",
# "/photos/full/3/254/C2454_3.jpg"]
Edit: I added a positive lookbehined for the single quote. See comments.

Related

What is the difference between these three alternative ways to write Ruby regular expressions?

I want to match the path "/". I've tried the following alternatives, and the first two do match, but I don't know why the third doesn't:
/\A\/\z/.match("/") # <MatchData "/">
"/\A\/\z/".match("/") # <MatchData "/">
Regexp.new("/\A\/\z/").match("/") # nil
What's going on here? Why are they different?
The first snippet is the only correct one.
The second example is... misleading. That string literal "/\A\/\z/" is, obviously, not a regex. It's a string. Strings have #match method which converts its argument to a regexp (if not already one) and match against it. So, in this example, it's '/' that is the regular expression, and it matches a forward slash found in the other string.
The third line is completely broken: don't need the surrounding slashes there, they are part of regex literal, which you didn't use. Also use single quoted strings, not double quoted (which try to interpret escape sequences like \A)
Regexp.new('\A/\z').match("/") # => #<MatchData "/">
And, of course, none of the above is needed if you just want to check if a string consists of only one forward slash. Just use the equality check in this case.
s == '/'

How do I tune this regex to return the matches I want?

So I have a string that looks like this:
#jackie#test.com, #mike#test.com
What I want to do is before any email in this comma separated list, I want to remove the #. The issue I keep running into is that if I try to do a regular \A flag like so /[\A#]+/, it finds all the instances of # in that string...including the middle crucial #.
The same thing happens if I do /[\s#]+/. I can't figure out how to just look at the beginning of each string, where each string is a complete email address.
Edit 1
Note that all I need is the regex, I already have the rest of the stuff I need to do what I want. Specifically, I am achieving everything else like this:
str.gsub(/#/, '').split(',').map(&:strip)
Where str is my string.
All I am looking for is the regex portion for my gsub.
You may use the below negative lookbehind based regex.
str.gsub(/(?<!\S)#/, '').split(',').map(&:strip)
(?<!\S) Negative lookbehind asserts that the character or substring we are going to match would be preceeded by any but not of a non-space character. So this matches the # which exists at the start or the # which exists next to a space character.
Difference between my answer and hwnd's str.gsub(/\B#/, '') is, mine won't match the # which exists in :# but hwnd's answer does. \B matches between two word characters or two non-word characters.
Here is one solution
str = "#jackie#test.com, #mike#test.com"
p str.split(/,[ ]+/).map{ |i| i.gsub(/^#/, '')}
Output
["jackie#test.com", "mike#test.com"]

Ruby regex - gsub only captured group

I'm not quite sure I understand how non-capturing groups work. I am looking for a regex to produce this result: 5.214. I thought the regex below would work, but it is replacing everything including the non-capture groups. How can I write a regex to only replace the capture groups?
"5,214".gsub(/(?:\d)(,)(?:\d)/, '.')
# => ".14"
My desired result:
"5,214".gsub(some_regex)
#=> "5.214
non capturing groups still consumes the match
use
"5,214".gsub(/(\d+)(,)(\d+)/, '\1.\3')
or
"5,214".gsub(/(?<=\d+)(,)(?=\d+)/, '.')
You can't. gsub replaces the entire match; it does not do anything with the captured groups. It will not make any difference whether the groups are captured or not.
In order to achieve the result, you need to use lookbehind and lookahead.
"5,214".gsub(/(?<=\d),(?=\d)/, '.')
It is also possible to use Regexp.last_match (also available via $~) in the block version to get access to the full MatchData:
"5,214".gsub(/(\d),(\d)/) { |_|
match = Regexp.last_match
"#{match[1]}.#{match[2]}"
}
This scales better to more involved use-cases.
Nota bene, from the Ruby docs:
the ::last_match is local to the thread and method scope of the method that did the pattern match.
gsub replaces the entire match the regular expression engine produces. Both capturing/non-capturing group constructs are not retained. However, you could use lookaround assertions which do not "consume" any characters on the string.
"5,214".gsub(/\d\K,(?=\d)/, '.')
Explanation: The \K escape sequence resets the starting point of the reported match and any previously consumed characters are no longer included. That being said, we then look for and match the comma, and the Positive Lookahead asserts that a digit follows.
I know nothing about ruby.
But from what i see in the tutorial
gsub mean replace,
the pattern should be /(?<=\d+),(?=\d+)/ just replace the comma with dot
or, use capture /(\d+),(\d+)/ replace the string with "\1.\2"?
You can easily reference capture groups in the replacement string (second argument) like so:
"5,214".gsub(/(\d+)(,)(\d+)/, '\1.\3')
#=> "5.214"
\0 will return the whole matched string.
\1 will be replaced by the first capturing group.
\2 will be replaced by the second capturing group etc.
You could rewrite the example above using a non-capturing group for the , char.
"5,214".gsub(/(\d+)(?:,)(\d+)/, '\1.\2')
#=> "5.214"
As you can see, the part after the comma is now the second capturing group, since we defined the middle group as non-capturing.
Although it's kind of pointless in this case. You can just omit the capturing group for , altogether
"5,214".gsub(/(\d+),(\d+)/, '\1.\2')
#=> "5.214"
You don't need regexp to achieve what you need:
'1,200.00'.tr('.','!').tr(',','.').tr('!', ',')
Periods become bangs (1,200!00)
Commas become periods (1.200!00)
Bangs become commas (1.200,00)

Regex - Matching text AFTER certain characters

I want to scrape data from some text and dump it into an array. Consider the following text as example data:
| Example Data
| Title: This is a sample title
| Content: This is sample content
| Date: 12/21/2012
I am currently using the following regex to scrape the data that is specified after the 'colon' character:
/((?=:).+)/
Unfortunately this regex also grabs the colon and the space after the colon. How do I only grab the data?
Also, I'm not sure if I'm doing this right.. but it appears as though the outside parens causes a match to return an array. Is this the function of the parens?
EDIT: I'm using Rubular to test out my regex expressions
You could change it to:
/: (.+)/
and grab the contents of group 1. A lookbehind works too, though, and does just what you're asking:
/(?<=: ).+/
In addition to #minitech's answer, you can also make a 3rd variation:
/(?<=: ?)(.+)/
The difference here being, you create/grab the group using a look-behind.
If you still prefer the look-ahead rather than look-behind concept. . .
/(?=: ?(.+))/
This will place a grouping around your existing regex where it will catch it within a group.
And yes, the outside parenthesis in your code will make a match. Compare that to the latter example I gave where the entire look-ahead is 'grouped' rather than needlessly using a /( ... )/ without the /(?= ... )/, since the first result in most regular expression engines return the entire matched string.
I know you are asking for regex but I just saw the regex solution and found that it is rather hard to read for those unfamiliar with regex.
I'm also using Ruby and I decided to do it with:
line_as_string.split(": ")[-1]
This does what you require and IMHO it's far more readable.
For a very long string it might be inefficient. But not for this purpose.
In Ruby, as in PCRE and Boost, you may make use of the \K match reset operator:
\K keeps the text matched so far out of the overall regex match. h\Kd matches only the second d in adhd.
So, you may use
/:[[:blank:]]*\K.+/ # To only match horizontal whitespaces with `[[:blank:]]`
/:\s*\K.+/ # To match any whitespace with `\s`
Seee the Rubular demo #1 and the Rubular demo #2 and
Details
: - a colon
[[:blank:]]* - 0 or more horizontal whitespace chars
\K - match reset operator discarding the text matched so far from the overall match memory buffer
.+ - matches and consumes any 1 or more chars other than line break chars (use /m modifier to match any chars including line break chars).

Match comma separated list with Ruby Regex

Given the following string, I'd like to match the elements of the list and parts of the rest after the colon:
foo,bar,baz:something
I.e. I am expecting the first three match groups to be "foo", "bar", "baz". No commas and no colon. The minimum number of elements is 1, and there can be arbitrarily many. Assume no whitespace and lower case.
I've tried this, which should work, but doesn't populate all the match groups for some reason:
^([a-z]+)(?:,([a-z]+))*:(something)
That matches foo in \1 and baz (or whatever the last element is) in \2. I don't understand why I don't get a match group for bar.
Any ideas?
EDIT: Ruby 1.9.3, if that matters.
EDIT2: Rubular link: http://rubular.com/r/pDhByoarbA
EDIT3: Add colon to the end, because I am not just trying to match the list. Sorry, oversimplified the problem.
This expression works for me: /(\w+)/i
If you want to do it with regex, how about this?
(?<=^|,)("[^"]*"|[^,]*)(?=,|$)
This matches comma-separated fields, including the possibility of commas appearing inside quoted strings like 123,"Yes, No". Regexr for this.
More verbosely:
(?<=^|,) # Must be preceded by start-of-line or comma
(
"[^"]*"| # A quote, followed by a bunch of non-quotes, followed by quote, OR
[^,]* # OR anything until the next comma
)
(?=,|$) # Must end with comma or end-of-line
Usage would be with something like Python's re.findall(), which returns all non-overlapping matches in the string (working from left to right, if that matters.) Don't use it with your equivalent of re.search() or re.match() which only return the first match found.
(NOTE: This actually doesn't work in Python because the lookbehind (?<=^|,) isn't fixed width. Grr. Open to suggestions on this one.)
Edit: Use a non-capturing group to consume start-of-line or comma, instead of a lookbehind, and it works in Python.
>>> test_str = '123,456,"String","String, with, commas","Zero-width fields next",,"",nyet,123'
>>> m = re.findall('(?:^|,)("[^"]*"|[^,]*)(?=,|$)',test_str)
>>> m
['123', '456', '"String"', '"String, with, commas"',
'"Zero-width fields next"', '', '""', 'nyet', '123']
Edit 2: The Ruby equivalent of Python's re.findall(needle, haystack) is haystack.scan(needle).
Maybe split will be better solution for this case?
'foo,bar,baz'.split(',')
=> ["foo", "bar", "baz"]
If I am interpreting your post correctly, you want everything separated by commas before the colon (:).
The appropriate regex for this would be:
[^\s:]*(,[^\s:]*)*(:.*)?
This should find everything you are looking for.

Resources