Ruby regex: match URL recurring pattern

I want to be able to match all of the following cases using Ruby 1.8.7.
/pages/multiedit/16801,16809,16817,16825,16833
/pages/multiedit/16801,16809,16817
/pages/multiedit/16801
/pages/multiedit/1,3,5,7,8,9,10,46
I currently have:
\/pages\/multiedit\/\d*
This matches only up to the first set of numbers. For example:
"/pages/multiedit/16801,16809,16817,16825,16833"[/\/pages\/multiedit\/\d*/]
# => "/pages/multiedit/16801"
See http://rubular.com/r/ruFPx5yIAF for example.
Thanks for the help, regex gods.

\/pages\/multiedit\/\d+(?:,\d+)*
Example: http://rubular.com/r/0nhpgki6Gy
Edit: Updated to not capture anything... Although the performance hit would be negligible. (Thanks Tin Man)
The currently accepted answer of
\/pages\/multiedit\/[\d,]+
may not be a good idea because that will also match the following strings
.../pages/multiedit/,,,
.../pages/multiedit/,1,
My answer requires there be at least one digit before the first comma, and at least one digit between commas, and it must end with a digit.
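A quick way to see the difference between the two patterns is to run both against well-formed and malformed paths. This is only a sketch: the method names and the \A...\z anchors are my additions (assumptions, not part of either answer), added so that the whole string has to match.

```ruby
# Pattern from this answer, anchored with \A and \z (the anchors are an
# assumption added here so that the whole string must match).
def strict_multiedit?(path)
  !(path =~ /\A\/pages\/multiedit\/\d+(?:,\d+)*\z/).nil?
end

# The accepted answer's pattern with the same anchors, for comparison.
def loose_multiedit?(path)
  !(path =~ /\A\/pages\/multiedit\/[\d,]+\z/).nil?
end

[
  "/pages/multiedit/16801,16809,16817",
  "/pages/multiedit/16801",
  "/pages/multiedit/,,,",
  "/pages/multiedit/,1,"
].each do |path|
  puts "#{path} strict=#{strict_multiedit?(path)} loose=#{loose_multiedit?(path)}"
end
```

Both patterns accept the first two paths; only the [\d,]+ version accepts the last two.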

I'd use:
/\/pages\/multiedit\/[\d,]+/
Here's a demonstration of the pattern at http://rubular.com/r/h7VLZS1W1q
[\d,]+ means "find one or more numbers or commas"
The reason \d* doesn't work is it means "find zero or more numbers". As soon as the pattern search runs into a comma it stops. You have to tell the engine that it's OK to find numbers and commas.
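To make the "zero or more" point concrete, here is a small sketch comparing what each pattern actually returns. The helper name matched_part is hypothetical and only there to keep the example short.

```ruby
# Returns the part of the path matched by the given pattern, or nil.
def matched_part(path, pattern)
  path[pattern]
end

path = "/pages/multiedit/16801,16809,16817"

# \d* stops at the first comma.
puts matched_part(path, /\/pages\/multiedit\/\d*/)
# [\d,]+ keeps going through digits and commas.
puts matched_part(path, /\/pages\/multiedit\/[\d,]+/)
# Because * means "zero or more", \d* still matches when no id is present.
puts matched_part("/pages/multiedit/", /\/pages\/multiedit\/\d*/).inspect
```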

Related

Weird thing in regex

While practicing on rubular.com, I was trying to write a regular expression that checks whether a word starts with a non-consonant. My approach is to check for words that begin with a vowel, a non-letter, a digit or an underscore, or for the empty string.
I've found some strange behaviour:
My regex /^[aeiou_0-9\W]|^$/i matches the consonants k and s! I don't understand why.
Any ideas?
A link to example -> http://rubular.com/r/0zt0VPmcwr
This is very funny because you have stumbled across a bug specifically for just the letters k and s when using \W with /i (it's like a perfect storm).
Here is the link that explains the bug: https://bugs.ruby-lang.org/issues/4044
Perhaps this was patched in a later version of ruby, but if you don't feel like going through the hassle of going to a new version of ruby, then you can just explicitly make an inverted character class of all the consonants:
/^[^bcdfghjklmnpqrstvwxyz]|^$/i
Here is the rubular link: http://rubular.com/r/URgsWP3suQ
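As a rough check of the workaround, the explicit consonant class can be wrapped in a small predicate and tried on a few words. The method name and the sample words are made up for illustration.

```ruby
# True when the word is empty or does not start with a consonant.
# Uses the explicit negated consonant class from the answer above.
def starts_with_non_consonant?(word)
  !(word =~ /^[^bcdfghjklmnpqrstvwxyz]|^$/i).nil?
end

%w[apple kiwi Snake _tmp 9lives].each do |word|
  puts "#{word}: #{starts_with_non_consonant?(word)}"
end
puts "(empty): #{starts_with_non_consonant?('')}"
```

Since \W no longer appears in the class, the k and s cases behave like any other consonant here.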
Edit:
So, something else I noticed is that your regex (and the regex I provided above) matches only the first letter of each word, whereas the regex below matches the whole word. I don't know if this makes a difference for you, but I felt it was worth pointing out. Compare the highlighting in the rubular link above with the one below: the link above highlights only the first letter of each word, while the link below highlights whole words:
^[^bcdfghjklmnpqrstvwxyz].*|^$
http://rubular.com/r/IVJ03uOK4h
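The highlighting difference can also be seen outside Rubular by looking at the matched text directly. In this sketch the method names are hypothetical, and the /i flag on the whole-word pattern is my assumption (the pattern as posted omits it, which would let an uppercase consonant through).

```ruby
# Matched text for the first-letter-only pattern.
def first_char_match(word)
  word[/^[^bcdfghjklmnpqrstvwxyz]|^$/i]
end

# Matched text for the whole-word pattern (with /i added as an assumption).
def whole_word_match(word)
  word[/^[^bcdfghjklmnpqrstvwxyz].*|^$/i]
end

puts first_char_match("apple").inspect  # "a"
puts whole_word_match("apple").inspect  # "apple"
puts whole_word_match("kiwi").inspect   # nil
```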
It is a bug in Ruby regex in some versions. Select version 1.8.7 in the dropdown and you will see your regex works properly.
Edit. Check the docs at http://ruby-doc.org/core-2.1.5/Regexp.html. More specifically, in the metacharacters section:
/\W/ - A non-word character ([^a-zA-Z0-9_]). Please take a look at Bug #4044 if using /\W/ with the /i modifier.

Multi-Line Regex: Find A where B is absent

I have been looking through a lot on Regex lately and have seen a lot of answers involving the matching of one word, where a second word is absent. I have seen a lot of Regex Examples where I can have a Regex search for a given word (or any more complex regex in its place) and find where a word is missing.
This seems to work very well on a line-by-line basis, but after enabling multi-line mode it still doesn't seem to match properly.
Example: Match an entire file string where the word foo is included, but the word bar is absent from the file. What I have so far is (?m)^(?=.*?(foo))((?!bar).)*$ which is based off the example link. I have been testing with a Ruby Regex tester, but I think it is an open-ended regex problem/question. It seems to match smaller pieces; I would like it to either match or not match the entire string as one big chunk.
In the provided example above, matches are found on a line by line basis it seems. What changes need to be made to the regex so it applies over the ENTIRE string?
EDIT: I know there are other, more efficient ways to solve this problem that don't involve a regex. I am not looking for a solution using other means; I am asking from a theoretical regex point of view. The engine has a multi-line mode (which appears to "work") and negative/positive lookarounds that can be combined on a line-by-line basis, so why doesn't combining these two principles yield the expected result?
Sawa's answer can be simplified, all that's needed is a positive lookahead, a negative lookahead, and since you're in multiline mode, .* takes care of the rest:
/(?=.*foo)(?!.*bar).*/m
Multiline means that . matches \n also, and matches are greedy. So the whole string will match without the need for anchors.
Update
@Sawa makes a good point that the \A is necessary but the \Z is not.
Actually, looking at it again, the positive lookahead seems unnecessary:
/\A(?!.*bar).*foo.*/m
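Here is a small sketch of why the \A matters, comparing the anchored pattern with the unanchored first attempt. The method names and sample strings are invented for the example.

```ruby
# Anchored version from the update: must contain "foo" and no "bar".
def foo_without_bar(text)
  text[/\A(?!.*bar).*foo.*/m]
end

# Unanchored first attempt, kept to show why \A matters.
def foo_without_bar_unanchored(text)
  text[/(?=.*foo)(?!.*bar).*/m]
end

ok  = "line one\nfoo here\nline three"
bad = "bar first\nfoo later"

puts foo_without_bar(ok) == ok                # true: the match is the whole string
puts foo_without_bar(bad).inspect             # nil
puts foo_without_bar_unanchored(bad).inspect  # "ar first\nfoo later"
```

Without the anchor, the engine simply retries one character later, where the remaining text no longer contains "bar", and reports a match from there.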
A regex that matches an entire string that does not include foo is:
/\A(?!.*foo.*).*\z/m
and a regex that matches from the beginning of an entire string that includes bar is:
/\A.*bar/m
Since you want to satisfy both of these, take a conjunction of these by putting one of them in a lookahead:
/\A(?=.*bar)(?!.*foo.*).*\z/m

How to conflate consecutive gsubs in ruby

I have the following
address.gsub(/^\d*/, "").gsub(/\d*-?\d*$/, "").gsub(/\# ?\d*/,"")
Can this be done in one gsub? I would like to pass a list of patterns rather than just one pattern - they are all being replaced by the same thing.
You could combine them with an alternation operator (|):
address = '6 66-666 #99 11-23'
address.gsub(/^\d*|\d*-?\d*$|\# ?\d*/, "")
# " 66-666  "
address = 'pancakes 6 66-666 # pancakes #99 11-23'
address.gsub(/^\d*|\d*-?\d*$|\# ?\d*/,"")
# "pancakes 6 66-666 pancakes  "
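If the goal is literally to pass a list of patterns, Regexp.union builds the same kind of alternation from an array. A minimal sketch, where the method name strip_patterns and its default list are only illustrative:

```ruby
# Strips every pattern in the list from the address in a single gsub call.
# The default list repeats the three patterns from the question.
def strip_patterns(address, patterns = [/^\d*/, /\d*-?\d*$/, /\# ?\d*/])
  address.gsub(Regexp.union(patterns), "")
end

puts strip_patterns('6 66-666 #99 11-23').inspect
```

The alternatives are tried in array order, so the order of the list matters in the same way as the order inside a hand-written alternation.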
You might want to add a little more whitespace cleanup. And you might want to switch to one of:
/\A\d*|\d*-?\d*\z|\# ?\d*/
/\A\d*|\d*-?\d*\Z|\# ?\d*/
depending on what your data really looks like and how you need to handle newlines.
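How the anchors differ only shows up when the data contains newlines, so here is a rough sketch with made-up addresses. It uses \d+ rather than \d* purely to keep the output easy to read.

```ruby
# Removes a leading number at the start of every line (^).
def strip_leading_per_line(text)
  text.gsub(/^\d+/, "")
end

# Removes a leading number only at the start of the whole string (\A).
def strip_leading_whole_string(text)
  text.gsub(/\A\d+/, "")
end

two_lines = "12 Main St\n34 Oak Ave"
puts strip_leading_per_line(two_lines).inspect      # " Main St\n Oak Ave"
puts strip_leading_whole_string(two_lines).inspect  # " Main St\n34 Oak Ave"

# \z only matches at the very end; \Z also matches just before a final newline.
puts "Main St 12\n".gsub(/\d+\z/, "").inspect  # "Main St 12\n"
puts "Main St 12\n".gsub(/\d+\Z/, "").inspect  # "Main St \n"
```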
Combining the regexes is a good idea--and relatively simple--but I'd like to recommend some additional changes. To wit:
address.gsub(/^\d+|\d+(?:-\d+)?$|\# *\d+/, "")
Of your original regexes, ^\d* and \d*-?\d*$ will always match, because they don't have to consume any characters. So you're guaranteed to perform two replacements on every line, even if that's just replacing empty strings with empty strings. Of my regexes, ^\d+ doesn't bother to match unless there's at least one digit at the beginning of the line, and \d+(?:-\d+)?$ matches what looks like an integer-or-range expression at the end of the line.
Your third regex, \# ?\d*, will match any # character, and if the # is followed by a space and some digits, it'll take those as well. Judging by your other regexes and my experience with other questions, I suspect you meant to match a # only if it's followed by one or more digits, with optional spaces intervening. That's what my third regex does.
If any of my guesses are wrong, please describe what you were trying to do, and I'll do my best to come up with the right regex. But I really don't think those first two regexes, at least, are what you want.
EDIT (in answer to the comment): When working with regexes, you should always be aware of the distinction between a regex that matches nothing and a regex that doesn't match. You say you're applying the regexes to street addresses. If an address doesn't happen to start with a house number, ^\d* will match nothing--that is, it will report a successful match, said match consisting of the empty string preceding the first character in the address.
That doesn't matter to you; you're just replacing it with another empty string anyway. But why bother doing the replacement at all? If you change the regex to ^\d+, it will report a failed match and no replacement will be performed. The result is the same either way, but the "matches nothing" scenario (^\d*) results in a lot of extra work that the "doesn't match" scenario avoids. In a high-throughput situation, that could be a life-saver.
The other two regexes bring additional complications: \d*-?\d*$ could match a hyphen at the end of the string (e.g. "123-", or even "-"); and \# ?\d* could match a hash symbol anywhere in the string, not just as part of an apartment/office number. You know your data, so you probably know neither of those problems will ever arise; I'm just making sure you're aware of them. My regex \d+(?:-\d+)?$ deals with the trailing-hyphen issue, and \# *\d+ at least makes sure there are digits after the hash symbol.
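The distinction between "matches nothing" and "doesn't match" is easy to check directly. In this sketch the helper match_kind and the sample strings are hypothetical; it just reports which of the three outcomes a pattern produces.

```ruby
# Reports :no_match, :empty_match, or the matched text for a pattern.
def match_kind(text, pattern)
  m = pattern.match(text)
  return :no_match if m.nil?
  m[0].empty? ? :empty_match : m[0]
end

puts match_kind("Main St", /^\d*/).inspect         # :empty_match
puts match_kind("Main St", /^\d+/).inspect         # :no_match
puts match_kind("123-", /\d*-?\d*$/).inspect       # "123-"
puts match_kind("123-", /\d+(?:-\d+)?$/).inspect   # :no_match
puts match_kind("see # below", /\# ?\d*/).inspect  # "# "
puts match_kind("see # below", /\# *\d+/).inspect  # :no_match
```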
I think that if you combine them in a single gsub() regex, as an alternation, it changes the context of the starting search position.
For example, each of these substitutions starts at the beginning of the result of the previous substitution:
s/^\d*//g
s/\d*-?\d*$//g
s/\# ?\d*//g
and this
s/^\d*|\d*-?\d*$|\# ?\d*//g
resumes the search where the last match left off and could potentially produce a different overall output, especially since many of the subexpressions search for similar, if not the same, characters, distinguished only by line anchors.
I think your regexes are distinct enough in this case, and of course changing the order changes the result.
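For what it's worth, here is one constructed input where the two approaches do give different results. The input "#12" is a made-up edge case rather than a realistic address, and the method names are only for illustration.

```ruby
# Three separate passes, each starting over on the previous result.
def strip_sequential(address)
  address.gsub(/^\d*/, "").gsub(/\d*-?\d*$/, "").gsub(/\# ?\d*/, "")
end

# One pass with the same patterns joined by alternation.
def strip_combined(address)
  address.gsub(/^\d*|\d*-?\d*$|\# ?\d*/, "")
end

puts strip_sequential("#12").inspect  # ""
puts strip_combined("#12").inspect    # "#"
```

In the combined version, ^\d* succeeds with an empty match at position 0, so gsub steps over the # before the third alternative gets a chance to consume it; only the trailing 12 is removed. Using \d+ instead of \d* in the first alternative avoids that empty match.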

Regular expression syntax

I have a problem similar to a previously asked question, but similar practices apparently do not produce similar results.
Previous Question
New question - I want to match the lines beginning in T as the first match, and the following lines beginning with X as the second match (as a whole string, to be later matched by another regex)
What I have so far is (^T(\d+)\n(.*?)(?:the_problem)/m). I don't know what to replace "the_problem" with, or even whether that is the issue. I assumed some rendition of (?:\n|\z), but apparently not. Nothing I tried would treat the next occurrence of ^T(\d+) as the start of a new group while still capturing all of the lines between occurrences.
Sample text:
T01C0.025
T02C0.035
T03C0.055
T04C0.150
T05C0.065
T06C0.075
%
G05
G90
T01
X011200Y004700
X011200Y009700
X018500Y011200
X013500Y-011200
X023800Y019500
T02
X034800Y017800
X-033800Y-017800
X032800Y017800
T03
X036730Y003000
X038700Y003000
X040668Y-003000
X059230Y003000
T04
X110580Y017800
X023800Y027300
X095500Y028500
X005500Y-006500
X021500Y-006500
T05
X003950Y002000
X003950Y004500
X003950Y007000
T06
X026300Y027300
M30
I only want to capture the shorter version of T01, T02,...T0n, not the longer version at the top, then the entire collection of ^X(-?\d+)Y(-?\d+) that follows it, as another match.
Result 1.
Match 1. T01
Match 2. X011200Y004700
X011200Y009700
X018500Y011200
X013500Y-011200
X023800Y019500
Result 2.
Match 1. T02
Match 2. X034800Y017800
X-033800Y-017800
X032800Y017800
Result 3.
Match 1. T03
Match 2. X036730Y003000
X038700Y003000
....etc....
Thanks in advance for any help ;-) Note: I prefer to use raw Ruby, without extensions or plugins. My version of ruby is 1.8.6.
Try this instead:
^(T[^\s]+)[\n\r\s]((?:(?:X\S+)[\n\r\s])+)
It makes the groups for the X lines into non-capturing groups, then puts all the repetitions of the final pattern into a single group. All the X lines will be in a single capture.
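To collect every tool block rather than just the first, the same pattern can be handed to String#scan. A minimal sketch, assuming Unix line endings and using an abbreviated copy of the sample data; the method name tool_blocks is made up.

```ruby
# Pairs each short tool line (e.g. "T01") with the block of X lines after it.
def tool_blocks(text)
  text.scan(/^(T[^\s]+)[\n\r\s]((?:(?:X\S+)[\n\r\s])+)/)
end

# Abbreviated copy of the sample text from the question.
sample = [
  "T01C0.025", "T02C0.035", "%", "G05", "G90",
  "T01", "X011200Y004700", "X011200Y009700",
  "T02", "X034800Y017800", "X-033800Y-017800",
  "M30"
].join("\n") + "\n"

tool_blocks(sample).each do |tool, coords|
  puts "#{tool}: #{coords.split.inspect}"
end
```

The header lines such as T01C0.025 are skipped because they are not followed by an X line, which the pattern requires at least once.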
You can test this using Rubular (an indispensable tool for developing regular expressions) http://rubular.com/r/PRnurKy64Q
This seems to work:
^(T[^\s]+)[\n\r\s]((X[^\s]+)[\n\r\s]){1,}
I'm not totally sure I understand your problem, but I'll give this a shot. It looks like you want:
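/^(T\d+)\n((?:X[-A-Z\d]+\n)+)/
In Ruby, ^ and $ already match after and before newlines without any extra flag, but both are zero-width, so the newline between lines has to be consumed explicitly with \n. To collect every block, use the pattern with String#scan rather than a /g flag, which Ruby does not have. Word of caution: I don't have much practice with multiline regexes, so you might want to sanity-check the line handling against your data.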
Also, I notice you didn't include the lines similar to T01C0.025 in your sample results, so I made the T\d+ assumption based on that.

gsub partial replace

I would like to replace only the group in parentheses in this expression:
my_string.gsub(/<--MARKER_START-->(.)*<--MARKER_END-->/, 'replace_text')
so that I get: <--MARKER_START-->replace_text<--MARKER_END-->
I know I could repeat the whole MARKER_START and MARKER_END blocks in the substitution expression but I thought there should be a more simple way to do this.
You can do it with zero width look-ahead and look-behind assertions.
This regex should work in Ruby 1.9, in Perl, and in many other places:
s.gsub( /(?<=<--MARKER_START-->).*?(?=<--MARKER_END-->)/, 'replacement text' )
Note: Ruby 1.8 only supports look-ahead assertions. You need both look-ahead and look-behind to do this properly.
What happens in Ruby 1.8 is that the ?<= causes it to crash because it doesn't understand the look-behind assertion. For that part, you then have to fall back to using a backreference, as Greg Hewgill mentions.
So what you get is:
s.gsub( /(<--MARKER_START-->).*?(?=<--MARKER_END-->)/, '\1replacement text' )
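Here are the two variants side by side as a runnable sketch. The method names and the sample string are invented, and the marker spelling with underscores is taken from the question; the first method needs an engine with look-behind support (Ruby 1.9 or later).

```ruby
# Lookaround version: only the text between the markers is matched.
def replace_with_lookaround(text, replacement)
  text.gsub(/(?<=<--MARKER_START-->).*?(?=<--MARKER_END-->)/, replacement)
end

# Fallback without look-behind: capture the start marker and
# put it back with \1.
def replace_with_backreference(text, replacement)
  text.gsub(/(<--MARKER_START-->).*?(?=<--MARKER_END-->)/, '\1' + replacement)
end

s = "a <--MARKER_START-->old<--MARKER_END--> b"
puts replace_with_lookaround(s, "replace_text")
puts replace_with_backreference(s, "replace_text")
```

Both calls print the same string; the difference is only in how much of the input each pattern consumes.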
EXPLANATION THE FIRST:
I've replaced the (.)* in the middle of your regex with .*? - this is non-greedy.
If you don't have non-greedy, then your regex will try and match as much as it can - if you have 2 markers on one line, it goes wrong. This is best illustrated by example:
"<b>One</b> Two <b>Three</b>".gsub( /<b>.*<\/b>/, 'BOLD' )
=> "BOLD"
What we actually want:
"<b>One</b> Two <b>Three</b>".gsub( /<b>.*?<\/b>/, 'BOLD' )
=> "BOLD Two BOLD"
EXPLANATION THE SECOND:
zero-width-look-ahead-assertion sounds like a giant pile of nerdly confusion.
What "look-ahead assertion" actually means is "Only match if the thing we are looking for is followed by this other stuff."
For example, only match a digit, if it is followed by an F.
"123F" =~ /\d(?=F)/ # will match the 3, but not the 1 or the 2
What "zero width" actually means is "consider the 'followed by' in our search, but don't count it as part of the match when doing replacement or grouping or things like that."
Using the same example of 123F, If we didn't use the lookahead assertion, and instead just do this:
"123F" =~ /\dF/ # will match 3F, because F is considered part of the match
As you can see, this is ideal for checking for our <--MARKER_END-->, but what we need for the <--MARKER_START--> is the ability to say "Only match if the thing we are looking for FOLLOWS this other stuff". That's called a look-behind assertion, which Ruby 1.8 doesn't have for some strange reason.
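The zero-width part is easiest to see in a replacement, so here is a small sketch on the same 123F example. The method names and the "X" replacement are made up.

```ruby
# Replaces only the digit that is followed by "F"; the F itself is kept
# because the look-ahead is zero-width.
def mark_digit_before_f(text)
  text.gsub(/\d(?=F)/, "X")
end

# Without the look-ahead, the F is part of the match and gets replaced too.
def mark_digit_and_f(text)
  text.gsub(/\dF/, "X")
end

puts mark_digit_before_f("123F")  # 12XF
puts mark_digit_and_f("123F")     # 12X
```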
Hope that makes sense :-)
PS: Why use lookahead assertions instead of just backreferences? If you use lookahead, you're not actually replacing the <--MARKER--> bits, only the contents. If you use backreferences, you are replacing the whole lot. I don't know if this incurs much of a performance hit, but from a programming point of view it seems like the right thing to do, as we don't actually want to be replacing the markers at all.
You could do something like this:
my_string.gsub(/(<--MARKER_START-->)(.*)(<--MARKER_END-->)/, '\1replace_text\3')
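One thing to watch with the greedy (.*) is input that contains more than one marker pair. A rough sketch, where the non-greedy variant is my assumption (borrowed from the advice in the other answer) and the method names are illustrative:

```ruby
# Greedy middle group, as in the answer above.
def replace_greedy(text)
  text.gsub(/(<--MARKER_START-->)(.*)(<--MARKER_END-->)/, '\1replace_text\3')
end

# Non-greedy variant: stops at the first end marker it finds.
def replace_lazy(text)
  text.gsub(/(<--MARKER_START-->)(.*?)(<--MARKER_END-->)/, '\1replace_text\3')
end

two = "<--MARKER_START-->a<--MARKER_END--> and <--MARKER_START-->b<--MARKER_END-->"
puts replace_greedy(two)
puts replace_lazy(two)
```

With a single marker pair the two behave the same; with two pairs the greedy version swallows everything between the first start marker and the last end marker.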
