How do I make part of a regular expression optional in Ruby? - ruby

To match the following:
On Mar 3, 2011 11:05 AM, "mr person"
wrote:
I have the following regular expression:
/(On.* (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, [12]\d{3}.* at \d{1,2}:\d{1,2} (?:AM|PM),.*wrote:)/m
Is there a way to make the "at" optional, so that the expression matches whether or not it is there?

Sure. Put it in parentheses, put a question mark after it. Include one of the spaces (since otherwise you'll be trying to match two spaces if the "at" is missing.) (at )? (or as someone else suggested, (?:at )? to avoid it being captured).
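As a rough sketch of that advice applied to the asker's own pattern (the constant name HEADER and the two sample strings are made up for illustration):

```ruby
# The asker's pattern, with " at " changed to " (?:at )?" so that the word
# and its trailing space are optional and are not captured.
HEADER = /(On.* (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, [12]\d{3}.* (?:at )?\d{1,2}:\d{1,2} (?:AM|PM),.*wrote:)/m

# Illustrative inputs: one without "at" and one with it.
without_at = %(On Mar 3, 2011 11:05 AM, "mr person"\nwrote:)
with_at    = %(On Mar 3, 2011 at 11:05 AM, "mr person" wrote:)

puts HEADER.match?(without_at)  # true
puts HEADER.match?(with_at)     # true
```

Because the space lives inside the optional group, the pattern never asks for two consecutive spaces when "at" is absent.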

Don't forget to use (?:...) so that the bracketed expression doesn't get captured, and keep the trailing space inside the group:
(?:at )?

Sure, you just need to group the optional part...
(at )*
And, ok, I guess that will match at at at at, so you might want to just do:
(at )?

Others got your answer. This is just an aside re: Regular Expressions.
When you say "conditions" in regular expressions, it refers to a construct of the regex language itself. As in any language, it's a branch in execution, but each branch is a different regular expression path rather than a block of ordinary code.
So in pseudocode: if (condition is true) match this regular sub-expression, else match this other sub-expression.
This conditional exists in advanced regular expression engines ... Perl.
Perl uses the most advanced regular expression engine that exists. In version 6 and beyond it will be an integral part of the language, where code and expression intermingle seamlessly.
Perl 5.10 has this construct:
(?(condition)yes-pattern|no-pattern).
Edit: Just a warning that where Perl goes, other languages tend to follow as far as regular expressions are concerned.
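For what it is worth, Ruby's own engine (Onigmo, Ruby 2.0 and later) accepts the same (?(condition)yes-pattern|no-pattern) syntax, with a group number or name as the condition. A minimal sketch, where the constant name BALANCED and the sample inputs are invented for illustration:

```ruby
# Group 1 optionally matches "(". The conditional (?(1)\)) then demands a
# closing ")" only when group 1 actually took part in the match.
BALANCED = /\A(\()?\d+(?(1)\))\z/

puts BALANCED.match?("123")    # true
puts BALANCED.match?("(123)")  # true
puts BALANCED.match?("(123")   # false
puts BALANCED.match?("123)")   # false
```

This is a different tool from the plain optional group the question asks about: (at )? is all that is needed there.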

Related

What does /anystring/ mean in ruby?

I came across this: /sera/ === coursera. What does /sera/ mean? Please tell me. I do not understand the meaning of the expression above.
It's a regular expression. The more formal version of same is this:
coursera.match(/sera/)
Or:
/sera/.match(coursera)
These are functionally similar: either a string is tested against a regular expression, or a regular expression is tested against a string.
The long explanation of your original code is: can the characters sera be found in the variable coursera?
If you do this:
"coursera".match(/sera/)
# => #<MatchData "sera">
You get a MatchData result which means it matched. For more complicated expressions you can capture parts of the string using arbitrary patterns and so on. The general rule here is regular expressions in Ruby look like /.../ or vaguely like %r[...] in form.
You may also see the =~ operator used which is something Ruby inherited from Perl. It also means match.
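The different spellings can be compared side by side. A small sketch, assuming the variable holds the string "coursera" (the names SERA and word are only for illustration):

```ruby
SERA = /sera/
word = "coursera"

# Regexp#=== returns true or false, which is why it works in case/when.
p(SERA === word)          # true

# =~ returns the index where the match starts, or nil when there is none.
p(word =~ SERA)           # 4

# match returns a MatchData object (or nil); [0] is the matched text.
p(word.match(SERA)[0])    # "sera"

# %r[...] is just another way of writing the same literal.
p(%r[sera] == SERA)       # true
```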

escape sequence \K for regular expression in boost library

I need to replace a look-behind expression with \K in boost (version 1.54) because of its limitation but it does not work. How can I do it or what is the problem? Is there any other way to convert this expression with lookahead?
"(?<=foo.*) bar" => "foo.*\K bar" ???
Bit of a late answer here...
According to the Boost.Regex 1.54 Documentation, use of Perl's \K is possible, and I have just confirmed it via testing in Sublime Text 3, which uses Boost.Regex for its regex searching engine. Furthermore, I see no obvious syntactical error with either of the forms you posted. The only thing I can think of is that you're using the regex inside a string literal, and haven't escaped the \. If that's the case, the correct regex for your example would be:
foo.*\\K bar
If that's not the case, one workaround (that obviously has performance implications) is to reverse the string, and then use a variable-width look-ahead.
The modified regex for your example would then be:
rab (?=.*oof)
I believe the problem is that Boost lookbehind pattern must be of fixed length.
Your expression contains a repeat .* which makes it variable length.
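Both suggestions can be tried out in Ruby, whose engine also supports \K (Ruby 2.0 and later). This is only a sketch of the idea in a different engine, not Boost code; the constant names and the sample text are invented for illustration, and Boost's behaviour should be checked separately:

```ruby
SAMPLE = "foo baz bar"

# \K discards everything matched so far, so only " bar" is reported.
KEEP_OUT = /foo.*\K bar/
p SAMPLE[KEEP_OUT]                          # " bar"

# Workaround: reverse the text and use a look-ahead on the reversed string,
# then reverse the match back.
REVERSED_FORM = /rab (?=.*oof)/
p SAMPLE.reverse[REVERSED_FORM].reverse     # " bar"
```

In a C++ string literal the backslash itself has to be escaped ("foo.*\\K bar"), which a Ruby regex literal does not require.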

Ruby Regular Expressions: Matching if substring doesn't exist

I'm having an issue trying to capture a group on a string:
"type=gist\nYou need to gist this though\nbecause its awesome\nright now\n</code></p>\n\n<script src=\"https://gist.github.com/3931634.js\"> </script>\n\n\n<p><code>Not code</code></p>\n"
My regex currently looks like this:
/<code>([\s\S]*)<\/code>/
My goal is to get everything in between the code tags. Unfortunately, it's matching up to the 2nd closing code tag. Is there a way to match everything inside the code tags up until the first occurrence of the closing code tag?
All repetition quantifiers in regular expressions are greedy by default (matching as many characters as possible). Make the * ungreedy, like this:
/<code>([\s\S]*?)<\/code>/
But please consider using a DOM parser instead. Regex is just not the right tool to parse HTML.
And I just learned that for going through multiple parts, calling scan on the string itself (here a variable named html; the /m flag lets . match newlines):
html.scan( /<code>(.*?)<\/code>/m ){
puts $1
}
is a very nice way of going through all occurrences of code - but yes, getting a proper parser is better...
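The difference between the greedy and the ungreedy quantifier shows up on a short input with two code elements. A minimal sketch (the sample HTML and the constant names are made up for illustration):

```ruby
SNIPPET = "<p><code>first</code></p>\n<p><code>second</code></p>"

GREEDY = /<code>([\s\S]*)<\/code>/    # runs on to the LAST </code>
LAZY   = /<code>([\s\S]*?)<\/code>/   # stops at the FIRST </code>

p SNIPPET[GREEDY, 1]           # "first</code></p>\n<p><code>second"
p SNIPPET[LAZY, 1]             # "first"

# scan collects every non-overlapping match of the lazy pattern.
p SNIPPET.scan(LAZY).flatten   # ["first", "second"]
```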

How can I simplify this regular expression?

The format I'm trying to match is:
# (Apple push notification codes)
"11a735e9 9f696c2f 700b2700 728042c6 137eeb7a 8442c27d 40e59d9e 3c7e0de7"
The simplest expression I can think of is: /((\w{8}\s){7}\w{8})/i
Can anyone think of a simpler one?
(I'm using Ruby regular expressions)
UPDATE - thanks to user1096188, I've removed \d - this is included in \w
You can detect a word boundary using \b, and use (?:...) to avoid creating capturing groups:
/(?:\w{8}\b\s?){8}/
You could do this if the end of the match is the end of the whole string:
(\w{8}(?:\s|$)){8}
Taking @zapthedingbat's solution one stage further, it looks like the code only contains hexadecimal characters (0-9 and a-f) and spaces. So you could possibly sacrifice a little simplicity for accuracy.
I'm making an assumption, but I suspect letters g to z are invalid.
If the format is hexadecimal only (you should check Apple's documentation to be sure), a tighter match would be:
/(?:[0-9a-f]{8}\b\s?){8}/
EDIT
In fact, in Ruby, it looks like you should be able to do:
/(?:\h{8}\b\s?){8}/
> "11a735e9 9f696c2f 700b2700 728042c6 137eeb7a 8442c27d 40e59d9e 3c7e0de7".match(/((\w{8}\s?)+)/)
> $&
=> "11a735e9 9f696c2f 700b2700 728042c6 137eeb7a 8442c27d 40e59d9e 3c7e0de7"
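To see what the switch from \w to \h buys, here is a small sketch. The \A and \z anchors are an addition for whole-string validation, not part of the answers above, and the constant and variable names are invented for illustration:

```ruby
# Exactly eight groups of eight hex digits, optionally separated by whitespace.
HEX_TOKEN  = /\A(?:\h{8}\b\s?){8}\z/
# Same shape with \w, which also accepts g-z and underscores.
WORD_TOKEN = /\A(?:\w{8}\b\s?){8}\z/

token   = "11a735e9 9f696c2f 700b2700 728042c6 137eeb7a 8442c27d 40e59d9e 3c7e0de7"
not_hex = token.sub("11a735e9", "zzzzzzzz")  # first group is no longer hex

puts HEX_TOKEN.match?(token)     # true
puts WORD_TOKEN.match?(not_hex)  # true
puts HEX_TOKEN.match?(not_hex)   # false
```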

gsub partial replace

I would like to replace only the group in parenthesis in this expression :
my_string.gsub(/<--MARKER_START-->(.)*<--MARKER_END-->/, 'replace_text')
so that I get : <--MARKER_START-->replace_text<--MARKER_END-->
I know I could repeat the whole MARKER_START and MARKER_END blocks in the substitution expression but I thought there should be a more simple way to do this.
You can do it with zero width look-ahead and look-behind assertions.
This regex should work in ruby 1.9, in perl, and in many other places:
s.gsub( /(?<=<--MARKER_START-->).*?(?=<--MARKER_END-->)/, 'replace_text' )
Note: ruby 1.8 only supports look-ahead assertions. You need both look-ahead and look-behind to do this properly.
What happens in ruby 1.8 is that the ?<= causes the regex to fail, because the engine doesn't understand the look-behind assertion. For that part, you then have to fall back to using a backreference, like Greig Hewgill mentions,
so what you get is
s.gsub( /(<--MARKER_START-->).*?(?=<--MARKER_END-->)/, '\1replace_text' )
EXPLANATION THE FIRST:
I've replaced the (.)* in the middle of your regex with .*? - this is non-greedy.
If you don't have non-greedy, then your regex will try and match as much as it can - if you have 2 markers on one line, it goes wrong. This is best illustrated by example:
"<b>One</b> Two <b>Three</b>".gsub( /<b>.*<\/b>/, 'BOLD' )
=> "BOLD"
What we actually want:
"<b>One</b> Two <b>Three</b>".gsub( /<b>.*?<\/b>/, 'BOLD' )
=> "BOLD Two BOLD"
EXPLANATION THE SECOND:
zero-width-look-ahead-assertion sounds like a giant pile of nerdly confusion.
What "look-ahead-assertion" actually means is "Only match if the thing we are looking for is followed by this other stuff".
For example, only match a digit, if it is followed by an F.
"123F" =~ /\d(?=F)/ # will match the 3, but not the 1 or the 2
What "zero width" actually means is "consider the 'followed by' in our search, but don't count it as part of the match when doing replacement or grouping or things like that".
Using the same example of 123F, If we didn't use the lookahead assertion, and instead just do this:
"123F" =~ /\dF/ # will match 3F, because F is considered part of the match
As you can see, this is ideal for checking for our <--MARKER_END-->, but what we need for the <--MARKER_START--> is the ability to say "Only match if the thing we are looking for FOLLOWS this other stuff". That's called a look-behind assertion, which ruby 1.8 doesn't have for some strange reason.
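The 123F comparison can be checked directly. A minimal sketch (the constant names are invented for illustration), showing that the look-ahead keeps the F out of both the match and the replacement:

```ruby
DIGIT_BEFORE_F = /\d(?=F)/   # look-ahead: F is checked but not consumed
DIGIT_AND_F    = /\dF/       # F is part of the match

p "123F"[DIGIT_BEFORE_F]            # "3"
p "123F"[DIGIT_AND_F]               # "3F"

# The difference matters as soon as the match is replaced.
p "123F".sub(DIGIT_BEFORE_F, "X")   # "12XF"
p "123F".sub(DIGIT_AND_F, "X")      # "12X"
```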
Hope that makes sense :-)
PS: Why use lookahead assertions instead of just backreferences? If you use lookahead, you're not actually replacing the <--MARKER--> bits, only the contents. If you use backreferences, you are replacing the whole lot. I don't know if this incurs much of a performance hit, but from a programming point of view it seems like the right thing to do, as we don't actually want to be replacing the markers at all.
You could do something like this:
my_string.gsub(/(<--MARKER_START-->)(.*)(<--MARKER_END-->)/, '\1replace_text\3')
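The two approaches from this thread can be put next to each other. A sketch under the assumption that the markers are spelled as in the question (the sample string and constant names are made up; the non-greedy .*? is used in both so that several marker pairs in one string are handled separately):

```ruby
MARKED = "a <--MARKER_START-->old<--MARKER_END--> b"

# Look-around: only the text between the markers is matched and replaced.
LOOKAROUND = /(?<=<--MARKER_START-->).*?(?=<--MARKER_END-->)/
# Backreferences: the markers are matched too, then written back.
CAPTURING  = /(<--MARKER_START-->).*?(<--MARKER_END-->)/

p MARKED.gsub(LOOKAROUND, 'replace_text')
p MARKED.gsub(CAPTURING, '\1replace_text\2')
# Both print "a <--MARKER_START-->replace_text<--MARKER_END--> b"
```

Single quotes around the replacement keep \1 and \2 from being interpreted as string escapes.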
