Dot operator in negative bracket expression - ruby

The Ruby in Tim Bray's Wide Finder benchmark (http://wikis.sun.com/display/WideFinder/The+Benchmark) has this line:
%r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
I've been using regexes for a long time, but I'm not sure what the point of the "." is. It seems to match on anything that's not a space, but [^ ] would do that anyway.
When I first looked at it, it looked to me like it would match on nothing except possibly a line break.
Can anybody explain the behavior of this expression?

[^ .] means match any single character apart from a space or a literal period. The period does not have a special meaning when inside square brackets.
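A small Ruby sketch of this behavior, using the pattern from the question. The two log fragments are made up for illustration and are not taken from the benchmark data. One consequence, assuming request lines shaped like these, is that a path containing a period cannot match at all: the class stops at the period, and the pattern then requires a space.

```ruby
# The pattern quoted in the question, copied as-is.
WIDE_FINDER = %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }

# Hypothetical log fragments, invented for this illustration.
article = "GET /ongoing/When/200x/2007/09/20/Wide-Finder HTTP/1.1"
image   = "GET /ongoing/When/200x/2007/09/20/Wide-Finder.png HTTP/1.1"

# String#[regexp, 1] returns the first capture group, or nil if nothing matches.
article_hit = article[WIDE_FINDER, 1]
image_hit   = image[WIDE_FINDER, 1]

puts article_hit.inspect  # "2007/09/20/Wide-Finder"
puts image_hit.inspect    # nil, because "." is excluded and a space must follow
```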

Related

What does "1\/1." mean in Ruby?

I am learning Ruby and I have something to match with (/^1\/1. Guess a word from an anagram [RUBY]{4}$/)
Please, what does "1\/1." mean in this expression. Can anyone explain what's going on for me.
Thanks
Generally speaking, a backslash in a regular expression escapes the next character, so that it's treated as an ordinary character rather than whatever its special meaning would be. For instance a* matches zero or more of the letter a, but a\* matches, literally, an a followed by a star. Since most regular expressions in Ruby are wrapped in the delimiter /, we can't directly put forward slashes in our regex. If we had written
/^1/1. Guess a word from an anagram [RUBY]{4}$/
Then the regex would be /^1/ and the rest of the line would be a very confusing syntax error. This is for the same reasons that we can't put " characters directly inside of a "-delimited string.
So a backslash treats it as an actual slash in the expression rather than a delimiter.
/^1\/1. Guess a word from an anagram [RUBY]{4}$/
This literally matches a 1 followed by a slash followed by a 1 at the start of the line. Note that the . after it is not escaped, so it matches any single character, not just a period.
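As a sketch, the same pattern can also be written with Ruby's %r{...} literal, which avoids the escape because / is no longer the delimiter. The sample lines below are hypothetical.

```ruby
# Slash-delimited literal: the inner / must be escaped.
escaped = /^1\/1. Guess a word from an anagram [RUBY]{4}$/
# %r{} literal: the inner / needs no escape.
percent = %r{^1/1. Guess a word from an anagram [RUBY]{4}$}

# Hypothetical input lines for illustration.
line      = "1/1. Guess a word from an anagram YRUB"
odd_line  = "1/1x Guess a word from an anagram RUBY"

slash_ok   = escaped.match?(line)
percent_ok = percent.match?(line)
# The unescaped . accepts any character, so "1/1x" matches too.
dot_is_wild = escaped.match?(odd_line)

puts slash_ok, percent_ok, dot_is_wild
```

To require a literal period after the second 1, the dot would have to be written as \. in either form.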

NP++: Regular expression

I have a text with many expressions like this <.....>, e.g.:
<..> Text1 <.sdfdsvd> Text 2 <....dgdfg> Text3 <...something> Text4
How can I eliminate now all brackets <...> and all commands/texts between these brackets? But the other "real" text between these (like text1, text2 above) should not be touched.
I tried with the regular expression:
<.*>
But this finds also a block like this, including the inbetween text:
<..> Text1 <.sdfdsvd>
My second try was to search for all expressions <.> without a third bracket between these two, so I tried:
<.*[^>^<]>
But that does not work either, no change in behavior. How to construct the needed expression correctly?
This works in Notepad++:
Find what: <[^>]+?>
Replace with: nothing
Try it out: http://regex101.com/r/lC9mD4
There are a few problems with your attempt: <.*[^>^<]>
.* matches all characters up through the final possible match. This means that all tags except the last will be bypassed. This is called greedy. In my solution, I have changed it to lazy (also called reluctant or non-greedy), which goes up to the first possible match: .*?...although I apply this to the character class itself: [^>]+?.
[^>^<] is incorrect for two reasons, one small, one big. The small reason is that the first caret ^ says "do not match any of the following characters", and the characters following it are >, ^, and <. So you are saying you don't want to match the caret character, which is incorrect (but not harmful). The larger problem is that this is attempting to match exactly one character, when it needs to be one or more, which is signified by the plus sign: [^><]+.
Otherwise, your attempt is not that far off from my solution.
This seems to work:
<[^\s]*>
It looks for a left bracket, then anything that isn't whitespace between the brackets, then a right bracket. It would need some adjusting if there's whitespace between the brackets (<text1 text2>), though, and at that point a modification of one of your attempts would work better:
<[^<^>]*>
This one looks for a left bracket, then anything that isn't a left bracket or right bracket, then a right bracket.
Try <.*?>. If you don't use the "?", the regular expression will try to find the longest string that matches. Using "*?" forces it to find the shortest.
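The difference between the greedy, lazy, and negated-class variants can be sketched in Ruby using the sample text from the question. This uses Ruby's regex engine only as an illustration; Notepad++ uses a different engine, so treat it as a demonstration of the quantifier behavior rather than of Notepad++ itself.

```ruby
text = "<..> Text1 <.sdfdsvd> Text 2 <....dgdfg> Text3 <...something> Text4"

# Greedy: .* runs to the last ">" on the line, swallowing the text in between.
greedy = text.gsub(/<.*>/, "")

# Lazy: .*? stops at the first ">" after each "<".
lazy = text.gsub(/<.*?>/, "")

# Negated class: [^>]+ can never cross a ">", so no lazy marker is needed.
negated = text.gsub(/<[^>]+>/, "")

puts greedy.inspect   # " Text4"
puts lazy.inspect     # " Text1  Text 2  Text3  Text4"
puts negated.inspect  # same as the lazy result
```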

Regex Tag-Within-Tag

I have a fairly simple regex problem for a little personal experiment that I haven't quite figured out.
In a string, I might have several <tag>[some characters here] that I need to match. The obvious way to do it would be with a /<tag>\[.*?\]/ regex, to match any characters after the <tag>[ and before the ].
I'd like to be able to have <tag>s within <tag>s, however. This causes a problem. If I had the following:
<tag>[some characters <tag>[in here] to match]
the regex would stop matching as soon as it reached the first closing-bracket, and completely fail to match the last part of the statement. I've tried to solve the problem by telling the regex to ignore any internal <tag>s, so I can do a match on the stripped contents later. I haven't quite gotten it working. The closest I've come is:
/<tag>\[(.*?(?:<tag>\[.*?\])*?.*?)\]/
which doesn't quite work. I would hope that it would match any number of characters, and any inner tags if they exist. It still has trouble with that first closing bracket, however.
Maybe somebody who's better at regular expressions knows a good solution to this.
Though you should probably drop regex and do this manually if the mini-language becomes more complex, you can use recursive regex.
Your regex would look something like this:
/(?<reg>(\w+\[([^\]\[]|\g<reg>)*\]))/
You can see it in action here: http://rubular.com/r/9F7isgZpj9
Here is the regex broken down to its parts:
(?<reg>( # start a regex named "reg"
\w+ # the tag name
\[ # open bracket
( # which can contain
[^\]\[] # non-bracket characters
| # or
\g<reg> # sub-tags (this is where the magic happens)
)* # zero or more times
\] # close the tag
)
)
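A minimal Ruby sketch of the recursive pattern in use. Note that \w+ expects a plain word as the tag name, so the first sample input is written as tag[...]. The second pattern, BY_LITERAL, is a hypothetical adaptation (not from the answer) for the literal <tag>[...] form shown in the question.

```ruby
# Pattern from the answer: a \w+ tag name directly followed by [...].
BY_NAME = /(?<reg>(\w+\[([^\]\[]|\g<reg>)*\]))/

# Hypothetical adaptation for the literal "<tag>[" prefix used in the question.
BY_LITERAL = /(?<reg><tag>\[(?:[^\]\[]|\g<reg>)*\])/

plain  = "x tag[some characters tag[in here] to match] y"
angled = "x <tag>[some characters <tag>[in here] to match] y"

# String#[regexp] returns the whole matched substring.
plain_hit  = plain[BY_NAME]
angled_hit = angled[BY_LITERAL]

puts plain_hit   # tag[some characters tag[in here] to match]
puts angled_hit  # <tag>[some characters <tag>[in here] to match]
```

In both cases \g<reg> re-enters the named group, which is what lets the outer match continue past the inner closing bracket.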

Ruby regex for text within parentheses

I am looking for a regex to replace all terms in parentheses unless the parentheses are within square brackets.
e.g.
(matches) #match
[(do not match)] #should not match
[[does (not match)]] #should not match
I currently have:
[^\]]\([^()]*\) #Not a square bracket, an opening bracket, any non-bracket character and a closing bracket.
However this is still matching words within the square brackets.
I have also created a rubular page of my progress so far: http://rubular.com/r/gG22pFk2Ld
A regex is not going to cut it for you if you can nest the square brackets (see this related question).
I think you can only do this with a regex if (a) you only allow one level of square brackets and (b) you assume all square brackets are properly matched. In that case
\([^()]*\)(?![^\[]*])
is sufficient - it matches any parenthesised expression not followed by an unpaired ]. You need (b) because of the limitations of negative lookbehind (only fixed length strings in 1.9, and not allowed at all in 1.8), which mean you are stuck matching (match)] even if you don't want to.
So basically if you need to nest, or to allow unmatched brackets, you should ditch the regex and look at the answer to the question I linked to above.
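A short Ruby sketch of the lookahead pattern, with the closing ] escaped to keep the literal unambiguous. The sample strings are invented for illustration. The second one shows the nesting limitation described above.

```ruby
# Parenthesised text that is not followed by an unpaired "]".
OUTSIDE = /\([^()]*\)(?![^\[]*\])/

# One level of square brackets: only (a) and (c) are replaced.
one_level = "(a) [(b)] (c)".gsub(OUTSIDE, "X")

# Nested square brackets: (a) is inside the outer [ ], yet the lookahead
# runs into the inner "[" before any "]" and so lets the match through.
nested = "[(a) [b]]".gsub(OUTSIDE, "X")

puts one_level  # X [(b)] X
puts nested     # [X [b]]
```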
This is a type of expression you cannot parse using a pure-regex approach, because you need to keep track of the current nesting/state_if_in_square_bracket (so you don't have a type 3 language anymore).
However, depending on the exact circumstances, you can parse it with multiple regexes or simple parsers. Example approaches:
Split into sub-strings, delimited by [/[[ or ]/]], change the state when such a square bracket is encountered, replace () in a sub-string if in "not_in_square_bracket" state
Parse for square brackets (including content), remove & remember them (these are "comments"), now replace all the content in normal brackets and re-add the square brackets stuff (you can remember stuff by using unique temp strings)
The complexity of your solution also depends on whether escaping ] is allowed.
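The state-tracking idea can be sketched in Ruby as a single pass over the characters. This is only one possible reading of the approach: the method name strip_parens_outside_brackets is hypothetical, and the sketch assumes parentheses are not nested and that ] cannot be escaped.

```ruby
# Removes "(...)" groups that are not inside any square brackets.
# Assumes no nested parentheses and no escaped brackets.
def strip_parens_outside_brackets(text)
  depth = 0   # current nesting level of [ ... ]
  out = []
  i = 0
  while i < text.length
    ch = text[i]
    if ch == "["
      depth += 1
    elsif ch == "]"
      depth -= 1 if depth > 0
    elsif ch == "(" && depth.zero?
      close = text.index(")", i)
      if close
        i = close + 1   # skip the whole parenthesised group
        next
      end
    end
    out << ch
    i += 1
  end
  out.join
end

sample = "(matches) [(do not match)] [[does (not match)]] (x)"
stripped = strip_parens_outside_brackets(sample)
puts stripped.inspect  # " [(do not match)] [[does (not match)]] "
```

Because the depth counter handles any level of square-bracket nesting, this avoids the one-level restriction of the pure-regex version.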

regular expression back referencing

Why does this snippet:
'He said "Hello"' =~ /(\w)\1/
match "ll"? I thought that the \w part matches "H", and hence \1 refers to "H", so nothing should be matched. Why do I get this result?
I thought that the \w part matches "H"
\w matches any alphanumeric character (and underscore). It also happens to match H, but that's not terribly interesting, since the regular expression then goes on to say that the same character has to be matched twice in a row. H can't be in your text (it doesn't appear twice consecutively), and neither can any of the other characters, except l. So the regular expression matches ll.
You're thinking of /^(\w)\1/. The caret symbol specifies that the match must start at the beginning of the line. Without that, the match can start anywhere in the string (it will find the first match).
And you're right, nothing was matched at that position. Then the regex engine went further and found a match, which it returned to you.
\w of course matches any word character, not just 'H'.
The point is, "\1" means a repetition of whatever text the "(\w)" group captured; only the letter "l" is doubled, so only "ll" matches your regex.
A nice page for toying around with ruby and regular expressions is Rubular
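The behavior described in these answers can be checked with a few lines of Ruby, using the string from the question:

```ruby
text = 'He said "Hello"'

# Unanchored: the engine tries each start position until (\w)\1 succeeds.
doubled  = text[/(\w)\1/]        # the matched substring
position = text.index(/(\w)\1/)  # where that match starts

# Anchored with ^: only the start of the line is tried, and "He" is not a
# doubled character, so there is no match.
anchored = text[/^(\w)\1/]

puts doubled.inspect   # "ll"
puts position          # 11
puts anchored.inspect  # nil
```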

Resources