NP++: Regular expression - expression

I have a text with many expressions like this <.....>, e.g.:
<..> Text1 <.sdfdsvd> Text 2 <....dgdfg> Text3 <...something> Text4
How can I remove all brackets <...> together with the commands/text between them? The other "real" text outside the brackets (like Text1, Text 2 above) should not be touched.
I tried with the regular expression:
<.*>
But this also matches a block like this, including the text in between:
<..> Text1 <.sdfdsvd>
My second try was to search for all expressions <.> without a third bracket between these two, so I tried:
<.*[^>^<]>
But that does not work either; the behavior does not change. How do I construct the needed expression correctly?

This works in Notepad++:
Find what: <[^>]+?>
Replace with: nothing
Try it out: http://regex101.com/r/lC9mD4
There are a few problems with your attempt: <.*[^>^<]>
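.* matches as many characters as possible, up to the last position where the rest of the pattern can still match. This means a single match runs from the first < to the final >, swallowing every tag and all the text in between. This is called greedy. In my solution, I have changed it to lazy (also called reluctant or non-greedy), which stops at the first possible match: .*?...although I apply this to the character class itself: [^>]+?. (With the negated class the ? is optional, since [^>] can never run past a >.)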
[^>^<] is incorrect for two reasons, one small, one big. The small reason is that the first caret ^ says "do not match any of the following characters", and the characters following it are >, ^, and <. So you are saying you don't want to match the caret character, which is incorrect (but not harmful). The larger problem is that the class matches exactly one character, while the greedy .* in front of it can still run across any number of brackets. The .* should be dropped and the class itself repeated one or more times, which is signified by the plus sign: [^><]+.
Otherwise, your attempt is not that far off from my solution.
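The difference between the greedy, lazy and negated-class patterns can be checked outside Notepad++ as well. Below is a minimal Ruby sketch, assuming the sample line from the question; the constant SAMPLE and the helper strip_tags are names made up for this illustration.

```ruby
# Sample text from the question (SAMPLE is a hypothetical name for this sketch).
SAMPLE = "<..> Text1 <.sdfdsvd> Text 2 <....dgdfg> Text3 <...something> Text4"

# Remove every match of the given pattern from the text.
def strip_tags(pattern, text = SAMPLE)
  text.gsub(pattern, "")
end

p strip_tags(/<.*>/)     # greedy: one match from the first < to the last >
p strip_tags(/<.*?>/)    # lazy: each tag is matched on its own
p strip_tags(/<[^>]+>/)  # negated class: cannot cross a > at all
```

The greedy pattern leaves only the text after the last tag, while the other two leave all four text pieces in place (with the original spaces around them).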

This seems to work:
<[^\s]*>
It looks for a left bracket, then anything that isn't whitespace between the brackets, then a right bracket. It would need some adjusting if there's whitespace between the brackets (<text1 text2>), though, and at that point a modification of one of your attempts would work better:
<[^<^>]*>
This one looks for a left bracket, then anything that isn't a left bracket or right bracket, then a right bracket.
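To see where the whitespace-based pattern falls short, here is a small Ruby sketch. The two helper names and the input strings are assumptions for illustration only, and the second pattern is the one above with the stray inner caret dropped.

```ruby
# Remove tags that contain no whitespace between the brackets.
def strip_non_whitespace_tags(text)
  text.gsub(/<[^\s]*>/, "")
end

# Remove tags that contain neither < nor > between the brackets.
def strip_bracketed(text)
  text.gsub(/<[^<>]*>/, "")
end

p strip_non_whitespace_tags("<a b> x <c>")  # the tag with a space survives
p strip_non_whitespace_tags("x<a>y<b>")     # "y" is swallowed: no whitespace separates the tags
p strip_bracketed("<a b> x <c>")
p strip_bracketed("x<a>y<b>")
```

The second call shows another limitation of <[^\s]*>: when two tags are not separated by whitespace, the text between them is removed too.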

Try <.*?>. Without the "?", the regular expression is greedy and finds the longest string that matches. Using "*?" forces it to find the shortest.

Related

Minus sign that appears not to designate a range and escaping a closing parenthesis

I have 2 questions regarding the following regex from Why's Poignant Guide to Ruby:
1: What does the minus sign mean here? It doesn't seem to be designating a range because there is nothing to the left of it other than the bracket.
2: Why is it necessary to escape the closing parenthesis? After you escape the opening one, what special meaning could the closing parenthesis have?
/\([-\w]+\)/
1) When the minus sign is at the beginning or at the end of a character class, it is treated as a literal.
2) An unescaped closing parenthesis closes a group, so it has to be escaped to be matched literally (Ruby rejects an unmatched one). Escaping it also avoids any ambiguity with a possible opening parenthesis earlier in the pattern. Consider these examples:
/(\([-\w]+\))/ or /(\([-\w]+)\)/
1) The minus sign is a literal minus sign. Since it cannot possibly designate a range, it has no special meaning and so the character class is equivalent to [\-\w] - escaping the hyphen is optional, as you observe in your second point...
2) ...however, it isn't always good form to not escape something just because the regular expression engine allows it. For example, this regex: ([([^)-]+) is perfectly valid (I think...) but entirely unclear because of the fact that characters which normally have special meanings are used as literal characters without being escaped. Valid, yes, but not obvious, and someone who doesn't know all the rules will become very confused trying to understand it.
The minus sign -, or hyphen, here means exactly the character -. The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret. It's not designating a range, so it's not confusing. You can also choose to use \- if you like.
As for why ) is escaped, I think it reduces the regex engine's work, so that it doesn't have to remember whether an opening parenthesis came before.
The - sign in this regex is a literal - sign that you want to see in the text.
Unescaped parentheses define a capture group, which is then available to you, for example, through the $1 variable.
> "(-w)" =~ /\([-\w]+\)/
> $1 # => nil
and
> "(-w)" =~ /([-\w]+)/
> $1 # => -w
You can go to Rubular and try both regexes \([-\w]+\) and ([-\w]+) - and you will see different results by passing (-w) as a test. You can notice match groups appearing.
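The points made in these answers can also be checked in plain Ruby. The sketch below is only an illustration: parenthesised_word and valid_regex? are hypothetical helper names, and the input strings are made up.

```ruby
# Return the first parenthesised run of word characters and hyphens, or nil.
def parenthesised_word(text)
  text[/\([-\w]+\)/]
end

# True if the source string compiles as a regular expression.
def valid_regex?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

p parenthesised_word("see (-w) flag")  # the leading hyphen in the class is literal
p "a-c".scan(/[-\w]+/)                 # hyphen first: literal, so "a-c" is one run
p "a-c".scan(/[a-c]+/)                 # hyphen between two characters: a range
p valid_regex?('\([-\w]+\)')           # both parentheses escaped
p valid_regex?('\([-\w]+)')            # unescaped ) with no open group
```

The last call returns false in Ruby, which suggests that escaping the closing parenthesis is more than a convention there.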

Removing parenthesis and digit from string with regex

I have strings that look like this:
Executive Producer (3)
Producer (0)
1st Assistant Camera (12)
I'd like to use a regex to match the first part of the string and to remove the " (num)" part (the space preceding the parentheses and the parenthesis/digit in the parentheses). After using the regex I'd want to have my vars equal to: "Executive Producer", "Producer", "1st Assistant Camera"
If you know any resources for learning regexes that would be great too.
You just have to select all the characters except the final parenthesis and their numeric content:
(.+) \(\d+\)
The first pair of parentheses captures the content (here, any characters, as declared by the dot). The second pair is escaped (note the backslashes), so these parentheses are matched literally instead of capturing the "\d+" expression, which is a number.
One of my favorite regex site: http://www.regular-expressions.info/
Maybe s/([\s\w]+\w)\s*\(\d+\)/\1/?
I don't know Ruby, so you'd have to translate it to its own regexp syntax.
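A possible Ruby translation of both suggestions is sketched below. The method names captured_title and sed_style_title are invented for this example; the patterns themselves are the ones given above.

```ruby
# First suggestion: capture everything before " (digits)" and return group 1.
def captured_title(label)
  label[/(.+) \(\d+\)/, 1]
end

# Second suggestion: substitute the whole match with its first group,
# the Ruby equivalent of s/([\s\w]+\w)\s*\(\d+\)/\1/.
def sed_style_title(label)
  label.sub(/([\s\w]+\w)\s*\(\d+\)/, '\1')
end

["Executive Producer (3)", "Producer (0)", "1st Assistant Camera (12)"].each do |label|
  p [captured_title(label), sed_style_title(label)]
end
```

Note that captured_title returns nil when the label has no " (num)" suffix, while sed_style_title returns such a label unchanged.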

Ruby regex for text within parentheses

I am looking for a regex to replace all terms in parentheses unless the parentheses are within square brackets.
e.g.
(matches) #match
[(do not match)] #should not match
[[does (not match)]] #should not match
I currently have:
[^\]]\([^()]*\) #Not a square bracket, an opening bracket, any non-bracket character and a closing bracket.
However this is still matching words within the square brackets.
I have also created a rubular page of my progress so far: http://rubular.com/r/gG22pFk2Ld
A regex is not going to cut it for you if you can nest the square brackets (see this related question).
I think you can only do this with a regex if (a) you only allow one level of square brackets and (b) you assume all square brackets are properly matched. In that case
\([^()]*\)(?![^\[]*])
is sufficient - it matches any parenthesised expression not followed by an unpaired ]. You need (b) because of the limitations of negative lookbehind (only fixed length strings in 1.9, and not allowed at all in 1.8), which mean you are stuck matching (match)] even if you don't want to.
So basically if you need to nest, or to allow unmatched brackets, you should ditch the regex and look at the answer to the question I linked to above.
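Under assumptions (a) and (b), the lookahead pattern can be tried in Ruby roughly as follows. This is a sketch: OUTSIDE_BRACKETS and remove_unbracketed_parens are hypothetical names, and the closing ] is written as \] here only to keep the pattern unambiguous.

```ruby
# A parenthesised expression that is not followed by a ] without a [ before it.
OUTSIDE_BRACKETS = /\([^()]*\)(?![^\[]*\])/

# Delete every parenthesised expression that sits outside square brackets.
def remove_unbracketed_parens(text)
  text.gsub(OUTSIDE_BRACKETS, "")
end

p remove_unbracketed_parens("(matches) #match")
p remove_unbracketed_parens("[(do not match)]")
p remove_unbracketed_parens("[[does (not match)]]")
p remove_unbracketed_parens("(a) [(b)] (c)")
```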
This is a type of expression you cannot parse using a pure-regex approach, because you need to keep track of the current nesting/state_if_in_square_bracket (so you don't have a type 3 language anymore).
However, depending on the exact circumstances, you can parse it with multiple regexes or simple parsers. Example approaches:
Split into sub-strings, delimited by [/[[ or ]/]], change the state when such a square bracket is encountered, replace () in a sub-string if in "not_in_square_bracket" state
Parse for square brackets (including content), remove & remember them (these are "comments"), now replace all the content in normal brackets and re-add the square brackets stuff (you can remember stuff by using unique temp strings)
The complexity of your solution also depends on whether escaping ] is allowed.
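The state-tracking idea can be sketched in Ruby as a small tokenising loop. This is one possible reading of the approach, not a definitive implementation: the method name is made up, escaped brackets are not handled, and parenthesised groups that themselves contain square brackets are left alone.

```ruby
# Drop "(...)" groups that are not inside any square brackets.
def strip_parens_outside_brackets(text)
  depth = 0   # current square-bracket nesting level
  out = +""   # unfrozen result buffer
  # Tokens: "[", "]", a "(...)" group without nested brackets, or any single character.
  text.scan(/\[|\]|\([^()\[\]]*\)|./m) do |token|
    case token
    when "["
      depth += 1
    when "]"
      depth -= 1 if depth > 0
    end
    # Only the "(...)" alternative can produce a token longer than one character.
    group = token.length > 1
    out << token unless group && depth.zero?
  end
  out
end

p strip_parens_outside_brackets("(matches) x [(do not match)] [[does (not match)]] (also)")
```

Because the nesting depth is a counter, this handles nested square brackets, which the single-regex approach cannot.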

Dot operator in negative bracket expression

The Ruby in Tim Bray's Wide Finder benchmark (http://wikis.sun.com/display/WideFinder/The+Benchmark) has this line:
%r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
I've been using regexes for a long time, but I'm not sure what the point of the "." is. It seems to match on anything that's not a space, but [^ ] would do that anyway.
When I first looked at it, it looked to me like it would match on nothing except possibly a line break.
Can anybody explain the behavior of this expression?
[^ .] means match any single character apart from a space or a literal period. The period does not have a special meaning when inside square brackets.
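The effect of the literal period can be seen with the benchmark regex itself. In the Ruby sketch below, ARTICLE and article_path are hypothetical names, and the two request lines are made-up inputs rather than real log data.

```ruby
# The pattern from the benchmark, unchanged.
ARTICLE = %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }

# Return the captured article path, or nil if the line does not match.
def article_path(line)
  line[ARTICLE, 1]
end

p article_path("GET /ongoing/When/200x/2003/10/11/Page HTTP/1.1")
p article_path("GET /ongoing/When/200x/2003/10/11/pic.png HTTP/1.1")
p "pic.png rest"[/[^ .]+/]  # stops at the period
p "pic.png rest"[/[^ ]+/]   # stops only at the space
```

Since the class must be followed directly by a space, a path containing a period (such as one with a file extension) does not match at all.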

regular expression back referencing

Why does this snippet:
'He said "Hello"' =~ /(\w)\1/
match "ll"? I thought that the \w part matches "H", and hence \1 refers to "H", so nothing should be matched. Why this result?
I thought that the \w part matches "H"
\w matches any alphanumeric character (and underscore). It also happens to match H, but that is not terribly interesting, since the regular expression then goes on to say that the same character has to appear twice in a row, which H doesn't in your text. Neither does any of the other characters, except l. So the regular expression matches ll.
You're thinking of /^(\w)\1/. The caret symbol specifies that the match must start at the beginning of the line. Without that, the match can start anywhere in the string (it will find the first match).
And you're right, nothing was matched at that position. The regex engine then moved on and found a match further along, which it returned to you.
\w of course matches any word character, not just 'H'.
The point is, "\1" matches the same text that the "(\w)" group just captured; only the letter "l" is doubled, so that is what matches your regex.
A nice page for toying around with ruby and regular expressions is Rubular
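The same behaviour can be reproduced directly in Ruby. The sketch below uses two made-up helper names; the first reports where the unanchored pattern matches, the second anchors it at the start of the line as discussed above.

```ruby
# Position, matched text and captured group for the first doubled word character.
def doubled_word_char(text)
  m = /(\w)\1/.match(text)
  m && [m.begin(0), m[0], m[1]]
end

# Same pattern, but the doubled character must be at the start of a line.
def doubled_at_line_start(text)
  m = /^(\w)\1/.match(text)
  m && m[0]
end

p doubled_word_char('He said "Hello"')      # the engine skips ahead to the "ll"
p doubled_at_line_start('He said "Hello"')  # "He" is not a doubled character
p doubled_at_line_start("oops")
```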

Resources