Removing parenthesis and digit from string with regex - ruby

I have strings that look like this:
Executive Producer (3)
Producer (0)
1st Assistant Camera (12)
I'd like to use a regex to match the first part of the string and to remove the " (num)" part (the space preceding the parentheses and the parenthesis/digit in the parentheses). After using the regex I'd want to have my vars equal to: "Executive Producer", "Producer", "1st Assistant Camera"
If you know any resources for learning regexes that would be great too.

You just have to select all the characters except the final parenthesis and their numeric content:
(.+) \(\d+\)
The first two parenthesis capture the content (here, all content, declared by the point). Then, you want two parenthesis (be careful to the slash), meaning we do not want these parenthesis to capture the "\d+" expression, which is a number.
One of my favorite regex site: http://www.regular-expressions.info/

Maybe s/([\s\w]+\w)\s*\(\d+\)/\1/?
I don't know Ruby, so you'd have to translate it to its own regexp syntax.

Related

Discard contractions from string

I have a special use case where I want to discard all the contractions from the string and select only words followed by alphabets which do not contain any special character.
For eg:
string = "~ ASAP ASCII Achilles Ada Stackoverflow James I'd I'll I'm I've"
string.scan(/\b[A-z][a-z]+\b/)
#=> ["Achilles", "Ada", "Stackoverflow", "James", "ll", "ve"]
Note: It's not discarding the whole word I'll and I've
Can someone please help how to discard the whole word which contains contractions?
Try this Regex:
(?:(?<=\s)|(?<=^))[a-zA-Z]+(?=\s|$)
Explanation:
(?:(?<=\s)|(?<=^)) - finds the position immediately preceded by either start of the line or by a white-space
[a-zA-Z]+ - matches 1+ occurrences of a letter
(?=\s|$) - The substring matched above must be followed by either a whitespace or end of the line
Click for Demo
Update:
To make sure that not all the letters are in upper case, use the following regex:
(?:(?<=\s)|(?<=^))(?=\S*[a-z])[a-zA-Z]+(?=\s|$)
Click for Demo
The only thing added here is (?=\S*[a-z]) which means that there must be atleast one lowercase letter
I know that there's an accepted answer already, but I'd like to give my own shot:
(?<=\s|^)\w+[a-z]\w*
You can test it here. This regex is shorter and more efficient (157 steps against 315 from the accepted answer).
The explanation is rather simple:
(?<=\s|^)- This is a positive look behind. It means that we want strings preceded by a whitespace character or the start of the string.
\w+[a-z]\w* - This one means that we want strings composed by letters only (word characters) containing least one lowercase letter, thus discarding words which are whole uppercase. Along with the positive look behind, the whole regex ends up discarding words containing special characters.
NOTE: this regex won't take into account one-letter words. If you want to accomplish that, then you should use \w*[a-z]\w* instead, with a little efficiency cost.

Ruby regex syntax for "not matching one of the following"

Nice simple regex syntax question for you.
I have a block of text and i want to find instances of href=" or href=' which are NOT followed by either [ or http://
I can get "not followed by [" with
record.body =~ /href=['"](?!\[)/
and i can get "not followed by http://" with
record.body =~ /href=['"](?!http\:\/\/)/
But i can't quite work out how to combine the two.
Just to be clear: i want to find bad strings like this
`href="www.foo.com"`
but i'm ok with (ie don't want to find) strings like this
`href="http://www.foo.com"`
`href="[registration_url]"`
Combine the both by using the alternation operator.
href=['"](?!http\:\/\/|\[)
For more specific, it would be.
href=(['"])(?!http\:\/\/|\[)(?:(?!\1).)*\1
This would handle both single quoted or double quoted string in the href part. And this won't match the strings like href='foo.com" or href="foo.com' (unmatched quotes)
(['"]) would capture double quote or single quote. (?!http\:\/\/|\[) and the matched quote won't be followed by http:// or [, if yes, then it moves on to the next pattern. (?:(?!\1).)* matches any character but not of the captured character, zero or more times. \1 followed by the captured character.
DEMO
Use alternative list with pipe | symbol to combine the look-ahead conditions:
(?!http\:\/\/|\[)
So, to match the hrefs, you can use the following regex:
href=\"((?!http\:\/\/|\[)[^\"]+?)\"
See demo on Rubular.com.

NP++: Regular expression

I have a text with many expressions like this <.....>, e.g.:
<..> Text1 <.sdfdsvd> Text 2 <....dgdfg> Text3 <...something> Text4
How can I eliminate now all brackets <...> and all commands/texts between these brackets? But the other "real" text between these (like text1, text2 above) should not be touched.
I tried with the regular expression:
<.*>
But this finds also a block like this, including the inbetween text:
<..> Text1 <.sdfdsvd>
My second try was to search for alle expressions <.> without a third bracket between these two, so I tried:
<.*[^>^<]>
But that does not work either, no change in behavior. How to construct the needed expression correctly?
This works in Notepad++:
Find what: <[^>]+?>
Replace with: nothing
Try it out: http://regex101.com/r/lC9mD4
There are a few problems with your attempt: <.*[^>^<]>
.* matches all characters up through the final possible match. This means that all tags except the last will be bypassed. This is called greedy. In my solution, I have changed it to possessive, which goes up to the first possible match: .*?...although I apply this to the character class itself: [^>]+?.
[^>^<] is incorrect for two reasons, one small, one big. The small reason is that the first caret ^ says "do not match any of the following characters", and the characters following it are >, ^, and <. So you are saying you don't want to match the caret character, which is incorrect (but not harmful). The larger problem is that this is attempting to match exactly one character, when it needs to be one or more, which is signified by the plus sign: [^><]+.
Otherwise, your attempt is not that far off from my solution.
This seems to work:
<[^\s]*>
It looks for a left bracket, then anything that isn't whitespace between the brackets, then a right bracket. It would need some adjusting if there's whitespace between the brackets (<text1 text2>), though, and at that point a modification of one of your attempts would work better:
<[^<^>]*>
This one looks for a left bracket, then anything that isn't a left bracket or right bracket, then a right bracket.
Try <.*?>. If you don't use the "?", regular expressions will try to find the longest string that matches. Using "*?" will force to find the shortest.

Regex - Matching text AFTER certain characters

I want to scrape data from some text and dump it into an array. Consider the following text as example data:
| Example Data
| Title: This is a sample title
| Content: This is sample content
| Date: 12/21/2012
I am currently using the following regex to scrape the data that is specified after the 'colon' character:
/((?=:).+)/
Unfortunately this regex also grabs the colon and the space after the colon. How do I only grab the data?
Also, I'm not sure if I'm doing this right.. but it appears as though the outside parens causes a match to return an array. Is this the function of the parens?
EDIT: I'm using Rubular to test out my regex expressions
You could change it to:
/: (.+)/
and grab the contents of group 1. A lookbehind works too, though, and does just what you're asking:
/(?<=: ).+/
In addition to #minitech's answer, you can also make a 3rd variation:
/(?<=: ?)(.+)/
The difference here being, you create/grab the group using a look-behind.
If you still prefer the look-ahead rather than look-behind concept. . .
/(?=: ?(.+))/
This will place a grouping around your existing regex where it will catch it within a group.
And yes, the outside parenthesis in your code will make a match. Compare that to the latter example I gave where the entire look-ahead is 'grouped' rather than needlessly using a /( ... )/ without the /(?= ... )/, since the first result in most regular expression engines return the entire matched string.
I know you are asking for regex but I just saw the regex solution and found that it is rather hard to read for those unfamiliar with regex.
I'm also using Ruby and I decided to do it with:
line_as_string.split(": ")[-1]
This does what you require and IMHO it's far more readable.
For a very long string it might be inefficient. But not for this purpose.
In Ruby, as in PCRE and Boost, you may make use of the \K match reset operator:
\K keeps the text matched so far out of the overall regex match. h\Kd matches only the second d in adhd.
So, you may use
/:[[:blank:]]*\K.+/ # To only match horizontal whitespaces with `[[:blank:]]`
/:\s*\K.+/ # To match any whitespace with `\s`
Seee the Rubular demo #1 and the Rubular demo #2 and
Details
: - a colon
[[:blank:]]* - 0 or more horizontal whitespace chars
\K - match reset operator discarding the text matched so far from the overall match memory buffer
.+ - matches and consumes any 1 or more chars other than line break chars (use /m modifier to match any chars including line break chars).

Ruby regex for text within parentheses

I am looking for a regex to replace all terms in parentheses unless the parentheses are within square brackets.
e.g.
(matches) #match
[(do not match)] #should not match
[[does (not match)]] #should not match
I current have:
[^\]]\([^()]*\) #Not a square bracket, an opening bracket, any non-bracket character and a closing bracket.
However this is still matching words within the square brackets.
I have also created a rubular page of my progress so far: http://rubular.com/r/gG22pFk2Ld
A regex is not going to cut it for you if you can nest the square brackets (see this related question).
I think you can only do this with a regex if (a) you only allow one level of square brackets and (b) you assume all square brackets are properly matched. In that case
\([^()]*\)(?![^\[]*])
is sufficient - it matches any parenthesised expression not followed by an unpaired ]. You need (b) because of the limitations of negative lookbehind (only fixed length strings in 1.9, and not allowed at all in 1.8), which mean you are stuck matching (match)] even if you don't want to.
So basically if you need to nest, or to allow unmatched brackets, you should ditch the regex and look at the answer to the question I linked to above.
This is a type of expression you cannot parse using a pure-regex approach, because you need to keep track of the current nesting/state_if_in_square_bracket (so you don't have a type 3 language anymore).
However, depending on the exact circumstances, you can parse it with multiple regexes or simple parsers. Example approaches:
Split into sub-strings, delimited by
[/[[or ]/]], change the state
when such a square bracket is
encountered, replace () in a
sub-string if in
"not_in_square_bracket" state
Parse for square brackets (including content), remove & remember them (these are "comments"), now replace all the content in normal brackets and re-add the square brackets stuff (you can remember stuff by using unique temp strings)
The complexity of your solution also depends on the detail if escaping ] is allowed.

Resources