Regular expression match that excludes characters inside parenthesis - ruby

I have the following types of strings.
BILL SMITH (USA)
WINTHROP (FR)
LORD AT WAR (GB)
KIM SMITH
With these strings, I have the following constraints:
1. all caps
2. can be 2 to 18 characters long
3. should not have any white spaces or carriage returns at the end
4. the country abbreviation inside the parens should be excluded
5. some of the names will not have the country in parens and they should be matched too
After applying my regular expression I'd like to get the following:
BILL SMITH (USA) => BILL SMITH
WINTHROP (FR) => WINTHROP
LORD AT WAR (GB) => LORD AT WAR
KIM SMITH => KIM SMITH
I came up with the following regular expression but I'm not getting any matches:
String.scan(\([A-Z \s*]{1,18})(^?!(\([A-Z]{1,3}\)))\)
I've been banging my head on this for a while, so if anyone can point out the error I'd appreciate it.
UPDATE:
I've gotten some great responses; however, so far none of the regular expression solutions has met all the constraints. The tricky part seems to be that some of the strings have the country in parentheses and some don't. In one case, strings without the country were not matched; in another, the correct string was returned along with the country abbreviation, minus the parentheses. (See the comments on the second response.) One point of clarification: the part I want to match always begins at the start of the string. Not sure if that helps or not. Thanks again for all your help.

Here's one solution:
^((?:[A-Z]|\s){2,18}+?)(?:\s\([A-Z]+\))?$
See it on Rubular. Note that it counts 18 characters before the parenthesis - not sure how you want it to behave specifically. If you want to make sure the whole line isn't more than 18 characters, I suggest you just do unless line.length < 18 ... Similarly, if you want to make sure there is no whitespace at the end, I recommend you use line.strip. That'll greatly reduce the complexity of the Regexp you need and make your code more readable.
Edit: also works when no parentheses are used after the name.

The biggest error is that you wrote (^?!...) where you meant (?=...). The former means "an optional start-of-line anchor, followed by !, followed by ..., inside a capture group"; the latter means "a position in the string that is followed by ...". Fixing that, as well as making a few other tweaks, and adding the requirement that the initial string end with a letter, we get:
([A-Z\s]{1,17}[A-Z])(?=\s*\([A-Z]{1,3}\))
Update based on OP comments: Since this will always match at the start of a string, you can use \A to "anchor" your pattern to the start of the string. You can then get rid of the lookahead assertion. This:
\A[A-Z][A-Z\s]{0,16}[A-Z]
matches start-of-string, followed by an uppercase letter, followed by up to 16 characters that are either uppercase letters or whitespace characters, followed by an uppercase letter.
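As a quick check, the anchored pattern can be run against the sample strings from the question. This is only a minimal sketch; the helper name extract_name is made up for illustration and is not part of the answer.

```ruby
# Hypothetical helper (the name is an assumption, not from the answer).
# Returns the leading all-caps name, or nil if the string does not start with one.
def extract_name(str)
  str[/\A[A-Z][A-Z\s]{0,16}[A-Z]/]
end

samples = ["BILL SMITH (USA)", "WINTHROP (FR)", "LORD AT WAR (GB)", "KIM SMITH"]
samples.each do |s|
  puts "#{s} => #{extract_name(s).inspect}"
end
```

Because the pattern has to end on an uppercase letter, the space before the parenthesis is never part of the match, and names without a country match the same way. Note that a name longer than 18 characters is not rejected; only its first 18 characters (at most) are returned.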

You can also just use gsub to remove the part(s) you don't want. To remove everything in parentheses you could do:
str.gsub(/\s*\([^)]*\)/, '')
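Applied to the strings from the question, the gsub approach might look like the sketch below. The helper name strip_parenthesized and the trailing strip call are additions for illustration, not part of the answer.

```ruby
# Hypothetical helper name; the pattern is the one from the answer above.
def strip_parenthesized(str)
  # Remove every "(...)" group together with the whitespace before it,
  # then trim any whitespace left at either end (the strip is an added step).
  str.gsub(/\s*\([^)]*\)/, '').strip
end

["BILL SMITH (USA)", "WINTHROP (FR)", "LORD AT WAR (GB)", "KIM SMITH\n"].each do |s|
  puts strip_parenthesized(s).inspect
end
```

This only removes the parenthesized part; it does not enforce the all-caps or length constraints, so those would need a separate check.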

Related

Discard contractions from string

I have a special use case where I want to discard all the contractions from a string and select only the words that consist solely of letters and contain no special characters.
For example:
string = "~ ASAP ASCII Achilles Ada Stackoverflow James I'd I'll I'm I've"
string.scan(/\b[A-z][a-z]+\b/)
#=> ["Achilles", "Ada", "Stackoverflow", "James", "ll", "ve"]
Note: it does not discard the whole words I'll and I've; it still returns "ll" and "ve".
Can someone please help me discard every whole word that contains a contraction?
Try this Regex:
(?:(?<=\s)|(?<=^))[a-zA-Z]+(?=\s|$)
Explanation:
(?:(?<=\s)|(?<=^)) - finds the position immediately preceded by either start of the line or by a white-space
[a-zA-Z]+ - matches 1+ occurrences of a letter
(?=\s|$) - The substring matched above must be followed by either a whitespace or end of the line
Update:
To make sure that not all the letters are in upper case, use the following regex:
(?:(?<=\s)|(?<=^))(?=\S*[a-z])[a-zA-Z]+(?=\s|$)
The only thing added here is (?=\S*[a-z]), which means that there must be at least one lowercase letter.
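To see the updated pattern at work on the string from the question, here is a small sketch. The method name words_without_contractions is an assumption made for this example.

```ruby
# Hypothetical wrapper around the updated regex from the answer above.
def words_without_contractions(text)
  # Word must start after whitespace or at line start, contain at least one
  # lowercase letter, consist of letters only, and end before whitespace or line end.
  text.scan(/(?:(?<=\s)|(?<=^))(?=\S*[a-z])[a-zA-Z]+(?=\s|$)/)
end

string = "~ ASAP ASCII Achilles Ada Stackoverflow James I'd I'll I'm I've"
p words_without_contractions(string)
# prints ["Achilles", "Ada", "Stackoverflow", "James"]
```

The contractions are dropped because the letters before the apostrophe are not followed by whitespace, and the letters after it are not preceded by whitespace.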
I know that there's an accepted answer already, but I'd like to give my own shot:
(?<=\s|^)\w+[a-z]\w*
You can test it here. This regex is shorter and more efficient (157 steps against 315 from the accepted answer).
The explanation is rather simple:
(?<=\s|^) - This is a positive lookbehind. It means that we want strings preceded by a whitespace character or the start of the string.
\w+[a-z]\w* - This one means that we want strings composed of word characters (letters, and also digits and underscores) containing at least one lowercase letter, thus discarding words that are entirely uppercase. Along with the positive lookbehind, the whole regex ends up discarding words that start with or are preceded by special characters.
NOTE: this regex won't take into account one-letter words. If you want to accomplish that, then you should use \w*[a-z]\w* instead, with a little efficiency cost.
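The difference between the two variants can be sketched as follows. The method name mixed_case_words and its allow_single_letter keyword are hypothetical, introduced only for this illustration.

```ruby
# Hypothetical helper: switches between the two patterns discussed above.
def mixed_case_words(text, allow_single_letter: false)
  pattern = if allow_single_letter
              /(?<=\s|^)\w*[a-z]\w*/  # also accepts one-letter words
            else
              /(?<=\s|^)\w+[a-z]\w*/  # needs at least two word characters
            end
  text.scan(pattern)
end

text = "a Ruby gem IS ok"
p mixed_case_words(text)                             # prints ["Ruby", "gem", "ok"]
p mixed_case_words(text, allow_single_letter: true)  # prints ["a", "Ruby", "gem", "ok"]
```

One caveat worth checking against your data: this pattern has no trailing lookahead, so a contraction whose first part contains a lowercase letter, such as isn't, still yields the partial match isn.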

Split sentence by period followed by a capital letter

I'm trying to find a regex that will split a piece of text into sentences at ./?/! that is followed by a space that is followed by a capital letter.
"Hello there, my friend. In other words, i.e. what's up, man."
should split to:
Hello there, my friend| In other words, i.e. what's up, man|
I can get it to split on ./?/!, but I have no luck getting the space and capital letter criteria.
What I came up with:
.split("/. \s[A-Z]/")
Split a piece of text into sentences based on the criteria that it is a ./?/! that is followed by a space that is followed by a capital letter.
You may use a regex based on a lookahead:
s = "Hello there, my friend. In other words, i.e. what's up, man."
puts s.split(/[!?.](?=\s+\p{Lu})/)
See the Ruby demo. In case you also need to split with the punctuation at the end of the string, use /[!?.](?=(?:\s+\p{Lu})|\s*\z)/.
Details:
[!?.] - matches a !, ? or . that is...
(?=\s+\p{Lu}) - (a positive lookahead) followed with 1+ whitespaces followed with 1 uppercase letter immediately to the right of the current location.
See the Rubular demo.
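The two patterns above can be compared side by side with a short sketch. The method name split_sentences and the include_final keyword are assumptions made for this example.

```ruby
# Hypothetical helper: picks one of the two lookahead patterns from the answer.
def split_sentences(text, include_final: false)
  pattern = if include_final
              /[!?.](?=(?:\s+\p{Lu})|\s*\z)/  # also splits on final punctuation
            else
              /[!?.](?=\s+\p{Lu})/            # only before whitespace + uppercase letter
            end
  text.split(pattern)
end

s = "Hello there, my friend. In other words, i.e. what's up, man."
p split_sentences(s)
# prints ["Hello there, my friend", " In other words, i.e. what's up, man."]
p split_sentences(s, include_final: true)
# prints ["Hello there, my friend", " In other words, i.e. what's up, man"]
```

Since the whitespace is only looked at and not consumed, the second piece keeps its leading space; calling strip on each piece would remove it.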
NOTE: If you need to split regular English text into sentences, you should consider using existing NLP solutions/libraries. See:
Pragmatic Segmenter
srx-english
The latter is based on regex, and can easily be extended with more regular expressions.
Apart from Wiktor's answer, you can also use lookarounds to find a zero-width position and split on it.
Regex: (?<=[.?!]\s)(?=[A-Z]) finds a zero-width position that is preceded by one of [.?!] plus a space and followed by an uppercase letter.
s = "Hello there, my friend. In other words, i.e. what's up, man."
puts s.split(/(?<=[.?!]\s)(?=[A-Z])/)
Output
Hello there, my friend.
In other words, i.e. what's up, man.
Update: Based on Cary Swoveland's comment.
If the OP wanted to break the string into sentences, I'd suggest (?<=[.?!])\s+(?=[A-Z]), as it removes the spaces between sentences and permits the number of such spaces to be greater than one.
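A short sketch of that last pattern, using an input with two spaces between the sentences. The method name sentences is made up for this example.

```ruby
# Hypothetical helper built on the pattern suggested in the update above.
def sentences(text)
  # Split on whitespace that follows . ? or ! and precedes an ASCII capital.
  text.split(/(?<=[.?!])\s+(?=[A-Z])/)
end

p sentences("Hello there, my friend.  In other words, i.e. what's up, man.")
# prints ["Hello there, my friend.", "In other words, i.e. what's up, man."]
```

The punctuation stays attached to each sentence because it sits inside the lookbehind. [A-Z] covers ASCII capitals only; \p{Lu} could be substituted if other uppercase letters need to count.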

How to select words with punctuation and exclude periods from abbreviations?

I have the following Ruby Regex that selects punctuation and excludes periods that are part of numbers:
/\p{L}+|(?!\.\d)[[:punct:]]/
The profit was 5.2 thousand dollars.
=> The profit was thousand dollars.
I have a regex that can select abbreviations (U.S.A., for example):
(?:[a-zA-Z]\.){2,}
The U.S.A. is located in North America.
=> U.S.A.
I would like to use the ideas behind these regexes so that I can select all of the words and punctuation in a sentence except for any periods in any abbreviation as:
The U.S.A. is located in North America!
=> The USA is located in North America!
Any ideas on how to accomplish this?
I think it should be done in 2 steps because you cannot match discontinuous parts of text with one matching iteration.
Use
s = 'The U.S.A. is located in North America!'
s = s.gsub(/\b(?:\p{L}\.){2,}/) { $~[0].gsub(".", "") }
puts s.scan(/\p{L}+|(?!\.\d)[[:punct:]]/)
See the Ruby demo
The first step is to run a gsub with the \b(?:\p{L}\.){2,} pattern (I added a word boundary to make sure the pattern only matches one-letter chunks). Within the block, the dots are stripped from the match value using a literal text replacement.
The second step is running your first regex within a scan to collect the chunks you need.
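Wrapped into a single method, the two steps might look like the sketch below. The method name tokens_without_abbreviation_dots is an assumption, and String#delete is used in place of the inner gsub purely as a stylistic choice.

```ruby
# Hypothetical helper combining the two steps described above.
def tokens_without_abbreviation_dots(text)
  # Step 1: collapse abbreviations such as "U.S.A." into "USA".
  collapsed = text.gsub(/\b(?:\p{L}\.){2,}/) { |abbr| abbr.delete('.') }
  # Step 2: collect words and punctuation, skipping dots that precede a digit.
  collapsed.scan(/\p{L}+|(?!\.\d)[[:punct:]]/)
end

p tokens_without_abbreviation_dots('The U.S.A. is located in North America!')
# prints ["The", "USA", "is", "located", "in", "North", "America", "!"]
```

Joining the tokens back into a sentence is left out here, since how punctuation should be re-attached depends on the use case.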
str = "The U.S.A. have 50.1415 states approx and are located in North America!"
str.gsub(/(?<!\p{L}\p{L})\P{L}*\.[^\p{L}\s]*/, '').squeeze(' ')
#⇒ "The USA have states approx and are located in North America!"
I think using a regex alone will be difficult, but I'll be glad to be corrected with a working solution.
My solution:
Parse the text that you don't want processed (the abbreviations) using your second regex first, and then use the first regex (which selects words and punctuation). This effectively hides the abbreviations from the first regex when it runs.
I have a similar requirement for a project. The key thing is to use the partition method, loop through the regexes (2 in your case), and make sure you don't apply a regex to a part of the string that was already "captured" by a previous regex in the loop.
You may use this class from github: SourceParser and use it like this:
parser = SourceParser.new
parser.regexter('abbrs', /(?:[a-zA-Z]\.){2,}/) # return matched as is
parser.regexter(
  'first regex',
  /\p{L}+|(?!\.\d)[[:punct:]]/,
  lambda do |token, regexp|
    "(#{token})"
  end
)
parser.parse("The U.S.A. is located in North America")
# => (The) U.S.A. (is) (located) (in) (North) (America)

How to extract first and last name from full name

I have a regex that, given the full name, is supposed to capture the first and last name. It should exclude the suffix, like "Jr.":
(.+)\s(.+(?!\sJr\.))
But this regex applied against the string Larry Farry Barry Jones Jr. gives the match:
1. Larry Farry Barry Jones
2. Jr.
Why is my negative lookahead failing to ignore the "Jr." when parsing the full name? I want match #2 to contain "Jones".
Rather than trying to do it with a single regex, I think the following would be more maintainable code.
full_name = "Larry Farry Barry Jones Jr."
name_parts = full_name.split - ["Jr."]
first_name, last_name = name_parts[0], name_parts[-1]
As a comment mentions, it is the first .+ that matches most of the string. The use of a lookahead seems incorrect here, as you do not want to return that value and do not need it to be included in a further match.
The following will split all words up but not return the 'Jr.' So you could take the first and last result.
(\w+\s)+?(?!\sJr\.)
I recommend Rubular for practicing Ruby RegExp.
The reason is that your .+ matches all the way to the end of the string, and only then is the lookahead checked. At that point no " Jr." follows (because we are already at the end), so the negative lookahead succeeds ==> perfect, we match!
But that is because your pattern is wrong. Better would be this:
\S+(?:\s(?!Jr\.)\S+)*
See it here on Regexr
Means:
\S+ match a series of at least one non whitespace character.
(?:\s(?!Jr\.)\S+)* Non capturing group: Match a whitespace and then, if it is not "Jr.", match the next series of non whitespace characters. This complete group can be repeated 0 or more times.
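The sketch below contrasts the original pattern with the corrected one and then picks out the first and last name. The method name first_and_last_name is hypothetical, and taking the first and last word of the match is an added step that is not part of the answer.

```ruby
# Hypothetical helper: match everything up to (but excluding) " Jr.",
# then take the first and last words of that match.
def first_and_last_name(full_name)
  core = full_name[/\S+(?:\s(?!Jr\.)\S+)*/]
  return nil if core.nil?

  parts = core.split
  [parts.first, parts.last]
end

full_name = "Larry Farry Barry Jones Jr."

# The original pattern: the lookahead is only checked at the very end.
p full_name.match(/(.+)\s(.+(?!\sJr\.))/).captures
# prints ["Larry Farry Barry Jones", "Jr."]

p first_and_last_name(full_name)
# prints ["Larry", "Jones"]
```

Other suffixes (Sr., III and so on) are not handled; the lookahead would need an alternation for those.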

Ruby regex: remove first name, leave last name

I am parsing a text and I want to ignore people's first names.
Examples (cases):
B.Obama => Obama
B. Obama => Obama
B . Obama => Obama
I managed to write this working Ruby regex:
"B.Obama".gsub(/\p{L}+\.(\p{L}+)/, '\\1')
However, it solves only one case. Also, it doesn't check whether the first letter is a capital.
So, what should a regex that handles all these cases look like?
Details: Ruby 1.9.2 and UTF-8 strings.
I gave it a little more thought and I like this better:
/^(\w+)[ .,](.+$)/
This will capture both the first name and last name in different capturing groups
i.e.
"Mark del cato".scan /^(\w+)[ .,](.+$)/
See Rubular for an example.
Or Try
^[^ .]+
This will pick up the first word on a line, that is, everything up to the first dot or space.
Hope it helps; see the example at Rubular.
Try
(\w+)$
\w+ matches one or more 'word' characters.
The $ is a zero-length match matching the end of the string.
Do you want to be able to pull second names from a piece of text? That could get very difficult. Can you post an excerpt of your text?
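For the three cases listed in the question, the end-anchored idea can be sketched as below. Both method names are hypothetical. \p{L} is used instead of \w on the assumption that non-ASCII letters should count, given the UTF-8 requirement. The second pattern, for names embedded in longer text, is one possible approach and does not come from any of the answers.

```ruby
# Hypothetical helper: the last run of letters at the end of the string.
def last_name(str)
  str[/\p{L}+\z/]
end

# Hypothetical helper for running text: drop a single capital initial,
# its dot, and any surrounding spaces when a capitalized word follows.
def drop_initials(text)
  text.gsub(/(?<!\p{L})\p{Lu}\s*\.\s*(?=\p{Lu})/, '')
end

["B.Obama", "B. Obama", "B . Obama"].each do |s|
  puts "#{s} => #{last_name(s)} / #{drop_initials(s)}"
end

puts drop_initials("Yesterday B. Obama spoke.")
```

The second pattern would also eat into abbreviations such as U.S.A., so real text would likely need extra checks or the excerpt the answer asks for.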
