How to extract first and last name from full name - ruby

I have a regex that, given the full name, is supposed to capture the first and last name. It should exclude the suffix, like "Jr.":
(.+)\s(.+(?!\sJr\.))
But this regex applied against the string Larry Farry Barry Jones Jr. gives the match:
1. Larry Farry Barry Jones
2. Jr.
Why is my negative lookahead failing to ignore the "Jr." when parsing the full name? I want match #2 to contain "Jones".

Rather than trying to do it with a single regex, I think the following would be more maintainable:
full_name = "Larry Farry Barry Jones Jr."
name_parts = full_name.split - ["Jr."]
first_name, last_name = name_parts[0], name_parts[-1]
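
If more than one suffix can occur, the same split-based idea extends naturally. The sketch below is only an illustration: the SUFFIXES list and the first_and_last helper are hypothetical names, not part of the original answer, and the list of suffixes is an assumption.

```ruby
# Hypothetical list of suffixes to ignore; adjust to your data.
SUFFIXES = ["Jr.", "Sr.", "II", "III"].freeze

# Returns [first_name, last_name], ignoring any known suffix.
def first_and_last(full_name)
  parts = full_name.split - SUFFIXES  # Array#- drops every listed suffix
  [parts.first, parts.last]
end

p first_and_last("Larry Farry Barry Jones Jr.")  # => ["Larry", "Jones"]
```

Note that Array#- removes a listed word wherever it appears, not only at the end of the name.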

As a comment mentions, it is the first .+ that matches most of the string. The lookahead seems incorrect here, as you do not want to return that value and do not need it to be included in a further match.
The following matches each word separately but does not return the 'Jr.', so you could take the first and last results.
(\w+\s)+?(?!\sJr\.)
I recommend Rubular for practicing Ruby RegExp.

The reason is that your .+ matches up to the end of the string, and only then is the lookahead evaluated. Nothing follows at that point (we are already at the end), so the negative lookahead succeeds and the pattern matches.
But that is because your pattern is wrong. Better would be this:
\S+(?:\s(?!Jr\.)\S+)*
See it here on Regexr
Means:
\S+ match a series of at least one non whitespace character.
(?:\s(?!Jr\.)\S+)* Non capturing group: Match a whitespace and then, if it is not "Jr.", match the next series of non whitespace characters. This complete group can be repeated 0 or more times.
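
A short Ruby check of both patterns may make the difference concrete. This is a sketch only: the constant names and the first_and_last_name helper are hypothetical and not part of the answer.

```ruby
ORIGINAL_PATTERN  = /(.+)\s(.+(?!\sJr\.))/
SUGGESTED_PATTERN = /\S+(?:\s(?!Jr\.)\S+)*/

# Hypothetical helper: returns [first_name, last_name] without the suffix.
def first_and_last_name(full_name)
  name = full_name[SUGGESTED_PATTERN]
  first_name, *_middle, last_name = name.split
  [first_name, last_name]
end

full_name = "Larry Farry Barry Jones Jr."

# The greedy (.+) backtracks only to the last space, so group 2 is the suffix.
p full_name.match(ORIGINAL_PATTERN).captures  # => ["Larry Farry Barry Jones", "Jr."]

# The lookahead is now checked before every word, so the match stops at "Jones".
p full_name[SUGGESTED_PATTERN]                # => "Larry Farry Barry Jones"
p first_and_last_name(full_name)              # => ["Larry", "Jones"]
```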

Related

Google Analytics | Multiple Exact Matches In Advanced Filter

I have results for FOOD, FOOD20 and FOOD30, but a REGEX filter also returns other results that contain FOOD, such as DOGFOOD and CATFOOD.
I am trying to place an EXACT filter by using:
FOOD|FOOD20|FOOD30
to extract just these results instead of using REGEX. Unfortunately this is returning 0 results.
Is there another work around for this?
An exact filter is a literal string match, so you're explicitly looking for something matching all of "FOOD|FOOD20|FOOD30" exactly.
If you want to ensure that the value is exactly FOOD, FOOD20 or FOOD30, use REGEX matching, but precede each value with a caret (^), which marks the beginning of the line, and follow each value with the dollar sign ($), which marks the end of the line.
So, your REGEX expression would be:
^FOOD$|^FOOD20$|^FOOD30$
If your idea is to track anything that starts with "FOOD", followed by a number, and then ends, you can simplify your expression to the following:
^FOOD[0-9]*$
(The [0-9]* part means match the numbers 0 to 9 zero or more times, so it matches when there are no numbers after FOOD, or when there are some.)
This will match FOOD, FOOD20, FOOD30, FOOD99 and FOOD100, but not CATFOOD, DOGFOOD10, etc.
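
The anchoring logic can be checked outside Google Analytics. The sketch below uses Ruby's regex engine purely as a stand-in (an assumption: the filter's own engine is not being exercised here), and the constant names are made up for the illustration.

```ruby
EXACT_VALUES     = /^FOOD$|^FOOD20$|^FOOD30$/
FOOD_WITH_DIGITS = /^FOOD[0-9]*$/
SAMPLES = %w[FOOD FOOD20 FOOD30 FOOD99 FOOD100 CATFOOD DOGFOOD10].freeze

# Only the three listed values survive the anchored alternation.
p SAMPLES.grep(EXACT_VALUES)      # => ["FOOD", "FOOD20", "FOOD30"]

# FOOD followed by any number of digits, and nothing before or after.
p SAMPLES.grep(FOOD_WITH_DIGITS)  # => ["FOOD", "FOOD20", "FOOD30", "FOOD99", "FOOD100"]
```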

How to select words with punctuation and exclude periods from abbreviations?

I have the following Ruby Regex that selects punctuation and excludes periods that are part of numbers:
/\p{L}+|(?!\.\d)[[:punct:]]/
The profit was 5.2 thousand dollars.
=> The profit was thousand dollars.
I also have a regex that can select abbreviations (U.S.A., for example):
(?:[a-zA-Z]\.){2,}
The U.S.A. is located in North America.
=> U.S.A.
I would like to use the ideas behind these regexes so that I can select all of the words and punctuation in a sentence except for any periods in any abbreviation as:
The U.S.A. is located in North America!
=> The USA is located in North America!
Any ideas on how to accomplish this?
I think it should be done in 2 steps because you cannot match discontinuous parts of text with one matching iteration.
Use
s = 'The U.S.A. is located in North America!'
s = s.gsub(/\b(?:\p{L}\.){2,}/) { $~[0].gsub(".", "") }
puts s.scan(/\p{L}+|(?!\.\d)[[:punct:]]/)
See the Ruby demo
The first step is to run a gsub with the \b(?:\p{L}\.){2,} pattern (I added a word boundary to make sure the pattern only matches 1-letter chunks). Within the block, the dots are stripped from the match value using a literal text replacement.
The second step is running your first regex within a scan to collect the chunks you need.
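
The two steps described above can be wrapped in one method. This is a minimal sketch; the method name is hypothetical, and String#delete is used in place of the inner gsub as an equivalent way to drop the dots.

```ruby
# Step 1: collapse abbreviations such as "U.S.A." to "USA".
# Step 2: collect words and punctuation, skipping dots inside numbers.
def tokens_without_abbreviation_dots(sentence)
  collapsed = sentence.gsub(/\b(?:\p{L}\.){2,}/) { |abbr| abbr.delete(".") }
  collapsed.scan(/\p{L}+|(?!\.\d)[[:punct:]]/)
end

p tokens_without_abbreviation_dots("The U.S.A. is located in North America!")
# => ["The", "USA", "is", "located", "in", "North", "America", "!"]
```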
str = "The U.S.A. have 50.1415 states approx and are located in North America!"
str.gsub(/(?<!\p{L}\p{L})\P{L}*\.[^\p{L}\s]*/, '').squeeze(' ')
#⇒ "The USA have states approx and are located in North America!"
I think using regex alone will be difficult; I'll be glad to be corrected with a working solution.
My solution:
Parse the text that you don't want processed (the abbreviations) using your second regex first, and then use the first regex (which selects words and punctuation). This will effectively hide the abbreviations from processing when you run your first regex.
I have a similar requirement for a project. The key thing is to use the partition method, loop through the regexes (2 in your case) and make sure you don't apply a regex to the part of the string that was already "captured" by a previous regex in the loop.
You may use this class from github: SourceParser and use it like this:
parser = SourceParser.new
parser.regexter('abbrs', /(?:[a-zA-Z]\.){2,}/) # return matched as is
parser.regexter(
'first regex',
/\p{L}+|(?!\.\d)[[:punct:]]/,
lambda do |token, regexp|
"(#{token})"
end
)
parser.parse("The U.S.A. is located in North America")
# => (The) U.S.A. (is) (located) (in) (North) (America)
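
The partition idea in this answer can also be sketched without the SourceParser class. The method below is a hypothetical illustration, not that class's implementation: it cuts the text at each abbreviation with String#partition, tokenizes only the text in between, and keeps each abbreviation untouched. It assumes the protected pattern never matches an empty string.

```ruby
# Splits text into tokens while leaving anything matched by
# protected_pattern intact as a single token.
def tokenize_protecting(text, protected_pattern, token_pattern)
  tokens = []
  until text.empty?
    # before: text ahead of the next protected match
    # kept:   the protected match itself ("" when there is none)
    # text:   whatever remains to be processed
    before, kept, text = text.partition(protected_pattern)
    tokens.concat(before.scan(token_pattern))
    tokens << kept unless kept.empty?
  end
  tokens
end

p tokenize_protecting(
  "The U.S.A. is located in North America",
  /(?:[a-zA-Z]\.){2,}/,
  /\p{L}+|(?!\.\d)[[:punct:]]/
)
# => ["The", "U.S.A.", "is", "located", "in", "North", "America"]
```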

Ruby Regex: Match Until First Occurrence of Character

I have a file with lines that vary in their format, but the basic idea is like this:
- A block of text #tag #due(2014-04-20) #done(2014-04-22)
For example:
- Email John Doe #email #due(2014-04-20) #done(2014-04-22)
The issue is the #tag and the #due date do not appear in every entry, so some are just like:
- Email John Doe #done(2014-04-22)
I'm trying to write a Ruby Regex that finds the item between the "- " and the first occurrence of EITHER a hashtag or a #done/#due tag.
I have been trying to use groups and look ahead, but I can't seem to get it right when there are multiple instances of what I am looking ahead for. Using my second example string, this Regex:
/-\s(.*)(?=[#|#])/
Yields this result for (.*):
Email John Doe #email #due(2014-04-20)
Is there any way I can get this right? Thanks!
You're missing the ? after the quantifier to make it a non-greedy match. I would also remove | from your character class: inside a class it is matched literally as one of the listed characters, not treated as alternation.
/-\s(.*?)(?=[##])/
See Demo
You really don't need a Positive Lookahead here either, just match up until those characters and print the result from your capturing group.
/-\s(.*?)[##]/
You could also use negation in this case.
/-\s([^##]*)/
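
As a rough comparison of the lazy and negated variants on the sample lines, here is a small sketch. The method names are hypothetical, and a single # is used in the patterns because every tag in the samples starts with that character.

```ruby
# Lazy match: stop at the first "#" and drop the whitespace before it.
# Returns nil when the line contains no tag at all.
def task_text(line)
  line[/-\s(.*?)\s*#/, 1]
end

# Negated class: take everything that is not "#". This also works for
# lines without any tag, but keeps trailing whitespace, hence the strip.
def task_text_negated(line)
  text = line[/-\s([^#]*)/, 1]
  text && text.strip
end

p task_text("- Email John Doe #email #due(2014-04-20) #done(2014-04-22)")  # => "Email John Doe"
p task_text("- Email John Doe #done(2014-04-22)")                          # => "Email John Doe"
p task_text_negated("- Email John Doe")                                    # => "Email John Doe"
```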
This should do it:
str = "- Email John Doe #email #due(2014-04-20) #done(2014-04-22)"
str[/-(.*?)#|#due|#done/,1]
#=> " Email John Doe "
(.*?) is a capture group, with ? making .* non-greedy. The result of the capture is retrieved by the ,1 at the end.
Credit to @hwnd for noticing the need to make .* non-greedy shortly before I posted, though I did not see the comment until later.

Regular expression match that excludes characters inside parenthesis

I have the following types of strings.
BILL SMITH (USA)
WINTHROP (FR)
LORD AT WAR (GB)
KIM SMITH
With these strings, I have the following constraints:
1. all caps
2. can be 2 to 18 characters long
3. should not have any white spaces or carriage returns at the end
4. the country abbreviation inside the parens should be excluded
5. some of the names will not have the country in parens and they should be matched too
After applying my regular expression I'd like to get the following:
BILL SMITH (USA) => BILL SMITH
WINTHROP (FR) => WINTHROP
LORD AT WAR (GB) => LORD AT WAR
KIM SMITH => KIM SMITH
I came up with the following regular expression but I'm not getting any matches:
String.scan(\([A-Z \s*]{1,18})(^?!(\([A-Z]{1,3}\)))\)
I've been banging my head on this for a while, so if anyone can point out the error I'd appreciate it.
UPDATE:
I've gotten some great responses; however, so far none of the regular expression solutions have met all the constraints. The tricky part seems to be that some of the strings have the country in parentheses and some don't. In one case strings without the country were not being matched, and in another the solution was returning the correct string along with the country abbreviation without the parentheses. (See the comments on the second response.) One point of clarification: all of the strings that I will be matching will be at the start of the string. Not sure if that helps or not. Thanks again for all your help.
Here's one solution:
^((?:[A-Z]|\s){2,18}+?)(?:\s\([A-Z]+\))?$
See it on Rubular. Note that it counts 18 characters before the parenthesis; I'm not sure how you want it to behave specifically. If you want to make sure the whole line isn't more than 18 characters, I suggest you just do unless line.length <= 18 ... Similarly, if you want to make sure there is no whitespace at the end, I recommend you use line.strip. That'll greatly reduce the complexity of the Regexp you need and make your code more readable.
Edit: also works when no parentheses are used after the name.
The biggest error is that you wrote (^?!...) where you meant (?=...). The former means "an optional start-of-line anchor, followed by !, followed by ..., inside a capture group"; the latter means "a position in the string that is followed by ...". Fixing that, as well as making a few other tweaks, and adding the requirement that the initial string end with a letter, we get:
([A-Z\s]{1,17}[A-Z])(?=\s*\([A-Z]{1,3}\))
Update based on OP comments: Since this will always match at the start of a string, you can use \A to "anchor" your pattern to the start of the string. You can then get rid of the lookahead assertion. This:
\A[A-Z][A-Z\s]{0,16}[A-Z]
matches start-of-string, followed by an uppercase letter, followed by up to 16 characters that are either uppercase letters or whitespace characters, followed by an uppercase letter.
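
Applied to the sample strings from the question, the anchored pattern behaves as follows. This is just a quick check; the constant names are made up for the illustration.

```ruby
NAME_AT_START = /\A[A-Z][A-Z\s]{0,16}[A-Z]/
SAMPLE_LINES = ["BILL SMITH (USA)", "WINTHROP (FR)", "LORD AT WAR (GB)", "KIM SMITH"].freeze

# The final [A-Z] forces the match to end on a letter, so the space
# before the parenthesis is never included.
SAMPLE_LINES.each { |line| puts line[NAME_AT_START] }
# BILL SMITH
# WINTHROP
# LORD AT WAR
# KIM SMITH
```

Because the pattern has no end anchor, a name longer than 18 characters is truncated to its first 18 or fewer characters rather than rejected.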
You can also just use gsub to remove the part(s) you don't want. To remove everything in parentheses you could do:
str.gsub(/\s*\([^)]*\)/, '')
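
The gsub approach can be combined with the length and character constraints from the question. The following is a sketch under assumptions: clean_name and NAME_ONLY are hypothetical names, "2 to 18 characters" is taken to apply to the name after the country is removed, and only spaces are allowed between words.

```ruby
# 2 to 18 characters, capitals and spaces only, starting and ending with a letter.
NAME_ONLY = /\A[A-Z][A-Z ]{0,16}[A-Z]\z/

# Removes a parenthesized country, trims whitespace, and returns the
# name, or nil when the result breaks the constraints.
def clean_name(raw)
  name = raw.gsub(/\s*\([^)]*\)/, "").strip
  name if name.match?(NAME_ONLY)
end

p clean_name("BILL SMITH (USA)")  # => "BILL SMITH"
p clean_name("KIM SMITH")         # => "KIM SMITH"
p clean_name("bill smith (USA)")  # => nil
```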

Ruby regex: remove first name, leave last name

I am parsing a text and I want to ignore people's first names.
Examples (cases):
B.Obama => Obama
B. Obama => Obama
B . Obama => Obama
I managed to write this working Ruby regex:
"B.Obama".gsub(/\p{L}+\.(\p{L}+)/, '\\1')
However, it solves only one case. Also, it doesn't check whether the first letter is a capital.
So, what should a regex that combines all these cases look like?
Details: Ruby 1.9.2 and UTF-8 strings.
I gave it a little bit more thought and I like this better:
/^(\w+)[ .,](.+$)/
This will capture both the first name and last name in different capturing groups
i.e.
"Mark del cato".scan /^(\w+)[ .,](.+$)/
See the example on Rubular.
Or try
^[^ .]+
This will pick up the first word on a line, that is, everything up to the first dot or space.
Hope it helps; see the example at Rubular.
Try
(\w+)$
\w+ matches one or more 'word' characters.
The $ is a zero-length match matching the end of the string.
Do you want to be able to pull second names from a piece of text? That could get very difficult. Can you post an excerpt of your text?
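
For the three cases listed in the question, one possible pattern allows optional whitespace on either side of the dot and requires the initial to be a single uppercase letter. This is a sketch under that assumption, and drop_initial is a hypothetical name.

```ruby
# \b\p{Lu}  a single uppercase letter at a word boundary (the initial)
# \s*\.\s*  the dot, with optional whitespace on either side
# (\p{L}+)  the last name, which is all that is kept
def drop_initial(text)
  text.gsub(/\b\p{Lu}\s*\.\s*(\p{L}+)/, '\1')
end

p drop_initial("B.Obama")    # => "Obama"
p drop_initial("B. Obama")   # => "Obama"
p drop_initial("B . Obama")  # => "Obama"
```

In running text this would also fire on a sentence that ends with a single capital letter, such as "Plan B. Then", so it is best applied where initials are expected.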

Resources