How to select words with punctuation and exclude periods from abbreviations? - ruby

I have the following Ruby Regex that selects punctuation and excludes periods that are part of numbers:
/\p{L}+|(?!\.\d)[[:punct:]]/
The profit was 5.2 thousand dollars.
=> The profit was thousand dollars.
I also have a regex that selects abbreviations such as U.S.A.:
(?:[a-zA-Z]\.){2,}
The U.S.A. is located in North America.
=> U.S.A.
I would like to use the ideas behind these regexes so that I can select all of the words and punctuation in a sentence except for any periods in any abbreviation as:
The U.S.A. is located in North America!
=> The USA is located in North America!
Any ideas on how to accomplish this?

I think it should be done in 2 steps because you cannot match discontinuous parts of text with one matching iteration.
Use
s = 'The U.S.A. is located in North America!'
s = s.gsub(/\b(?:\p{L}\.){2,}/) { $~[0].gsub(".", "") }
puts s.scan(/\p{L}+|(?!\.\d)[[:punct:]]/)
See the Ruby demo
The first step is to run gsub with the \b(?:\p{L}\.){2,} pattern (I added a word boundary so that the pattern only matches single-letter chunks). Within the block, the dots are stripped from the match value using a literal text replacement.
The second step is running your first regex within a scan to collect the chunks you need.
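The two steps can also be wrapped in a small method. This is only a sketch: the method and constant names are made up for illustration, and the patterns are the ones already shown in this answer.

```ruby
# Hypothetical names; the two patterns come from the answer above.
ABBR_WITH_BOUNDARY = /\b(?:\p{L}\.){2,}/
WORD_OR_PUNCTUATION = /\p{L}+|(?!\.\d)[[:punct:]]/

def tokens_without_abbreviation_dots(sentence)
  # Step 1: drop the dots inside abbreviations such as "U.S.A."
  compacted = sentence.gsub(ABBR_WITH_BOUNDARY) { |abbr| abbr.delete(".") }
  # Step 2: collect words and punctuation (decimal points are skipped)
  compacted.scan(WORD_OR_PUNCTUATION)
end

p tokens_without_abbreviation_dots("The U.S.A. is located in North America!")
# => ["The", "USA", "is", "located", "in", "North", "America", "!"]
```

scan returns plain strings here because neither pattern contains a capturing group.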

str = "The U.S.A. have 50.1415 states approx and are located in North America!"
str.gsub(/(?<!\p{L}\p{L})\P{L}*\.[^\p{L}\s]*/, '').squeeze(' ')
#⇒ "The USA have states approx and are located in North America!"
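One detail worth checking when trying this: String#squeeze without an argument collapses every run of repeated characters, not only spaces, so a word like "approx" loses a letter. A small sketch (the method name is hypothetical) that compares the two calls:

```ruby
# Hypothetical helper wrapping the gsub from this answer.
def strip_dotted_parts(str)
  str.gsub(/(?<!\p{L}\p{L})\P{L}*\.[^\p{L}\s]*/, '')
end

cleaned = strip_dotted_parts("The U.S.A. have 50.1415 states approx and are located in North America!")

puts cleaned.squeeze       # every doubled character is collapsed, including "pp"
puts cleaned.squeeze(" ")  # only runs of spaces are collapsed
```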

I think this will be difficult with a regex alone; I'd be glad to be corrected with a working solution.
My solution:
Match the text that you don't want processed (the abbreviations) with your second regex first, and then apply the first regex (which selects words and punctuation). This effectively hides the abbreviations from the first regex.
I have a similar requirement for a project. The key is to use the partition method, loop through the regexes (two in your case), and make sure you don't apply a regex to text that was already "captured" by a previous regex in the loop.
You can use the SourceParser class from GitHub like this:
parser = SourceParser.new
parser.regexter('abbrs', /(?:[a-zA-Z]\.){2,}/) # return matched as is
parser.regexter(
  'first regex',
  /\p{L}+|(?!\.\d)[[:punct:]]/,
  lambda do |token, regexp|
    "(#{token})"
  end
)
parser.parse("The U.S.A. is located in North America")
# => (The) U.S.A. (is) (located) (in) (North) (America)
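The partition idea can also be sketched without any external class. The following is not SourceParser's implementation, only a standalone illustration built on String#partition; all method and constant names are hypothetical.

```ruby
# Patterns from the question; every name below is made up for this sketch.
ABBR_PATTERN = /(?:[a-zA-Z]\.){2,}/
TOKEN_PATTERN = /\p{L}+|(?!\.\d)[[:punct:]]/

# Split text into [segment, protected] pairs, where protected segments
# are the parts matched by `pattern`.
def split_protected(text, pattern)
  segments = []
  rest = text
  until rest.empty?
    before, match, after = rest.partition(pattern)
    segments << [before, false] unless before.empty?
    break if match.empty? # no further match: `before` was the remainder
    segments << [match, true]
    rest = after
  end
  segments
end

# Apply the second regex only to the unprotected segments.
def wrap_tokens(text)
  split_protected(text, ABBR_PATTERN).map do |segment, is_protected|
    is_protected ? segment : segment.gsub(TOKEN_PATTERN) { |token| "(#{token})" }
  end.join
end

puts wrap_tokens("The U.S.A. is located in North America")
# => (The) U.S.A. (is) (located) (in) (North) (America)
```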

Related

Finding and Editing Multiple Regex Matches on the Same Line

I want to add markdown to key phrases in a (gollum) wiki page that will link to the relevant wiki page in the form:
This is the key phrase.
Becomes
This is the [[key phrase|Glossary#key phrase]].
I have a list of key phrases such as:
keywords = ["golden retriever", "pomeranian", "cat"]
And a document:
Sue has 1 golden retriever. John has two cats.
Jennifer has one pomeranian. Joe has three pomeranians.
I want to iterate over every line and find every match (that isn't already a link) for each keyword. My current attempt looks like this:
File.foreach(target_file) do |line|
  glosses.each do |gloss|
    len = gloss.length
    # Create the regex. Avoid anything that starts with [
    # or (, ends with ] or ), and ignore case.
    re = /(?<![\[\(])#{gloss}(?![\]\)])/i
    # Find every instance of this gloss on this line.
    positions = line.enum_for(:scan, re).map { Regexp.last_match.begin(0) }
    positions.each do |pos|
      line.insert(pos, "[[")
      # +2 because we just inserted 2 ahead.
      line.insert(pos+len+2, "|#{page}\##{gloss}]]")
    end
  end
  puts line
end
However, this will run into a problem if there are two matches for the same key phrase on the same line. Because I insert things into the line, the position I found for each match isn't accurate after the first one. I know I could adjust for the size of my insertions every time but, because my insertions are a different size for each gloss, it seems like the most brute-force, hacky solution.
Is there a solution that allows me to make multiple insertions on the same line at the same time without several arbitrary adjustments each time?
After looking at @BryceDrew's online Python version, I realized Ruby probably also has a way to fill in the match. I now have a much more concise and faster solution.
First, I needed to make regexes of my glosses:
glosses.push(/(?<![\[\(])#{gloss}(?![\]\)])/i)
Note: The majority of that regex is look-ahead and look-behind assertions to prevent catching a phrase that's already part of a link.
Then, I needed to make a union of all of them:
re = Regexp.union(glosses)
After that, it's as simple as doing gsub on every line, and filling in my matches:
File.foreach(target_file) do |line|
  line = line.gsub(re) {|match| "[[#{match}|Glossary##{match.downcase}]]"}
  puts line
end
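For trying this out without a file, here is a self-contained sketch of the same approach on in-memory strings. The method name and the Regexp.escape call are my additions (escaping only matters if a key phrase contains regex metacharacters); the keywords are the ones from the question.

```ruby
KEYWORDS = ["golden retriever", "pomeranian", "cat"]

# Hypothetical helper: link every keyword that is not already part of a link.
def link_glossary(line, keywords, page = "Glossary")
  patterns = keywords.map do |keyword|
    /(?<![\[\(])#{Regexp.escape(keyword)}(?![\]\)])/i
  end
  re = Regexp.union(patterns)
  # gsub builds a new string, so earlier insertions never shift later matches.
  line.gsub(re) { |match| "[[#{match}|#{page}##{match.downcase}]]" }
end

puts link_glossary("Joe has a cat and another cat.", KEYWORDS)
# => Joe has a [[cat|Glossary#cat]] and another [[cat|Glossary#cat]].
```

Because gsub handles all matches in a single pass, two occurrences of the same key phrase on one line need no position bookkeeping.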

How to match full words and not substrings in Ruby

This is my code
stopwordlist = "a|an|all"
File.open('0_9.txt').each do |line|
  line.downcase!
  line.gsub!( /\b#{stopwordlist}\b/,'')
  File.open('0_9_2.txt', 'w') { |f| f.write(line) }
end
I want to remove the words a, an, and all.
Instead, the regex also matches substrings and removes them.
For example, given the input:
Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life
I get the output -
bromwell high is cartoon comedy. it r t the same time s some other programs bout school life
As you can see, it matched the substring.
How do I make it just match the word and not substrings ?
The | operator in regex takes the widest scope possible. Your original regex matches either \ba or an or all\b.
Change the whole regex to:
/\b(?:#{stopwordlist})\b/
Or change stopwordlist into a regex instead of a string; an interpolated Regexp is embedded as its own group, so the alternation stays contained:
stopwordlist = /a|an|all/
Even better, you may want to use Regexp.union.
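A minimal sketch of the Regexp.union variant, assuming the stop words are kept in an array (the constant and method names are hypothetical):

```ruby
STOPWORDS = %w[a an all]

# Regexp.union builds /a|an|all/ and escapes any metacharacters in the words.
# Interpolating a Regexp embeds it as its own group, so \b applies to
# every alternative rather than only the first and last.
STOPWORD_PATTERN = /\b#{Regexp.union(STOPWORDS)}\b/

def remove_stopwords(line)
  line.downcase.gsub(STOPWORD_PATTERN, "")
end

puts remove_stopwords("Bromwell High is a cartoon comedy. It ran at the same time")
```

Note that removing a word leaves its surrounding spaces behind, so the result contains a double space where "a" used to be.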
Try this, which puts word boundaries around each alternative:
\ba\b|\ban\b|\ball\b

How to extract first and last name from full name

I have a regex that, given the full name, is supposed to capture the first and last name. It should exclude the suffix, like "Jr.":
(.+)\s(.+(?!\sJr\.))
But this regex applied against the string Larry Farry Barry Jones Jr. gives the match:
1. Larry Farry Barry Jones
2. Jr.
Why is my negative lookahead failing to ignore the "Jr." when parsing the full name? I want match #2 to contain "Jones".
Rather than trying to do it with a single regex, I think the following would be a more maintainable code.
full_name = "Larry Farry Barry Jones Jr."
name_parts = full_name.split - ["Jr."]
first_name, last_name = name_parts[0], name_parts[-1]
As a comment mentions, it is the first .+ that matches most of the string. The use of a lookahead seems incorrect here, as you do not want to return that value and do not need it to be included in a further match.
The following will split all words up but not return the 'Jr.' So you could take the first and last result.
(\w+\s)+?(?!\sJr\.)
I recommend Rubular for practicing Ruby RegExp.
The reason is that your second .+ matches all the way to the end of the string, and only then is the lookahead evaluated. Nothing follows at that point (we are already at the end), so no "Jr." is ahead and the match succeeds.
The pattern itself is the problem. This would be better:
\S+(?:\s(?!Jr\.)\S+)*
See it here on Regexr
Means:
\S+ match a series of at least one non whitespace character.
(?:\s(?!Jr\.)\S+)* Non capturing group: Match a whitespace and then, if it is not "Jr.", match the next series of non whitespace characters. This complete group can be repeated 0 or more times.
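To see both behaviours side by side, here is a short sketch. The helper name is hypothetical, and taking the first and last word of the match is my assumption about how the result would be used.

```ruby
NAME_WITHOUT_SUFFIX = /\S+(?:\s(?!Jr\.)\S+)*/

# Hypothetical helper: first and last word of the name, suffix excluded.
def first_and_last(full_name)
  parts = full_name[NAME_WITHOUT_SUFFIX].to_s.split
  [parts.first, parts.last]
end

# The original pattern: the second .+ runs to the end of the string,
# where the negative lookahead trivially succeeds.
p "Larry Farry Barry Jones Jr.".match(/(.+)\s(.+(?!\sJr\.))/).captures
# => ["Larry Farry Barry Jones", "Jr."]

p first_and_last("Larry Farry Barry Jones Jr.")
# => ["Larry", "Jones"]
```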

Regular expression match that excludes characters inside parenthesis

I have the following types of strings.
BILL SMITH (USA)
WINTHROP (FR)
LORD AT WAR (GB)
KIM SMITH
With these strings, I have the following constraints:
1. all caps
2. can be 2 to 18 characters long
3. should not have any white spaces or carriage returns at the end
4. the country abbreviation inside the parens should be excluded
5. some of the names will not have the country in parens and they should be matched too
After applying my regular expression I'd like to get the following:
BILL SMITH (USA) => BILL SMITH
WINTHROP (FR) => WINTHROP
LORD AT WAR (GB) => LORD AT WAR
KIM SMITH => KIM SMITH
I came up with the following regular expression but I'm not getting any matches:
String.scan(\([A-Z \s*]{1,18})(^?!(\([A-Z]{1,3}\)))\)
I've been banging my head on this for a while, so if anyone can point out the error I'd appreciate it.
UPDATE:
I've gotten some great responses; however, so far none of the regular expression solutions has met all the constraints. The tricky part seems to be that some of the strings have the country in parentheses and some don't. In one case, strings without the country were not matched, and in another the correct string was returned along with the country abbreviation without the parentheses. (See the comments on the second response.) One point of clarification: every match will begin at the start of the string. Not sure if that helps or not. Thanks again for all your help.
Here's one solution:
^((?:[A-Z]|\s){2,18}+?)(?:\s\([A-Z]+\))?$
See it on Rubular. Note that it counts 18 characters before the parenthesis; I'm not sure how you want it to behave specifically. If you want to make sure the whole line isn't more than 18 characters, I suggest you just do unless line.length < 18 ... Similarly, if you want to make sure there is no whitespace at the end, I recommend you use line.strip. That'll greatly reduce the complexity of the Regexp you need and make your code more readable.
Edit: also works when no parentheses are used after the name.
The biggest error is that you wrote (^?!...) where you meant (?=...). The former means "an optional start-of-line anchor, followed by !, followed by ..., inside a capture group"; the latter means "a position in the string that is followed by ...". Fixing that, as well as making a few other tweaks, and adding the requirement that the initial string end with a letter, we get:
([A-Z\s]{1,17}[A-Z])(?=\s*\([A-Z]{1,3}\))
Update based on OP comments: Since this will always match at the start of a string, you can use \A to "anchor" your pattern to the start of the string. You can then get rid of the lookahead assertion. This:
\A[A-Z][A-Z\s]{0,16}[A-Z]
matches start-of-string, followed by an uppercase letter, followed by up to 16 characters that are either uppercase letters or whitespace characters, followed by an uppercase letter.
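Applied to the sample strings from the question, the anchored pattern can be tried like this (a sketch; the constant and method names are made up):

```ruby
NAME_AT_START = /\A[A-Z][A-Z\s]{0,16}[A-Z]/

# Hypothetical helper: returns the leading name, or nil when the string
# does not start with 2 to 18 uppercase letters and whitespace characters.
def leading_name(str)
  str[NAME_AT_START]
end

["BILL SMITH (USA)", "WINTHROP (FR)", "LORD AT WAR (GB)", "KIM SMITH"].each do |s|
  puts "#{s} => #{leading_name(s)}"
end
```

Because the pattern must end on an uppercase letter, the space before the opening parenthesis is never part of the match.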
You can also just use gsub to remove the part(s) you don't want. To remove everything in parentheses you could do:
str.gsub(/\s*\([^)]*\)/, '')
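A quick sketch of the gsub route on the question's sample data. The method name is hypothetical, and the trailing strip is an addition to cover the "no whitespace at the end" constraint; this version does not check the all-caps or length constraints.

```ruby
# Hypothetical helper: remove any parenthesized part plus the whitespace before it.
def without_parenthesized(str)
  str.gsub(/\s*\([^)]*\)/, '').strip
end

["BILL SMITH (USA)", "WINTHROP (FR)", "LORD AT WAR (GB)", "KIM SMITH"].each do |s|
  puts "#{s} => #{without_parenthesized(s)}"
end
```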

Ruby regex: remove first name, leave last name

I am parsing a text and I want to ignore people's first names.
Examples (cases):
B.Obama => Obama
B. Obama => Obama
B . Obama => Obama
I managed to write this working Ruby regex:
"B.Obama".gsub(/\p{L}+\.(\p{L}+)/, '\\1')
However, it solves only one case. Also, it doesn't check whether the first letter is a capital.
So, what should a regex that combines all these cases look like?
Details: Ruby 1.9.2 and UTF-8 strings.
I gave it a little more thought and I like this better:
/^(\w+)[ .,](.+$)/
This captures the first name and the last name in separate capturing groups, e.g.:
"Mark del cato".scan /^(\w+)[ .,](.+$)/
See the example on Rubular.
Or Try
^[^ .]+
This picks up the first word on a line, that is, everything up to the first dot or space.
Hope it helps; see the example on Rubular.
Try
(\w+)$
\w+ matches one or more 'word' characters.
The $ is a zero-length match matching the end of the string.
Do you want to be able to pull second names from a piece of text? That could get very difficult. Can you post an excerpt of your text?
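For the three cases listed in the question, one possible pattern is sketched below. It is an assumption of mine and not taken from any of the answers: it removes a single uppercase initial, the dot, and any spaces around the dot, but only when a capitalized word follows. The last line shows the (\w+)$ suggestion for comparison.

```ruby
# Hypothetical pattern: an uppercase initial, a dot with optional spaces
# around it, and then (looked ahead, not consumed) a capitalized word.
LEADING_INITIAL = /\b\p{Lu}\s*\.\s*(?=\p{Lu}\p{L}+)/

def drop_initial(text)
  text.gsub(LEADING_INITIAL, "")
end

["B.Obama", "B. Obama", "B . Obama"].each do |name|
  puts "#{name} => #{drop_initial(name)}"
end

# The end-anchored suggestion returns the last word of the string.
puts "B . Obama"[/(\w+)$/, 1]
```

The lookahead requires at least two letters after the dot, so dotted abbreviations such as U.S.A. are left alone.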

Resources