My plan:
Get everything between Send To: and the end of that line.
Get everything between Attn: and the end of that line.
NOTE: The Attn line could be optional. In that case, just return the first line.
The string looks like this:
str = <<-MSG
Registry of Credit Recommendations
American Council on Education
One Dupont Circle, NW
Washington, D.C. 20036
Transcript Print Date: 10/03/2018
Sent By:Send To: American University
4400 Massachusetts Avenue, NW
Washington, DC 20016-8001
Attn: Undergraduate Admissions
Jonathan A Jones
30 People's Court
Second Address Line
Third Address Line
Augusta, GA 30909
MSG
Expected return value must be:
American University
Attn: Undergraduate Admissions
**Note that the "Attn: " prefix must be included, not just the text after it.**
Here is my approach, which only works for the Attn part, but I have no idea how to get the "American University" part.
regex = /Attn:([^\r\n]+)[\r\n]+/
Test: http://rubular.com/r/Px4ru6WrAg
Appreciate your help.
You could use an alternation
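(?<=Send To: ).*|Attn:.*
(?<=Send To: ) Positive lookbehind asserting that what is on the left is "Send To: " (including the trailing space, so the match does not start with a space). Then .* matches any character except a newline, zero or more times
| or
Attn:.* Match Attn: followed by any character except a newline, zero or more times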
Regex demo
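As a quick check of the alternation in Ruby, here is a minimal sketch using String#scan. The sample text is a shortened copy of the input from the question, the names text and SEND_TO_OR_ATTN are only illustrative, and the lookbehind is written with a trailing space (an adjustment assumed here) so that the result does not start with a space.

```ruby
# Shortened copy of the sample input from the question.
text = <<~MSG
  Transcript Print Date: 10/03/2018
  Sent By:Send To: American University
  4400 Massachusetts Avenue, NW
  Washington, DC 20016-8001
  Attn: Undergraduate Admissions
  Jonathan A Jones
MSG

# Either the rest of the line after "Send To: ", or a whole "Attn: ..." run.
SEND_TO_OR_ATTN = /(?<=Send To: ).*|Attn:.*/

matches = text.scan(SEND_TO_OR_ATTN)
p matches
#=> ["American University", "Attn: Undergraduate Admissions"]

# When the optional Attn line is missing, only the first value comes back.
without_attn = "Sent By:Send To: American University\nJonathan A Jones\n"
p without_attn.scan(SEND_TO_OR_ATTN)
#=> ["American University"]
```

Because . does not match a newline by default in Ruby, each alternative stops at the end of its own line.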
Note that you don't have to use a regular expression.
str.each_line.
  map do |line|
    case
    when line.include?("Send To: ")
      line[line.index("Send To: ") + "Send To: ".size..-2]
    when line.include?("Attn: ")
      line[line.index("Attn: ")..-2]
    else
      nil
    end
  end.compact
#=> ["American University", "Attn: Undergraduate Admissions"]
-2 excludes the newline character that ends each line.
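The -2 index relies on every matching line ending in exactly one newline character. A possible variant, sketched here under the assumption of Ruby 2.7 or later (for filter_map and endless ranges), strips the line ending first so that "\r\n" endings and a final line without a newline are handled as well. The method name extract_recipient_lines is made up for this illustration.

```ruby
# Hypothetical helper: returns the text after "Send To: " and any "Attn: ..." line.
def extract_recipient_lines(text)
  # chomp: true removes "\n" or "\r\n" from each line before we look at it.
  text.each_line(chomp: true).filter_map do |line|
    if (i = line.index("Send To: "))
      line[(i + "Send To: ".size)..]  # everything after the marker
    elsif (i = line.index("Attn: "))
      line[i..]                       # keep the "Attn: " prefix
    end                               # other lines yield nil and are dropped
  end
end

p extract_recipient_lines("Sent By:Send To: American University\r\nAttn: Undergraduate Admissions")
#=> ["American University", "Attn: Undergraduate Admissions"]
```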
I am trying to extract possible author names from an article. I am working under the assumption that the author name is in a byline
"By FirstName LastName"
or
"By FirstName MiddleName LastName"
and the first, middle and last names all start with a capital letter.
How can I use a regex to extract all 2-3 word strings that follow "By", that also meet the above conditions?
For instance, if the article has the text
"By Barack Obama on January 20th 2017. By January 2017, we all know Obama will no longer be the president"
it would extract
"Barack Obama"
and
"January"
as possible author names, and I will then do the work of determining which is the right one.
Currently my regex is:
/By ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)/
However, when I use this on the string
"By Alex Jackson Olerud"
it seems to return both
"Alex Jackson Olerud"
and
" Olerud"
I am using Ruby as my preferred language, but any language-agnostic solution would suffice.
Here's my suggestion:
str = "By Barack Obama on January 20th 2017. By January 2017, we all know Obama will no longer be the president.
By A. B. Cecil"
def find_authors(str)
  str.scan(/
    (?<name>                        # a named capture group for one of the names
      \p{Lu}                        # starts with an upper case letter, unicode so will work also for e.g. Åsa
      (?: \. | \p{Ll}+)             # followed by a period or some lower case letters
    ){0}                            # zero matches, this is just a subroutine to be used again
    (?<=[Bb]y\s)                    # lookbehind to make sure the author is after a by or By
    (?<wholename>                   # capture group to extract the whole name
      \g<name> (\s \g<name>){1,2}   # a name should have at least two components
    )/x).map(&:last)                # remove the match by the <name> group from the result
end
def find_authors_oneline(str)
  str.scan(/(?<name>\p{Lu}(?:\.|\p{Ll}+)){0}(?<=[Bb]y\s)(?<wholename>\g<name>(\s\g<name>){1,2})/).map(&:last)
end
p find_authors str
>> ["Barack Obama", "A. B. Cecil"]
p find_authors_oneline str
>> ["Barack Obama", "A. B. Cecil"]
You can read more about regex subroutines and the /x (extended) modifier in the Ruby Regexp documentation.
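To see the subroutine trick in isolation, here is a small sketch. The group name word and the constant CAPITALIZED_RUN are invented for this example; the idea is the same as above: define a group with {0} so it matches nothing where it stands, then call it with \g<word>.

```ruby
# (?<word>...){0} only defines the sub-pattern; \g<word> re-runs it.
CAPITALIZED_RUN = /(?<word>\p{Lu}\p{Ll}+){0}\g<word>(?:\s\g<word>)+/

# String#[] with a regex returns the whole match (or nil).
p "met Ada Lovelace today"[CAPITALIZED_RUN]
#=> "Ada Lovelace"

p "no capitalized pair here"[CAPITALIZED_RUN]
#=> nil
```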
I think the second capture group (\s+[A-Z][\w-]*) is throwing you off. Try using a non-capture group like (?:\s+[A-Z][\w-]*)
str = "By Barack Obama on January 20th 2017. By January 2017, we all know Obama will no longer be the president"
str.scan(/(?:By )((?:[A-Z][A-Za-z]+ ?+)+)/).flatten.map(&:strip)
#=> ["Barack Obama", "January"]
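The capture-group issue from the question can be reproduced directly. When a pattern contains groups, String#scan returns one array of group values per match, so the repeated inner group contributes its last repetition as a second element. The sketch below compares the original pattern with the non-capturing version; the variable names are illustrative.

```ruby
byline = "By Alex Jackson Olerud"

# Original pattern: the inner (\s+...) group is captured too.
capturing = byline.scan(/By ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)/)
p capturing
#=> [["Alex Jackson Olerud", " Olerud"]]

# Same pattern with the inner group made non-capturing.
non_capturing = byline.scan(/By ([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)/).flatten
p non_capturing
#=> ["Alex Jackson Olerud"]
```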
First of all, full disclosure, I am working on a homework assignment. The example I'm giving is not the exact problem, but will help me understand what I need to do. I'm not looking for a spoon-fed answer but to understand what is going on.
I am trying to take a string such as:
"The Civil War started in 1861."
"The American Revolution started in 1775."
In this example I would like to return the same string, but with the appropriate century in parentheses after it:
"The Civil War started in 1861. (Nineteenth Century)"
"The American Revolution started in 1775. (Eighteenth Century)"
I am able to group what I need using the following regex
text.gsub!(/([\w ]*)(1861|1775).?/, '\1\2 (NOT SURE HERE)')
It would be easy using grouping to say if \2 == 1861 append appropriate century, but the specifications say no if statements may be used and I am very lost. Also, the alternation I used in this example only works for the 2 years listed and I know that a better form of range-matching would have to be used to catch full centuries as opposed to those 2 single years.
Firstly - how to remove the hardcoding of the years:
text.gsub!(/([\w ]*)([012]\d{3}).?/, '\1\2 (NOT SURE HERE)')
This should handle things for the next ~1k years. If you know for a fact that the dates are restricted to given periods, you can be more specific.
For the other part - the century is just the first two digits plus one. So split the year in two and increment.
text.gsub(/[\w ]*([012]\d)\d\d.?/) do |sentence|
"#{sentence} (#{$1.next}th Century)"
end
Note the usage of String#gsub with block due to the fact that we need to perform a transformation on one of the matched groups.
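One caveat with "first two digits plus one": for a year ending in 00 it gives the next century (1900 would become 20), whereas the usual convention puts 1900 in the 19th century. A possible sketch that does the arithmetic on the whole year instead is shown below. The helper names century_of and annotate_century are invented for this example, and the "th" suffix is kept hard-coded as in the snippet above.

```ruby
# Century under the usual convention: 1801..1900 is the 19th century.
def century_of(year)
  (year - 1) / 100 + 1
end

# Appends "(NNth Century)" after each four-digit year (and its period, if any).
def annotate_century(sentence)
  sentence.gsub(/\b(\d{4})\b\.?/) do |match|
    "#{match} (#{century_of($1.to_i)}th Century)"
  end
end

puts annotate_century("The Civil War started in 1861.")
# The Civil War started in 1861. (19th Century)
puts annotate_century("The American Revolution started in 1775.")
# The American Revolution started in 1775. (18th Century)
```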
Update: if you want the centuries to be in words, you could use an array to store them.
ordinals = %w(
First Second Third Fourth Fifth Sixth Seventh Eighth Ninth Tenth Eleventh
Twelfth Thirteenth Fourteenth Fifteenth Sixteenth Seventeenth Eighteenth
Nineteenth Twentieth Twenty-First
)
text.gsub(/[\w ]*([012]\d)\d\d.?/) do |sentence|
"#{sentence} (#{ordinals[$1.to_i]} Century)"
end
Update (2): Assuming you want to replace something completely different and you can't take advantage of number niceties like in the centuries example, implement the same general idea, just use a hash instead of array:
replacements = {'cat' => 'king', 'mat' => 'throne'}
"The cat sat on the mat.".gsub(/^(\w+ )(\w+)([\w ]+ )(\w+)\.$/) do
"#{$1}#{replacements[$2]}#{$3}#{replacements[$4]}."
end
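For plain word-for-word substitutions, String#gsub also accepts a hash as its second argument and looks up each matched text in it, so no block and no if statement is needed. A minimal sketch, with REPLACEMENTS and WORDS as illustrative names:

```ruby
REPLACEMENTS = { 'cat' => 'king', 'mat' => 'throne' }.freeze

# Build /\b(?:cat|mat)\b/ from the hash keys so only whole words match.
WORDS = /\b#{Regexp.union(REPLACEMENTS.keys)}\b/

result = "The cat sat on the mat.".gsub(WORDS, REPLACEMENTS)
puts result
# The king sat on the throne.
```

This only works when the replacement depends on the matched text alone; anything that needs computation on a group still calls for the block form.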
Assuming the year is between 1 and 2099, you might do it as follows.
YEAR_TO_CENTURY = (1..21).to_a.zip(%w| First Second Third Fourth Fifth Sixth
Seventh Eighth Ninth Tenth Eleventh Twelfth Thirteenth Fourteenth Fifteenth
Sixteenth Seventeenth Eighteenth Nineteenth Twentieth Twentyfirst | ).to_h
#=> { 1=>"First", 2=>"Second", 3=>"Third", 4=>"Fourth", 5=>"Fifth", 6=>"Sixth",
# 7=>"Seventh", 8=>"Eighth", 9=>"Ninth", 10=>"Tenth", 11=>"Eleventh",
# 12=>"Twelfth", 13=>"Thirteenth", 14=>"Fourteenth", 15=>"Fifteenth",
# 16=>"Sixteenth", 17=>"Seventeenth", 18=>"Eighteenth", 19=>"Nineteenth",
# 20=>"Twentieth", 21=>"Twentyfirst" }
def centuryize(str)
str << " (%s Century)" % YEAR_TO_CENTURY[(str[/\d+(?=\.)/].to_i/100.0).ceil]
end
centuryize "The American Revolution started in 1775."
#=> "The American Revolution started in 1775. (Eighteenth Century)"
centuryize "The Battle of Hastings took place in 1066."
#=> "The Battle of Hastings took place in 1066. (Eleventh Century)"
centuryize "Nero played the fiddle while Rome burned in AD 64."
#=> "Nero played the fiddle while Rome burned in AD 64. (First Century)"
It would be easier if we could write "19th" century.
def centuryize(str)
century = (str[/\d+(?=\.)/].to_i/100.0).ceil
suffix =
case century
when 1, 21 then "st"
when 2 then "nd"
when 3 then "rd"
else "th"
end
"%s (%d%s Century)" % [str, century, suffix]
end
centuryize "The American Revolution started in 1775."
# => "The American Revolution started in 1775. (18th Century)"
centuryize "The Battle of Hastings took place in 1066."
#=> "The Battle of Hastings took place in 1066. (11th Century)"
centuryize "Nero played the fiddle while Rome burned in AD 64."
#=> "Nero played the fiddle while Rome burned in AD 64. (1st Century)"
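The case statement above is sufficient for centuries 1 through 21. If later centuries ever matter, one possible generalization (not part of the answer above; ordinal_suffix is a made-up helper name) picks the suffix from the last digit and treats 11, 12 and 13 as exceptions:

```ruby
# English ordinal suffix for a positive integer.
def ordinal_suffix(n)
  return "th" if (11..13).cover?(n % 100)  # 11th, 12th, 13th, 111th, ...

  case n % 10
  when 1 then "st"
  when 2 then "nd"
  when 3 then "rd"
  else "th"
  end
end

puts [1, 2, 3, 11, 18, 21, 22].map { |n| "#{n}#{ordinal_suffix(n)}" }.join(", ")
# 1st, 2nd, 3rd, 11th, 18th, 21st, 22nd
```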
I am trying to write a script that parses the filename of a comic book and extracts info such as the series name, publication year, etc. In this case, I am trying to extract the publication year from the name. Given the names below, I need to match and get the value 2003. This is the expression I have:
r = %r{ (?i)(^|[,\s-_])v(\d{4})($|[,\s-_]) }
However, this matches the number irrespective of what character I have before the v or after the number.
I expect the first two to not match and the third to match.
010 - All Star Batman & Robin The Boy Wonder 01 - av2003
010 - All Star Batman & Robin The Boy Wonder 01 - v2003t
010 - All Star Batman & Robin The Boy Wonder 01 - v2003
What am I doing wrong in this case?
Inside character classes (i.e. []) the - character has a special meaning when it's between two other characters: it creates a range starting at the character before and ending at the character after.
Here, you want it literally, so you should either escape the - or (more idiomatically in regex) put it as the first or last character in the [].
Also, you have literal space characters in the pattern but no /x modifier, and you probably don't want to capture what's before and after the year, so the final pattern would be:
%r{(?i)(?:^|[,\s_-])v(\d{4})(?:$|[,\s_-])}
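A short sketch of both points: how a hyphen between two characters forms a range while a trailing hyphen is literal, and how the corrected pattern behaves on the three filenames from the question. The constant name YEAR is illustrative, and the case-insensitive flag is written as a trailing i here instead of the inline (?i).

```ruby
# Hyphen between two characters: a range (a, b, c).
p "b".match?(/[a-c]/)   #=> true
# Hyphen last in the class: a literal "-", so only a, c or - match.
p "b".match?(/[ac-]/)   #=> false
p "-".match?(/[ac-]/)   #=> true

YEAR = /(?:^|[,\s_-])v(\d{4})(?:$|[,\s_-])/i

names = [
  "010 - All Star Batman & Robin The Boy Wonder 01 - av2003",
  "010 - All Star Batman & Robin The Boy Wonder 01 - v2003t",
  "010 - All Star Batman & Robin The Boy Wonder 01 - v2003"
]

# String#[] with a regex and a group index returns that group, or nil.
years = names.map { |name| name[YEAR, 1] }
p years
#=> [nil, nil, "2003"]
```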
@smathy answered your question (rather nicely). I want to point out that you could write your regex without a capture group:
r = /
(?: # begin a non-capture group
^|[,\s_-] # match the beginning of the string, a ws char or char in ',_-'
) # end the non-capture group
v # match v
\K # forget everything matched so far
\d{4} # match 4 digits
(?= # begin a positive look-ahead
$|[,\s_-] # match the end of the string, a ws char or char in ',_-'
) # end positive lookahead
/x
"010 - All Star Batman & Robin The Boy Wonder 01 - av2003"[r]
#=> nil
"010 - All Star Batman & Robin The Boy Wonder 01 - v2003t"[r]
#=> nil
"010 - All Star Batman & Robin The Boy Wonder 01 - v2003"[r]
#=> "2003"
If you wish to match v or V, change the line v to [vV].
If you wish the regex to be case independent, change /x to /ix (in which case there is no need to replace v with [vV]).
If you wish to ensure the publication date is (say) in the 20th or 21st century, change \d{4} to [12]\d{3}.
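You could alternatively change the non-capture group to a positive lookbehind and delete \K, provided the v is moved inside the lookbehind as well (for example (?<=^v|[,\s_-]v)); otherwise the v becomes part of the match.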
Any time a string contains a capital letter followed by a period, I'd like to replace the capital letter and period with just the capital letter.
Today MR. Johnson walked to the mail box.
=> Today MR Johnson walked to the mail box.
William SR. won the race.
=> William SR won the race.
I tried to accomplish this using gsub:
MyText = "William SR. won the race."
MyText = MyText.gsub(/[A-Z]\./, **I DON'T KNOW WHAT TO PUT HERE**)
I can match the capital letter followed by the period, but I can't figure out how to replace my match with the capital letter that precedes the period.
Another way, without lookaround and using a capture group:
MyText = MyText.gsub(/([A-Z])\./,'\1')
You should use a positive look behind to match it and replace it with nothing.
MyText = "William SR. won the race."
MyText = MyText.gsub(/(?<=[A-Z])\./, '')
Here is an example of it on Rubular. You could just use gsub! if you know you want to do the replacement in place instead of making a copy.
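Both suggestions can be compared side by side. The sketch below uses the two sentences from the question; the variable names are illustrative. It also shows one thing to keep in mind with gsub!: it returns nil when nothing was replaced, so its return value should not be chained or assigned blindly.

```ruby
samples = [
  "Today MR. Johnson walked to the mail box.",
  "William SR. won the race."
]

# Capture the capital letter and put it back without the period.
with_capture = samples.map { |s| s.gsub(/([A-Z])\./, '\1') }

# Match only the period that follows a capital letter and delete it.
with_lookbehind = samples.map { |s| s.gsub(/(?<=[A-Z])\./, '') }

p with_capture
#=> ["Today MR Johnson walked to the mail box.", "William SR won the race."]
p with_capture == with_lookbehind
#=> true

# gsub! modifies the receiver in place and returns nil if nothing changed.
unchanged = "no capitals here.".dup
p unchanged.gsub!(/(?<=[A-Z])\./, '')
#=> nil
```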
I am trying to extract a US address from a text.
So if I have the following variations of text then I'd like to extract the address portion
Today is a good day to meet up at a
bar. the address is 123 fake street,
NY, 23423-3423
just came from 423 Elm Street, kk, 34223 ...had awesome time
blah blah bleh blah 23414 Fake Terrace, MM something else
experimented my teleporter to get to work but reached at 2423 terrace NY
If someone can provide some starting points then I can mold it for other variations.
At some point, you'd have to clarify what you consider an address to be.
Does an address just have a street number and street name?
Does an address have a street name, and a city name?
Does an address have a city name, a state name?
Does an address have a city name, a state abbreviation, and a zip code? What format is the zip code in?
It's easy to see how you can run into trouble quickly.
This obviously wouldn't catch everything, but you could match strings that start with a street number, have a state abbreviation somewhere in the middle, and end in a zip code. The reliability of this depends heavily on what sort of text you use as input: if the text contains many other numbers, the pattern could be useless.
possible regex
\d+.+(?=AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY)[A-Z]{2}[, ]+\d{5}(?:-\d{4})?
sample input
hello world this is me posting an address. please go to 312 N whatever st., New York NY 10001.
If you can find me there. I might be at 123 Invalid address.
Please send all letters to 115A Address Street, Suite 100, Google KS, 66601
42 NE Another Address, Some City with 9 digit zip, AK 55555-2143
Hope this helps!
matches
312 N whatever st., New York NY 10001
115A Address Street, Suite 100, Google KS, 66601
42 NE Another Address, Some City with 9 digit zip, AK 55555-2143
regex explanation
\d+                digits (0-9), 1 or more times, matching as many as possible
.+                 any character except \n, 1 or more times, matching as many as possible
(?=                look ahead to see if there is:
  AL|AK|AS|...       'AL', 'AK', 'AS', ... (valid state abbreviations)
)                  end of look-ahead
[A-Z]{2}           any character from 'A' to 'Z', 2 times
[, ]+              ',' or ' ', 1 or more times, matching as many as possible
\d{5}              digits (0-9), 5 times
(?:                group, but do not capture (optional):
  -                  '-'
  \d{4}              digits (0-9), 4 times
)?                 end of grouping
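In Ruby, the pattern can be tried against the sample input with String#scan. This is only a sketch of the starting point described above: STATES and ADDRESS are illustrative names, and the abbreviation list is copied from the regex so the pattern can be assembled by interpolation.

```ruby
STATES = %w[
  AL AK AS AZ AR CA CO CT DE DC FM FL GA GU HI ID IL IN IA KS KY LA ME MH MD
  MA MI MN MS MO MT NE NV NH NJ NM NY NC ND MP OH OK OR PW PA PR RI SC SD TN
  TX UT VT VI VA WA WV WI WY
].freeze

# Street number, anything on the same line, a state abbreviation, then a zip code.
ADDRESS = /\d+.+(?=#{STATES.join("|")})[A-Z]{2}[, ]+\d{5}(?:-\d{4})?/

sample = <<~TEXT
  hello world this is me posting an address. please go to 312 N whatever st., New York NY 10001.
  If you can find me there. I might be at 123 Invalid address.
  Please send all letters to 115A Address Street, Suite 100, Google KS, 66601
  42 NE Another Address, Some City with 9 digit zip, AK 55555-2143
TEXT

addresses = sample.scan(ADDRESS)
puts addresses
# 312 N whatever st., New York NY 10001
# 115A Address Street, Suite 100, Google KS, 66601
# 42 NE Another Address, Some City with 9 digit zip, AK 55555-2143
```

Since . does not cross line breaks by default, an address split over several lines (as in the first example of the question) would need the text to be joined or the pattern to be adapted first.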