First of all, full disclosure, I am working on a homework assignment. The example I'm giving is not the exact problem, but will help me understand what I need to do. I'm not looking for a spoon-fed answer but to understand what is going on.
I am trying to take a string such as:
"The Civil War started in 1861."
"The American Revolution started in 1775."
In this example I would like to return the same string, but with the appropriate century in parentheses after it:
"The Civil War started in 1861. (Nineteenth Century)"
"The American Revolution started in 1775. (Eighteenth Century)"
I am able to group what I need using the following regex
text.gsub!(/([\w ]*)(1861|1775).?/, '\1\2 (NOT SURE HERE)')
It would be easy using grouping to say if \2 == 1861 append appropriate century, but the specifications say no if statements may be used and I am very lost. Also, the alternation I used in this example only works for the 2 years listed and I know that a better form of range-matching would have to be used to catch full centuries as opposed to those 2 single years.
Firstly - how to remove the hardcoding of the years:
text.gsub!(/([\w ]*)([012]\d{3}).?/, '\1\2 (NOT SURE HERE)')
This should handle things for the next ~1k years. If you know for a fact that the dates are restricted to given periods, you can be more specific.
For the other part - the century is just the first two digits plus one. So split the year in two and increment.
text.gsub(/[\w ]*([012]\d)\d\d.?/) do |sentence|
"#{sentence} (#{$1.next}th Century)"
end
Note the usage of String#gsub with block due to the fact that we need to perform a transformation on one of the matched groups.
Update: if you want the centuries to be in words, you could use an array to store them.
ordinals = %w(
  First Second Third Fourth Fifth Sixth Seventh Eighth Ninth Tenth Eleventh
  Twelfth Thirteenth Fourteenth Fifteenth Sixteenth Seventeenth Eighteenth
  Nineteenth Twentieth Twenty-First
)
text.gsub(/[\w ]*([012]\d)\d\d.?/) do |sentence|
  "#{sentence} (#{ordinals[$1.to_i]} Century)"
end
Update (2): Assuming you want to replace something completely different and you can't take advantage of number niceties like in the centuries example, implement the same general idea, just use a hash instead of array:
replacements = {'cat' => 'king', 'mat' => 'throne'}
"The cat sat on the mat.".gsub(/^(\w+ )(\w+)([\w ]+ )(\w+)\.$/) do
"#{$1}#{replacements[$2]}#{$3}#{replacements[$4]}."
end
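As a variation on the hash idea above, String#gsub also accepts a hash as its second argument and looks each matched text up in it, so no block and no conditional is needed. The following is only a sketch: the constant names REPLACEMENTS and PATTERN and the helper swap_words are made up for illustration, and the word-boundary anchors are an assumption about what should count as a match.

```ruby
# Hypothetical mapping, reusing the words from the example above.
REPLACEMENTS = { "cat" => "king", "mat" => "throne" }.freeze

# Build the alternation from the hash keys so the pattern and the mapping
# cannot drift apart. \b keeps "category" from matching "cat".
PATTERN = /\b#{Regexp.union(REPLACEMENTS.keys)}\b/

def swap_words(sentence)
  # With a Hash as the second argument, gsub replaces each match with the
  # value stored under the matched text.
  sentence.gsub(PATTERN, REPLACEMENTS)
end

puts swap_words("The cat sat on the mat.")
# prints: The king sat on the throne.
```

Unlike the positional regex above, this version does not depend on where in the sentence the words occur.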
Assuming the year is between 1 and 2099, you might do it as follows.
YEAR_TO_CENTURY = (1..21).to_a.zip(%w| First Second Third Fourth Fifth Sixth
  Seventh Eighth Ninth Tenth Eleventh Twelfth Thirteenth Fourteenth Fifteenth
  Sixteenth Seventeenth Eighteenth Nineteenth Twentieth Twentyfirst | ).to_h
#=> { 1=>"First", 2=>"Second", 3=>"Third", 4=>"Fourth", 5=>"Fifth", 6=>"Sixth",
#     7=>"Seventh", 8=>"Eighth", 9=>"Ninth", 10=>"Tenth", 11=>"Eleventh",
#     12=>"Twelfth", 13=>"Thirteenth", 14=>"Fourteenth", 15=>"Fifteenth",
#     16=>"Sixteenth", 17=>"Seventeenth", 18=>"Eighteenth", 19=>"Nineteenth",
#     20=>"Twentieth", 21=>"Twentyfirst" }
def centuryize(str)
str << " (%s Century)" % YEAR_TO_CENTURY[(str[/\d+(?=\.)/].to_i/100.0).ceil]
end
centuryize "The American Revolution started in 1775."
#=> "The American Revolution started in 1775. (Eighteenth Century)"
centuryize "The Battle of Hastings took place in 1066."
#=> "The Battle of Hastings took place in 1066. (Eleventh Century)"
centuryize "Nero played the fiddle while Rome burned in AD 64."
#=> "Nero played the fiddle while Rome burned in AD 64. (First Century)"
It would be easier if we could write "19th" century.
def centuryize(str)
  century = (str[/\d+(?=\.)/].to_i/100.0).ceil
  suffix =
    case century
    when 1, 21 then "st"
    when 2 then "nd"
    when 3 then "rd"
    else "th"
    end
  "%s (%d%s Century)" % [str, century, suffix]
end
centuryize "The American Revolution started in 1775."
# => "The American Revolution started in 1775. (18th Century)"
centuryize "The Battle of Hastings took place in 1066."
#=> "The Battle of Hastings took place in 1066. (11th Century)"
centuryize "Nero played the fiddle while Rome burned in AD 64."
#=> "Nero played the fiddle while Rome burned in AD 64. (1st Century)"
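The suffix table above only covers centuries 1 to 21. If later years have to be handled, the suffix can be looked up in hashes instead, which also avoids if and case altogether. This is a sketch under assumptions: the names TEENS, SUFFIXES, ordinal, century_of and centuryize_any are invented here, and the year is still assumed to be the digits directly before the final period.

```ruby
# 11, 12 and 13 always take "th", whatever their last digit says.
TEENS    = { 11 => "th", 12 => "th", 13 => "th" }.freeze
# Any last digit other than 1, 2 or 3 falls back to the default "th".
SUFFIXES = Hash.new("th").update(1 => "st", 2 => "nd", 3 => "rd").freeze

def ordinal(n)
  # Try the teens table first, then fall back to the last-digit table.
  suffix = TEENS.fetch(n % 100) { SUFFIXES[n % 10] }
  "#{n}#{suffix}"
end

def century_of(year)
  # Integer arithmetic: years 1801..1900 all map to 19.
  (year + 99) / 100
end

def centuryize_any(str)
  year = str[/\d+(?=\.)/].to_i
  "#{str} (#{ordinal(century_of(year))} Century)"
end

puts centuryize_any("The American Revolution started in 1775.")
# prints: The American Revolution started in 1775. (18th Century)
puts centuryize_any("The colony was founded in 2150.")
# prints: The colony was founded in 2150. (22nd Century)
```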
My plan:
Get everything after Send to: and the end of that line.
Get everything between Attn: and the end of that line.
NOTE: The Attn line could be optional. In that case, just return the first line.
The string looks like this:
str = <<-MSG
Registry of Credit Recommendations
American Council on Education
One Dupont Circle, NW
Washington, D.C. 20036
Transcript Print Date: 10/03/2018
Sent By:Send To: American University
4400 Massachusetts Avenue, NW
Washington, DC 20016-8001
Attn: Undergraduate Admissions
Jonathan A Jones
30 People's Court
Second Address Line
Third Address Line
Augusta, GA 30909
MSG
Expected return value must be:
American University
Attn: Undergraduate Admissions
Notice that the "Attn: " part must be included, not just its content.
Here is my approach, which only works for the Attn part, but I have no idea how to get the "American University" part.
regex = /Attn:([^\r\n]+)[\r\n]+/
Test: http://rubular.com/r/Px4ru6WrAg
Appreciate your help.
You could use an alternation
(?<=Send To:).*|Attn:.*
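(?<=Send To:) Positive lookbehind asserting that Send To: is directly to the left
.* Match any character except a newline, zero or more times (this includes the space after the colon, so strip the result)
| or
Attn:.* Match Attn: followed by any character except a newline, zero or more times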
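In Ruby, the alternation can be applied with String#scan. This is a minimal sketch: the constant MESSAGE is a shortened copy of the string from the question, and recipient_lines is a made-up helper name. The strip call removes the space that the lookbehind variant leaves in front of the university name.

```ruby
# Shortened version of the message from the question.
MESSAGE = <<~MSG
  Transcript Print Date: 10/03/2018
  Sent By:Send To: American University
  4400 Massachusetts Avenue, NW
  Washington, DC 20016-8001
  Attn: Undergraduate Admissions
  Jonathan A Jones
MSG

def recipient_lines(text)
  # Without capture groups, scan returns each whole match as a string.
  # "." never crosses a newline, so every match stays on its own line.
  text.scan(/(?<=Send To:).*|Attn:.*/).map(&:strip)
end

p recipient_lines(MESSAGE)
# => ["American University", "Attn: Undergraduate Admissions"]
```

If the Attn line is missing, the same call simply returns a one-element array with the Send To value.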
Note that you don't have to use a regular expression.
str.each_line.
  map do |line|
    case
    when line.include?("Send To: ")
      line[line.index("Send To: ") + "Send To: ".size..-2]
    when line.include?("Attn: ")
      line[line.index("Attn: ")..-2]
    else
      nil
    end
  end.compact
#=> ["American University", "Attn: Undergraduate Admissions"]
-2 excludes the newline character that ends each line.
I am trying to extract possible author names from an article. I am working under the assumption that the author name is in a byline
"By FirstName LastName"
or
"By FirstName MiddleName LastName"
and the first, middle and last names all start with a capital letter.
How can I use a regex to extract all 2-3 word strings that follow "By", that also meet the above conditions?
For instance, if the article has the text
"By Barack Obama on January 20th 2017. By January 2017, we all know Obama will no longer be the president"
it would extract
"Barack Obama"
and
"January"
as possible author names, and I will then do the work of determining which is the right one.
Currently my regex is:
/By ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)/
However, when I use this on the string
"By Alex Jackson Olerud"
it seems to return both
"Alex Jackson Olerud"
and
" Olerud"
I am using Ruby as my preferred language, but any language-agnostic solution would suffice.
Here's my suggestion:
str = "By Barack Obama on January 20th 2017. By January 2017, we all know Obama will no longer be the president.
By A. B. Cecil"
def find_authors(str)
str.scan(/
(?<name> # a named capture group for one of the names
\p{Lu} # starts with an upper case letter, unicode so will work also for e.g. Åsa
(?: \. | \p{Ll}+) # followed by a period or some lower case letters
){0} # zero matches, this is just a subroutine to be used again
(?<=[Bb]y\s) # lookbehind to make sure the author is after a by or By
(?<wholename> # capture group to extract the whole name
\g<name> (\s \g<name>){1,2} # a name should have at least two components
)/x).map(&:last) # remove the match by the <name> group from the result
end
def find_authors_oneline(str)
str.scan(/(?<name>\p{Lu}(?:\.|\p{Ll}+)){0}(?<=[Bb]y\s)(?<wholename>\g<name>(\s\g<name>){1,2})/).map(&:last)
end
p find_authors str
>> ["Barack Obama", "A. B. Cecil"]
p find_authors_oneline str
>> ["Barack Obama", "A. B. Cecil"]
You can read about regex subroutines and the regex /x modifier
I think the second capture group (\s+[A-Z][\w-]*) is throwing you off. Try using a non-capture group like (?:\s+[A-Z][\w-]*)
str = "By Barack Obama on January 20th 2017. By January 2017, we all know Obama will no longer be the president"
str.scan(/(?:By )((?:[A-Z][A-Za-z]+ ?+)+)/).flatten.map(&:strip)
#=> ["Barack Obama", "January"]
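To see why the extra " Olerud" shows up: when a pattern contains capture groups, String#scan returns one array per match with one entry per group, and a repeated group reports only its last repetition. The sketch below applies the non-capturing change to the regex from the question, otherwise unchanged; the names BYLINE and byline_candidates are made up for illustration. Note that this pattern, like the original, requires at least two capitalized words, so it does not return "January".

```ruby
# The question's regex with the inner group made non-capturing, so the
# only capture left is the whole name.
BYLINE = /By ([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)/

def byline_candidates(text)
  # scan yields [[name], [name], ...]; flatten gives a plain list of names.
  text.scan(BYLINE).flatten
end

p byline_candidates("By Alex Jackson Olerud")
# => ["Alex Jackson Olerud"]
```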
I have the following sentence:
"We bought 3.5 million shirts."
I want to create an array with all of the words and punctuation but not the number including the decimal point.
I have the following regex:
/[\D]+/
However this still grabs the decimal point between the numbers as follows:
["We", "bought", ".", "million", "shirts", "."]
I want the result to be as follows:
["We", "bought", "million", "shirts", "."]
Notice that the "." from the number is excluded.
How can I still select periods at the end of sentences but not decimal points that occur before a number?
I suggest a small enhancement: replace \D+ with \p{L}+ (or [[:alpha:]]+) to match only one or more letters, and then restrict [[:punct:]] so that it does not match a . followed by a digit (using the negative lookahead (?!\.\d)):
s = "We bought 3.5 million shirts."
res = s.scan(/\p{L}+|(?!\.\d)[[:punct:]]/)
p res # => ["We", "bought", "million", "shirts", "."]
Another approach is to first remove all numbers with \d*\.?\d+ regex and then collect the "words" with punctuation:
s = "We bought 3.5 million shirts."
res = s.gsub(/\d*\.?\d+/, '').scan(/\w+|\p{P}/)
Try this
str = "We bought 3.5 million shirts."
str.scan(/[[:alpha:]]+|[[:punct:]](?![[:digit:]])/)
# => ["We", "bought", "million", "shirts", "."]
How does this work?
[[:alpha:]]+ selects one or more letters, i.e. words
[[:punct:]](?![[:digit:]]) selects punctuation that is not followed by a digit
You can try this:
a="We bought 3.5 million shirts 15 dolalr.;"
b=a.split(/\s+\d*\.?\d*\s*|([.,;])|[\s]+/)
puts b
Output array:
We
bought
million
shirts
dolalr
.
I'm parsing a pdf that has some dates by splitting the lines and then searching them. The following are example lines:
Posted Date: 02/11/2015
Effective Date: 02/05/2015
When I find Posted Date, I split on the : and pull out 02/11/2015. But when I do the same for effective date, it only returns /05/2015. When I write all lines, it displays that date as /05/2015 while the PDF has the 02. Would 02 be converted to nil for some reason? Am I missing something?
lines = reader.pages[0].text.split(/\r?\n/)
lines.each_with_index do |line, index|
  values_to_insert = []
  if line.include? "Legal Name:"
    name_line = line.split(":")
    values_to_insert.push(name_line[1])
  end
  if line.include? "Active/Pending Insurance"
    topLine = lines[index+2].split(" ")
    middleLine = lines[index+5].split(" ")
    insuranceLine = lines[index + 7]
    insurance_line_split = insuranceLine.split(" ")
    insurance_line_split.each_with_index do |word, i|
      if word.include? "Insurance"
        values_to_insert.push(insuranceLine.split(":")[1])
      end
    end
    topLine.each_with_index do |word, i|
      if word.include? "Posted"
        values_to_insert.push(topLine[i + 2])
      end
    end
    middleLine.each_with_index do |word, i|
      if word.include? "Effective" or word.include? "Cancellation"
        #puts middleLine[0]
        puts middleLine[1]
        #puts middleLine[i + 1].split(":")[1]
      end
    end
  end
end
Here is what happens when I print all lines:
Active/Pending Insurance:
Form: 91X Type: BIPD/Primary Posted Date: 02/11
/2015
Policy/Surety Number:A 3491819 Coverage From: $0
To: $1,000,000
Effective Date:/05/2015 Cancellation Date:
Insurance Carrier: PROGRESSIVE EXPRESS INSURANCE COMPANY
Attn: CUSTOMER SERVICE
Address: P. O. BOX 94739
CLEVELAND, OH 44101 US
Telephone: (800) 444 - 4487 Fax: (440) 603 - 4555
Edited to show the code and even add a picture. I'm splitting by lines and then splitting again on colons and sometimes spaces. It's not amazingly clean but I don't think there's a much better way.
The problem occurs where multiple pieces of text are on the same line but do not share exactly the same baseline. In the PDF at hand, (at least) the policy number and the effective date are positioned slightly higher than their respective labels.
The cause for this is the way the pdf-reader library used by the OP brings together the text pieces drawn on the page:
It determines a number of columns and rows in which to arrange the letters and
creates an array with one string per row, each filled with as many spaces as there are columns.
It then combines consecutive text pieces from the PDF that lie on exactly the same baseline and
finally writes these combined text pieces into the string array, starting from the position that best matches their starting position in the PDF.
As fonts used in PDFs usually are not monospaced, this procedure can result in overlapping strings, i.e. erasure of one of the two. The step combining strings on the same baseline prevents erasure in that case, but for strings on slightly different base lines, this overlapping effect can still occur.
What one can do, is increase the number of columns used here.
The library in page_layout.rb defines
def col_count
  @col_count ||= ((@page_width / @mean_glyph_width) * 1.05).floor
end
As you see there already is some magic number 1.05 in use to slightly increase the number of columns. By increasing this number even more, no erasures as observed by the OP should occur anymore. One should not increase the factor too much, though, because that can introduce unwanted space characters where none belong.
The OP reported that increasing the magic number to 1.10 sufficed in his case.
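The erasure effect can be reproduced with a toy model. To be clear, the following is not pdf-reader's code: layout_row, the x coordinates, the page width and both column counts are invented purely to illustrate the mechanism described above, and the column count that removes the overlap in a real document will differ.

```ruby
# Toy model of grid-based text layout (NOT pdf-reader's implementation).
def layout_row(runs, page_width:, col_count:)
  row = " " * col_count                # one row of blank columns
  runs.each do |x, text|
    col = x * col_count / page_width   # integer division picks the start column
    row[col, text.length] = text       # overwrites whatever is already there
  end
  row.strip
end

# Two runs on slightly different baselines, so they are not merged first.
# In this made-up content stream the date is drawn before its label.
RUNS = [[372, "02/05/2015"], [300, "Effective Date:"]].freeze

puts layout_row(RUNS, page_width: 600, col_count: 105)
# prints: Effective Date:/05/2015
puts layout_row(RUNS, page_width: 600, col_count: 125)
# prints: Effective Date:02/05/2015
```

With 105 columns the label ends two columns past the start of the date and erases the "02"; with more columns the two runs start further apart and no longer collide.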
I am trying to extract a US address from a text.
So if I have the following variations of text then I'd like to extract the address portion
Today is a good day to meet up at a
bar. the address is 123 fake street,
NY, 23423-3423
just came from 423 Elm Street, kk, 34223 ...had awesome time
blah blah bleh blah 23414 Fake Terrace, MM something else
experimented my teleporter to get to work but reached at 2423 terrace NY
If someone can provide some starting points then I can mold it for other variations.
At some point, you'd have to clarify what you consider an address to be.
Does an address just have a street number and street name?
Does an address have a street name, and a city name?
Does an address have a city name, a state name?
Does an address have a city name, a state abbreviation, and a zip code? What format is the zip code in?
It's easy to see how you can run into trouble quickly.
This obviously wouldn't catch everything, but maybe you could match strings that start with a street number, have a state abbreviation somewhere in the middle, and end in a zip code. The reliability of this depends greatly on knowing what sort of text you are using as input; i.e., if the text contains many other numbers, this could be completely useless.
possible regex
\d+.+(?=AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY)[A-Z]{2}[, ]+\d{5}(?:-\d{4})?
sample input
hello world this is me posting an address. please go to 312 N whatever st., New York NY 10001.
If you can find me there. I might be at 123 Invalid address.
Please send all letters to 115A Address Street, Suite 100, Google KS, 66601
42 NE Another Address, Some City with 9 digit zip, AK 55555-2143
Hope this helps!
matches
312 N whatever st., New York NY 10001
115A Address Street, Suite 100, Google KS, 66601
42 NE Another Address, Some City with 9 digit zip, AK 55555-2143
regex explanation
\d+ digits (0-9) (1 or more times (matching the most amount possible))
.+ any character except \n (1 or more times (matching the most amount possible))
(?= look ahead to see if there is:
AL|AK|AS|... 'AL', 'AK', 'AS', ... (valid state abbreviations)
) end of look-ahead
[A-Z]{2} any character of: 'A' to 'Z' (2 times)
[, ]+ any character of: ',', ' ' (1 or more times (matching the most amount possible))
\d{5} digits (0-9) (5 times)
(?: group, but do not capture (optional (matching the most amount possible)):
- '-'
\d{4} digits (0-9) (4 times)
)? end of grouping
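Since the question is tagged for Ruby, here is a sketch of how the pattern above could be used with String#scan. The constant and method names (STATES, ADDRESS, SAMPLE, extract_addresses) are made up; the state list and the sample text are copied from above, and building the alternation from an array is only a readability choice.

```ruby
# State and territory abbreviations, copied from the regex above.
STATES = %w[
  AL AK AS AZ AR CA CO CT DE DC FM FL GA GU HI ID IL IN IA KS KY LA ME MH MD
  MA MI MN MS MO MT NE NV NH NJ NM NY NC ND MP OH OK OR PW PA PR RI SC SD TN
  TX UT VT VI VA WA WV WI WY
].freeze

# Same structure as the pattern above: street number, anything, a valid
# state abbreviation, separators, then a 5 or 9 digit zip code.
ADDRESS = /\d+.+(?=#{STATES.join("|")})[A-Z]{2}[, ]+\d{5}(?:-\d{4})?/

def extract_addresses(text)
  # The pattern has no capture groups, so scan returns whole matches.
  text.scan(ADDRESS)
end

SAMPLE = <<~TEXT
  hello world this is me posting an address. please go to 312 N whatever st., New York NY 10001.
  If you can find me there. I might be at 123 Invalid address.
  Please send all letters to 115A Address Street, Suite 100, Google KS, 66601
  42 NE Another Address, Some City with 9 digit zip, AK 55555-2143
  Hope this helps!
TEXT

puts extract_addresses(SAMPLE)
```

Because . does not match a newline, an address that wraps across lines, as in the first example of the question, would not be found by this sketch.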