how to extract address from a text using regex in Ruby - ruby

I am trying to extract a US address from a text.
So if I have the following variations of text then I'd like to extract the address portion
Today is a good day to meet up at a
bar. the address is 123 fake street,
NY, 23423-3423
just came from 423 Elm Street, kk, 34223 ...had awesome time
blah blah bleh blah 23414 Fake Terrace, MM something else
experimented my teleporter to get to work but reached at 2423 terrace NY
If someone can provide some starting points then I can mold it for other variations.

At some point, you'd have to clarify what you consider an address to be.
Does an address just have a street number and street name?
Does an address have a street name, and a city name?
Does an address have a city name, a state name?
Does an address have a city name, a state abbreviation, and a zip code? What format is the zip code in?
It's easy to see how you can run into trouble quickly.
This obviously wouldn't catch everything, but you could match strings that start with a street number, have a state abbreviation somewhere in the middle, and end in a zip code. The reliability of this depends greatly on what sort of text you are using as input; i.e., if there are a lot of other numbers in the text, it could be completely useless.
possible regex
\d+.+(?=AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY)[A-Z]{2}[, ]+\d{5}(?:-\d{4})?
sample input
hello world this is me posting an address. please go to 312 N whatever st., New York NY 10001.
If you can find me there. I might be at 123 Invalid address.
Please send all letters to 115A Address Street, Suite 100, Google KS, 66601
42 NE Another Address, Some City with 9 digit zip, AK 55555-2143
Hope this helps!
matches
312 N whatever st., New York NY 10001
115A Address Street, Suite 100, Google KS, 66601
42 NE Another Address, Some City with 9 digit zip, AK 55555-2143
regex explanation
\d+ digits (0-9) (1 or more times (matching the most amount possible))
.+ any character except \n (1 or more times (matching the most amount possible))
(?= look ahead to see if there is:
AL|AK|AS|... 'AL', 'AK', 'AS', ... (valid state abbreviations)
) end of look-ahead
[A-Z]{2} any character of: 'A' to 'Z' (2 times)
[, ]+ any character of: ',', ' ' (1 or more times (matching the most amount possible))
\d{5} digits (0-9) (5 times)
(?: group, but do not capture (optional (matching the most amount possible)):
- '-'
\d{4} digits (0-9) (4 times)
)? end of grouping
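As a rough sketch of how the pattern above could be applied in Ruby, the snippet below runs it over the sample input with String#scan. The names STATES, ADDRESS_RE and extract_addresses are made up for this illustration; the state list is copied from the regex above. Note that `.` does not match a newline by default, so an address that wraps across two lines (as in the first example of the question) would not be picked up by this version.

```ruby
# State and territory abbreviations copied from the pattern above.
STATES = %w[
  AL AK AS AZ AR CA CO CT DE DC FM FL GA GU HI ID IL IN IA KS KY LA ME MH MD
  MA MI MN MS MO MT NE NV NH NJ NM NY NC ND MP OH OK OR PW PA PR RI SC SD TN
  TX UT VT VI VA WA WV WI WY
].freeze

# Same regex as above, with the alternation built from the list.
ADDRESS_RE = /\d+.+(?=#{STATES.join('|')})[A-Z]{2}[, ]+\d{5}(?:-\d{4})?/

# Hypothetical helper: returns every address-like substring found in text.
def extract_addresses(text)
  text.scan(ADDRESS_RE)
end

SAMPLE = <<~TEXT
  hello world this is me posting an address. please go to 312 N whatever st., New York NY 10001.
  If you can find me there. I might be at 123 Invalid address.
  Please send all letters to 115A Address Street, Suite 100, Google KS, 66601
  42 NE Another Address, Some City with 9 digit zip, AK 55555-2143
  Hope this helps!
TEXT

puts extract_addresses(SAMPLE)
```

Because the pattern has no capturing groups, scan returns the full matches as an array of strings.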

Related

Ruby regex - single match across lines

My plan:
Get everything after Send to: and the end of that line.
Get everything between Attn: and the end of that line.
NOTE: The Attn line could be optional. In that case, just return the first line.
The string looks like this:
str = <<-MSG
Registry of Credit Recommendations
American Council on Education
One Dupont Circle, NW
Washington, D.C. 20036
Transcript Print Date: 10/03/2018
Sent By:Send To: American University
4400 Massachusetts Avenue, NW
Washington, DC 20016-8001
Attn: Undergraduate Admissions
Jonathan A Jones
30 People's Court
Second Address Line
Third Address Line
Augusta, GA 30909
MSG
Expected return value must be:
American University
Attn: Undergraduate Admissions
**Notice the "Attn: " part must be included, not just the content of it. **
Here is my approach, which only works for the Attn part, but I have no idea how to get the "American University" part.
regex = /Attn:([^\r\n]+)[\r\n]+/
Test: http://rubular.com/r/Px4ru6WrAg
Appreciate your help.
You could use an alternation
(?<=Send To:).*|Attn:.*
(?<=Send To:) Positive lookbehind to assert that what is on the left is Send To:, then .* matches any character zero or more times
| or
Attn:.* Match Attn: followed by any character zero or more times
Regex demo
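A minimal sketch of how the alternation could be used from Ruby, assuming String#scan is acceptable here. The helper name recipient_lines and the shortened sample text are made up for this illustration. Since the lookbehind stops right after the colon, the first match starts with a space, so each result is stripped.

```ruby
# Hypothetical helper: returns the "Send To:" value and the whole "Attn:" line.
def recipient_lines(str)
  # .* stops at the end of the line because . does not match a newline.
  str.scan(/(?<=Send To:).*|Attn:.*/).map(&:strip)
end

MSG_WITH_ATTN = <<~MSG
  Transcript Print Date: 10/03/2018
  Sent By:Send To: American University
  4400 Massachusetts Avenue, NW
  Washington, DC 20016-8001
  Attn: Undergraduate Admissions
  Jonathan A Jones
MSG

p recipient_lines(MSG_WITH_ATTN)
```

If the Attn: line is missing, scan simply returns a one-element array containing the Send To: value.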
Note that you don't have to use a regular expression.
str.each_line.map do |line|
  case
  when line.include?("Send To: ")
    line[line.index("Send To: ") + "Send To: ".size..-2]
  when line.include?("Attn: ")
    line[line.index("Attn: ")..-2]
  else
    nil
  end
end.compact
#=> ["American University", "Attn: Undergraduate Admissions"]
-2 excludes the newline character that ends each line.

What is a regex to extract words and punctuation but ignore decimals and numbers?

I have the following sentence:
"We bought 3.5 million shirts."
I want to create an array with all of the words and punctuation but not the number including the decimal point.
I have the following regex:
/[\D]+/
However this still grabs the decimal point between the numbers as follows:
["We", "bought", ".", "million", "shirts", "."]
I want the result to be as follows:
["We", "bought", "million", "shirts", "."]
Notice that the "." from the number is excluded.
How can I still select periods at the end of sentences but not decimal points that occur before a number?
I suggest a small enhancement: replace \D+ with \p{L}+ (or [[:alpha:]]+) to match only one or more letters, and restrict [[:punct:]] so that it does not match a . followed by a digit (using the negative lookahead (?!\.\d)):
s = "We bought 3.5 million shirts."
res = s.scan(/\p{L}+|(?!\.\d)[[:punct:]]/)
p res # => ["We", "bought", "million", "shirts", "."]
See the Ruby demo
Another approach is to first remove all numbers with \d*\.?\d+ regex and then collect the "words" with punctuation:
s = "We bought 3.5 million shirts."
res = s.gsub(/\d*\.?\d+/, '').scan(/\w+|\p{P}/)
See this Ruby demo
Try this
str = "We bought 3.5 million shirts."
str.scan(/[[:alpha:]]+|[[:punct:]](?![[:digit:]])/)
# => ["We", "bought", "million", "shirts", "."]
How does this work?
[[:alpha:]]+ selects one or more letters, i.e. words
[[:punct:]](?![[:digit:]]) selects punctuation that is not followed by a digit
You can try this:
a="We bought 3.5 million shirts 15 dolalr.;"
b=a.split(/\s+\d*\.?\d*\s*|([.,;])|[\s]+/)
puts b
Try it here
Output array:
We
bought
million
shirts
dolalr
.

Ruby Regex gsub! without using if

First of all, full disclosure, I am working on a homework assignment. The example I'm giving is not the exact problem, but will help me understand what I need to do. I'm not looking for a spoon-fed answer but to understand what is going on.
I am trying to take a string such as:
"The Civil War started in 1861."
"The American Revolution started in 1775."
In this example I would like to return the same string, but with the appropriate century in parentheses after it:
"The Civil War started in 1861. (Nineteenth Century)"
"The American Revolution started in 1775. (Eighteenth Century)"
I am able to group what I need using the following regex
text.gsub!(/([\w ]*)(1861|1775).?/, '\1\2 (NOT SURE HERE)')
It would be easy using grouping to say if \2 == 1861 append appropriate century, but the specifications say no if statements may be used and I am very lost. Also, the alternation I used in this example only works for the 2 years listed and I know that a better form of range-matching would have to be used to catch full centuries as opposed to those 2 single years.
Firstly - how to remove the hardcoding of the years:
text.gsub!(/([\w ]*)([012]\d{3}).?/, '\1\2 (NOT SURE HERE)')
This should handle things for the next ~1k years. If you know for a fact that the dates are restricted to given periods, you can be more specific.
For the other part - the century is just the first two digits plus one. So split the year in two and increment.
text.gsub(/[\w ]*([012]\d)\d\d.?/) do |sentence|
  "#{sentence} (#{$1.next}th Century)"
end
Note the usage of String#gsub with block due to the fact that we need to perform a transformation on one of the matched groups.
Update: if you want the centuries to be in words, you could use an array to store them.
ordinals = %w(
  First Second Third Fourth Fifth Sixth Seventh Eighth Ninth Tenth Eleventh
  Twelfth Thirteenth Fourteenth Fifteenth Sixteenth Seventeenth Eighteenth
  Nineteenth Twentieth Twenty-First
)
text.gsub(/[\w ]*([012]\d)\d\d.?/) do |sentence|
  "#{sentence} (#{ordinals[$1.to_i]} Century)"
end
Update (2): Assuming you want to replace something completely different and you can't take advantage of number niceties like in the centuries example, implement the same general idea, just use a hash instead of array:
replacements = {'cat' => 'king', 'mat' => 'throne'}
"The cat sat on the mat.".gsub(/^(\w+ )(\w+)([\w ]+ )(\w+)\.$/) do
  "#{$1}#{replacements[$2]}#{$3}#{replacements[$4]}."
end
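A related sketch, in case it helps: String#gsub also accepts a Hash as its second argument, in which case each matched substring is looked up in the hash and replaced by the corresponding value. That avoids both the block and any if statement. The constant names below are made up for this illustration, and the \b word boundaries are an added assumption so that words such as "format" are left alone.

```ruby
REPLACEMENTS = { 'cat' => 'king', 'mat' => 'throne' }.freeze

# Build /\b(cat|mat)\b/ from the hash keys; Regexp.union escapes them safely.
WORD_PATTERN = /\b#{Regexp.union(REPLACEMENTS.keys)}\b/

# Each match is used as a key into REPLACEMENTS.
puts "The cat sat on the mat.".gsub(WORD_PATTERN, REPLACEMENTS)
```

The pattern must only match strings that are keys of the hash; a match with no corresponding key is replaced by an empty string.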
Assuming the year is between 1 and 2099, you might do it as follows.
YEAR_TO_CENTURY = (1..21).to_a.zip(%w| First Second Third Fourth Fifth Sixth
Seventh Eighth Ninth Tenth Eleventh Twelfth Thirteenth Fourteenth Fifteenth
Sixteenth Seventeenth Eighteenth Nineteenth Twentieth Twentyfirst | ).to_h
#=> { 1=>"First", 2=>"Second", 3=>"Third", 4=>"Fourth", 5=>"Fifth", 6=>"Sixth",
# 7=>"Seventh", 8=>"Eighth", 9=>"Ninth", 10=>"Tenth", 11=>"Eleventh",
# 12=>"Twelfth", 13=>"Thirteenth", 14=>"Fourteenth", 15=>"Fifteenth",
# 16=>"Sixteenth", 17=>"Seventeenth", 18=>"Eighteenth", 19=>"Nineteenth",
# 20=>"Twentieth", 21=>"Twentyfirst" }
def centuryize(str)
  str << " (%s Century)" % YEAR_TO_CENTURY[(str[/\d+(?=\.)/].to_i/100.0).ceil]
end
centuryize "The American Revolution started in 1775."
#=> "The American Revolution started in 1775. (Eighteenth Century)"
centuryize "The Battle of Hastings took place in 1066."
#=> "The Battle of Hastings took place in 1066. (Eleventh Century)"
centuryize "Nero played the fiddle while Rome burned in AD 64."
#=> "Nero played the fiddle while Rome burned in AD 64. (First Century)"
It would be easier if we could write "19th" century.
def centuryize(str)
  century = (str[/\d+(?=\.)/].to_i/100.0).ceil
  suffix =
    case century
    when 1, 21 then "st"
    when 2 then "nd"
    when 3 then "rd"
    else "th"
    end
  "%s (%d%s Century)" % [str, century, suffix]
end
centuryize "The American Revolution started in 1775."
# => "The American Revolution started in 1775. (18th Century)"
centuryize "The Battle of Hastings took place in 1066."
#=> "The Battle of Hastings took place in 1066. (11th Century)"
centuryize "Nero played the fiddle while Rome burned in AD 64."
#=> "Nero played the fiddle while Rome burned in AD 64. (1st Century)"

Matching repeated pattern in string

I have street names and numbers in a file, like so:
Sokolov 19, 20, 23 ,25
Hertzl 80,82,84,86
Hertzl 80a,82b,84e,90
Aba Hillel Silver 2,3,5,6,
Weizman 8
Ahad Ha'am 9 13 29
I parse the lines one by one with regex. I want a regex that will find and match:
The name of the street,
The street numbers with its possible a,b,c,d attached.
I've come up with this mean while:
/(\D{2,})\s+(\d{1,3}[a-d|א-ד]?)(?:[,\s]{1,3})?/
It finds the street name and first number. I need to find all the numbers.
I don't want to use two separate regex's if possible, and I prefer not to use Ruby's scan but just have it in one regex.
You can use regex to find all the numbers, with their separators:
re = /\A(.+?)\s+((?:\d+[a-z]*[,\s]+)*\d+[a-z]*)/
txt = "Sokolov 19, 20, 23 ,25
Hertzl 80,82,84,86
Hertzl 80a,82b,84e,90
Aba Hillel Silver 2,3,5,6,
Weizman 8
Ahad Ha'am 9 13 29"
matches = txt.lines.map{ |line| line.match(re).to_a[1..-1] }
p matches
#=> [["Sokolov", "19, 20, 23 ,25"],
#=> ["Hertzl", "80,82,84,86"],
#=> ["Hertzl", "80a,82b,84e,90"],
#=> ["Aba Hillel Silver", "2,3,5,6"],
#=> ["Weizman", "8"],
#=> ["Ahad Ha'am", "9 13 29"]]
The above regex says:
\A Starting at the front of the string
(…) Capture the result
.+? Find one or more characters, as few as possible that make the rest of this pattern match.
\s+ Followed by one or more whitespace characters (which we don't capture)
(…) Capture the result
(?:…)* Find zero or more of what's in here, but don't capture them
\d+ One or more digits (0–9)
[a-z]* Zero or more lowercase letters
[,\s]+ One or more commas and/or whitespace characters
\d+ Followed by one or more digits
[a-z]* And zero or more lowercase letters
However, if you want to break the numbers up into pieces you will need to use scan or split or the equivalent.
result = matches.map{ |name,numbers| [name,numbers.scan(/[^,\s]+/)] }
p result
#=> [["Sokolov", ["19", "20", "23", "25"]],
#=> ["Hertzl", ["80", "82", "84", "86"]],
#=> ["Hertzl", ["80a", "82b", "84e", "90"]],
#=> ["Aba Hillel Silver", ["2", "3", "5", "6"]],
#=> ["Weizman", ["8"]],
#=> ["Ahad Ha'am", ["9", "13", "29"]]]
This is because regex captures inside a repeating group do not capture each repetition. For example:
re = /((\d+) )+/
txt = "hello 11 2 3 44 5 6 77 world"
p txt.match(re)
#=> #<MatchData "11 2 3 44 5 6 77 " 1:"77 " 2:"77">
The regex as a whole matches the entire run of numbers, but each capture only saves the last-seen instance. In this case, the outer capture only gets "77 " and the inner capture only gets "77".
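To make the contrast concrete, here is a small sketch comparing match with a repeated group against scan with a single-instance pattern. The helper names are made up for this illustration.

```ruby
# With a repeated group, only the last repetition survives in the capture.
def last_repetition(txt)
  m = txt.match(/((\d+) )+/)
  m && m[2]
end

# Writing the pattern for ONE instance and letting scan repeat it
# returns every occurrence.
def all_repetitions(txt)
  txt.scan(/(\d+) /).flatten
end

txt = "hello 11 2 3 44 5 6 77 world"
p last_repetition(txt)
p all_repetitions(txt)
```

For the sample string, the first call yields only "77", while the second yields all seven numbers in order.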
Why do you prefer not to use scan? This is what it is made for.
If you want your 3rd example to work, you need to have the [a-d] change to include the e in the range. After changing that you can use (\D{2,})\s+(\d{1,3}[a-e]?(?:[,\s]{1,3})*)*. Using the examples you gave I did some testing using Rubular.
Using some more groupings, you can put the repetition on those last few conditions (which seem to be pretty tricky). This way the spacing and comma at the end get caught in the repetition after the initial space is consumed.
The only way around the limitation that you can only capture the last instance of a repeated expression is to write your regex for a single instance and let the regex machine do the repeating for you, as occurs with the global substitute options, admittedly similar to scan. Unfortunately, in that case, you have to match for either the street name or the street number and then have no way to easily associate the captured numbers with the captured names.
Regex is great at what it does, but when you try to extend its application beyond its natural limitations, it's not pretty. ;-)
I want a regex that will find and match....
Do the street names also contain digits (0-9), other characters beside an apostrophe?
Are the street numbers based off arbitrary data? Is it always just an optional a, b, c, or d?
Are you needing a minimum and maximum limitation of string length?
Here are some possible options:
If you are unsure about what the street name contains, but know that your street numbers will be digits with an optional letter, separated by commas or spaces:
/^(.*?)\s+(\d+(?:[a-z]?[, ]+\d+)*)(?=,|$)/
See working demo
If the street names contain only letters with optional apostrophes, and the street numbers are digits with an optional letter, separated by commas or spaces:
/^([a-zA-Z' ]+)\s+(\d+(?:[a-z]?[, ]+\d+)*)(?=,|$)/
See working demo
If your street name and street number patterns are always consistent, you could simply do:
/^([a-zA-Z' ]+)\s+([0-9a-z, ]+)$/
See working demo
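As a rough illustration, the first of these options could be applied line by line as sketched below. The named groups, the constant STREET_RE and the helper parse_street_line are additions made for this example and are not part of the patterns above; splitting the captured number list into individual numbers still takes a second step.

```ruby
# First option from above, with named groups added for readability.
STREET_RE = /^(?<name>.*?)\s+(?<numbers>\d+(?:[a-z]?[, ]+\d+)*)(?=,|$)/

# Hypothetical helper: returns a hash for a matching line, nil otherwise.
def parse_street_line(line)
  m = STREET_RE.match(line.strip)
  return nil unless m

  # Second step: pull each number (with its optional letter) out of the list.
  { name: m[:name], numbers: m[:numbers].scan(/\d+[a-z]?/) }
end

p parse_street_line("Hertzl 80a,82b,84e,90")
p parse_street_line("Aba Hillel Silver 2,3,5,6,")
```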

stripping street numbers from street addresses

Using Ruby (newb) and Regex, I'm trying to parse the street number from the street address. I'm not having trouble with the easy ones, but I need some help on:
'6223 1/2 S FIGUEROA ST' ==> 'S FIGUEROA ST'
Thanks for the help!!
UPDATE(s):
'6223 1/2 2ND ST' ==> '2ND ST'
and from @pesto
'221B Baker Street' ==> 'Baker Street'
This will strip anything at the front of the string until it hits a letter:
street_name = address.gsub(/^[^a-zA-Z]*/, '')
If it's possible to have something like "221B Baker Street", then you have to use something more complex. This should work:
street_name = address.gsub(/^((\d[a-zA-Z])|[^a-zA-Z])*/, '')
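A quick sketch of how the two patterns behave on the inputs from the question. The constant and method names are made up for this illustration, and sub is used instead of gsub since the pattern is anchored and can only match once.

```ruby
LEADING_NON_LETTERS = /^[^a-zA-Z]*/
LEADING_WITH_UNIT   = /^((\d[a-zA-Z])|[^a-zA-Z])*/

# Hypothetical helper: removes whatever the given pattern matches at the front.
def strip_front(address, pattern)
  address.sub(pattern, '')
end

p strip_front('6223 1/2 S FIGUEROA ST', LEADING_NON_LETTERS)
p strip_front('221B Baker Street', LEADING_NON_LETTERS)
p strip_front('221B Baker Street', LEADING_WITH_UNIT)
```

The simple pattern leaves "B Baker Street" for the second input, which is why the longer pattern is needed there. Neither pattern keeps an ordinal street name intact: '6223 1/2 2ND ST' comes out as "ND ST" with the first and "D ST" with the second.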
Group matching:
.*\d\s(.*)
If you need to also take into account apartment numbers:
.*\d.*?\s(.*)
Which would take care of 123A Street Name
That should strip the numbers at the front (and the space) so long as there are no other numbers in the string. Just capture the first group (.*)
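In Ruby, one way to grab just the first group is String#[] with a regex and a capture index, as in this sketch. The constant and method names are made up for this illustration.

```ruby
AFTER_NUMBER      = /.*\d\s(.*)/
AFTER_NUMBER_UNIT = /.*\d.*?\s(.*)/

# Hypothetical helper: returns capture group 1, or nil when nothing matches.
def street_name(address, pattern)
  address[pattern, 1]
end

p street_name('6223 1/2 S FIGUEROA ST', AFTER_NUMBER)
p street_name('6223 1/2 2ND ST', AFTER_NUMBER)
p street_name('221B Baker Street', AFTER_NUMBER_UNIT)
```

Because .* is greedy, both patterns anchor on the last suitable digit in the string. The first pattern returns nil for '221B Baker Street' (no digit is directly followed by whitespace), and the second reduces '6223 1/2 2ND ST' to just "ST", so which one fits depends on the data.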
There's another stackoverflow set of answers:
Parse usable Street Address, City, State, Zip from a string
I think the google/yahoo decoder approach is best, but depends on how often/many addresses you're talking about - otherwise the selected answer would probably be the best
Can street names be numbers as well? E.g.
1234 45TH ST
or even
1234 45 ST
You could deal with the first case above, but the second is difficult.
I would split the address on spaces, skip any leading components that do not contain a letter and then join the remainder. I do not know Ruby, but here is a Perl example which also highlights the problem with my approach:
#!/usr/bin/perl
use strict;
use warnings;
my @addrs = (
    '6223 1/2 S FIGUEROA ST',
    '1234 45TH ST',
    '1234 45 ST',
);
for my $addr ( @addrs ) {
    my @parts = split / /, $addr;
    while ( @parts ) {
        my $part = shift @parts;
        if ( $part =~ /[A-Z]/ ) {
            print join(' ', $part, @parts), "\n";
            last;
        }
    }
}
C:\Temp> skip
S FIGUEROA ST
45TH ST
ST
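For reference, a rough Ruby equivalent of the same idea (split on spaces, skip leading components without a letter, join the rest) might look like the sketch below. The method name is made up, and matching lowercase letters as well is an added assumption; the Perl version only checks for uppercase.

```ruby
# Hypothetical helper: drops leading space-separated parts that contain
# no letter, then joins what is left.
def drop_leading_numbers(address)
  address.split(' ').drop_while { |part| part !~ /[A-Za-z]/ }.join(' ')
end

[
  '6223 1/2 S FIGUEROA ST',
  '1234 45TH ST',
  '1234 45 ST'
].each { |addr| puts drop_leading_numbers(addr) }
```

It has the same weakness as the Perl version: '1234 45 ST' is reduced to "ST". It also leaves '221B Baker Street' untouched, because "221B" contains a letter.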
Ouch! Parsing an address by itself can be extremely nasty unless you're working with standardized addresses. The reason for this is that the "primary number", which is often called the house number, can be at various locations within the string, for example:
RR 2 Box 15 (RR can also be Rural Route, HC, HCR, etc.)
PO Box 17
12B-7A
NW95E235
etc.
It's not a trivial undertaking. Depending upon the needs of your application, your best bet for getting accurate information is to use an address verification web service. There are a handful of providers that offer this capability.
In the interest of full disclosure, I'm the founder of SmartyStreets. We have an address verification web service API that will validate and standardize your address to make sure it's real and allow you to get the primary/house number portion. You're more than welcome to contact me personally with questions.
/[^\d]+$/ will also match the same thing, except without using a capture group.
For future reference a great tool to help with regex is http://www.rubular.com/
