Extract relevant address from string? - validation

I am developing an address matching application using Google geocoding API.
The problem is that some of the addresses in the database I am trying to validate are something like:
ATTN: Mr. THOMAS WONG 2457 Yonge St., Toronto, ON, N2S 2V5, Canada
rather than
2457 Yonge St., Toronto, ON, N2S 2V5, Canada
The first string returns null results (because it starts with a person's name); the second one validates and returns a full, correct address.
My question is: What would be the right approach to this issue?
I am thinking of a way to extract only the relevant part from the address string (with some function) but maybe there are better ideas?
Thank you,
M.R.

I work at SmartyStreets and wrote the address extractor which we now offer with LiveAddress API. It's hard. There are a lot of assumptions you need to force yourself not to make, including "if the address starts with a number." (Sorry DwB -- there's a lot to consider.)
If you have US addresses, you may still find our tool useful (it's free to sign up and use, to a point). Here's another Stack Overflow post about the extraction utility: https://stackoverflow.com/a/16448034/1048862
The best way to do this would be to use an address validation service -- one that can validate delivery points and not just address ranges (which is most common, so be wary of claims to "address validation" when it's really just guessing within certain bounds).
Be aware, too, that Google does not validate addresses. It may standardize them, and will return results where the address would exist if it were real, and if it is actually valid, it's your lucky day.

If the desired part of the address always starts with a number, try this:
find the first digit in the string.
get a substring from the first digit to the end of the string.
you now have the address.
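A minimal sketch of those three steps in Python, assuming the wanted part really does begin at the first digit. The function name strip_leading_name is made up for illustration.

```python
import re


def strip_leading_name(raw: str) -> str:
    """Return the substring that starts at the first digit.

    If the string contains no digit at all, it is returned unchanged.
    """
    match = re.search(r"\d", raw)  # step 1: find the first digit
    if match is None:
        return raw
    return raw[match.start():]     # step 2: substring from that digit to the end


print(strip_leading_name(
    "ATTN: Mr. THOMAS WONG 2457 Yonge St., Toronto, ON, N2S 2V5, Canada"
))
# prints: 2457 Yonge St., Toronto, ON, N2S 2V5, Canada
```

Note how fragile the assumption is: an input such as "Jackie Blam, P.O. Box 78, Hootville, OH" is cut down to "78, Hootville, OH", which loses the "P.O. Box" part.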
In order to parse addresses, you need to know all possible formats.
Do you need to include:
Santa, North Pole.
The Queen, Great Britain
Captain Hootberry
Bob Goldenberry, rural route 7, MN
Jackie Blam, P.O. Box 78, Hootville, OH
For a comprehensive address parsing solution, you will need to provide several algorithms for different address formats then determine which algorithm to use based on the input.

Related

Can a person have null name?

I am writing an app that has a sign-up form. This article made me doubt everything I knew about human names. My question is: does a person's name necessarily have positive length? Or can I validate names in this way and be confident that I have not denied anyone their identity?
P.S.: one might ask why I am validating at all. The answer is that this is for a school project and proper validation is a part of the mark. The article above proves that a person's name can be pretty much any string of positive length, but I don't know if zero length is OK.
With all types of programming, you have to draw a distinction between what is meaningful in the real world, and what is meaningful for your software solution.
How the data is to be used will determine what type of validation is required.
For instance, if your software interfaces with a government API, and the government API requires a first name and surname, you should do the same.
If you're interacting with bank accounts, you may have a single string which represents the account name, which may or may not be a human name, but may have other constraints around length.
If the name is only to be used for display purposes, maybe there is no point to capture the name at all, and instead you should capture a preferred display name (which doesn't needlessly assume a certain number of name components).
When writing software, you should aim to make as few assumptions as possible, unless avoiding an assumption would increase the complexity of your software solution. If the software requires people to have non-empty names, then you should validate at the border that this is true.
In addition, if you were my student, you would have already lost marks for conflating null and an empty string. In this instance, null would represent that you lack data about the name, and an empty string would indicate that the user has specified that their name is empty.
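To make the distinction concrete, here is a small Python sketch in which None plays the role of null. The function name and the three labels are invented for illustration, and treating a whitespace-only string as empty is an assumption you may not want.

```python
from typing import Optional


def classify_name(name: Optional[str]) -> str:
    """Classify a submitted name as 'missing', 'empty' or 'present'."""
    if name is None:
        # No data: the field was never supplied at all.
        return "missing"
    if name.strip() == "":
        # Assumption: whitespace-only input counts as an explicitly empty name.
        return "empty"
    return "present"


print(classify_name(None))   # prints: missing
print(classify_name(""))     # prints: empty
print(classify_name("Ada"))  # prints: present
```

Whether "empty" is then rejected at the border is a decision that depends on how the name is used.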
Also, if you decide not to validate something, you should at least leave a comment to indicate that you thought of it. If you do something unusual, it's possible a future developer may come along and fix the "bug". In addition, this helps you avoid losing marks.

what address information should I collect when developing an international signup for a website

I have used Google to obtain an address from a postcode and the like before. My problem is that I want my website to have address fields such that anyone in any country can sign up properly and provide all necessary address information. I will include a feature to enter a postcode and obtain all other information automatically.
Is it reasonable for me to check the postcode and force a successful Google lookup before someone signs up? If so, I could just store the JSON string in the database as a blob or maybe inside a class. But I still need to decide what fields, such as street name, postcode or zip, and the like, to include. I'm not sure where to begin deciding what to include.
I think what I'm really asking is which fields correspond to which Google fields in general. I know the different administrative levels are different things in different countries :/
While I can't say what you should specifically do for Google, I can tell you what fields our customers use when they validate international addresses online. (Full Disclosure: I'm a programmer at SmartyStreets where we validate international addresses.)
While each country's mailing system is unique, there are a few major similarities that they all share. This element of commonality is what allows you to have people enter their address into a universal form and then validate the address, regardless of the country in question.
Address Line 1: This field is usually the house or building number and the street on which the building is located. Examples of this field include: 123 Main Street, Calle Proc. San Sebastián, 15, 1019 North 1300 West, etc.
Address Line 2: This field would include apartment or suite numbers.
Locality: The most common data entered for this is the city component of the address. For example: Paris, Hamburg, Johannesburg, etc.
Administrative Area: This is the state or province name or abbreviation. Examples of this would be Texas - TX, Alberta - AB, Firenze (Italy) - FI.
Postal Code (where available): Examples of this would be 90210 (Beverly Hills in California) or 84000 (Avignon in France).
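As a rough sketch, the five fields above could be stored like this in Python. The class and attribute names are illustrative only and are not taken from any SmartyStreets or Google API. Which fields are optional is also an assumption, based on the "where available" note above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InternationalAddress:
    address_line_1: str                         # house/building number and street
    locality: str                               # usually the city
    address_line_2: Optional[str] = None        # apartment or suite, if any
    administrative_area: Optional[str] = None   # state or province, if any
    postal_code: Optional[str] = None           # not available in every country


# Hypothetical sample record built from the examples in the list above.
sample = InternationalAddress(
    address_line_1="123 Main Street",
    locality="Beverly Hills",
    administrative_area="CA",
    postal_code="90210",
)
print(sample.address_line_2)  # prints: None
```

Keeping the optional fields as None, rather than as empty strings, records that the user never supplied them.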
While you can always add additional fields to give additional context to a software parser or interpreter, the above fields are the most common ones that you would use for international address validation. If you're not sure, you can test a non-US address for free. We offer extensive documentation that is both free and publicly visible to help better explain the nuances and idiosyncrasies of street and mailing addresses.

Regular Expression for Address/Zip/City&State

Anybody have an example of a regular expression that matches for address, zip, or [city,state]?
Update:
Admittedly, this is a weak question because I don't have enough information regarding user behavior at this point to really qualify the parameters of the problem. Here is what I'm trying to do though:
Create a search function that, depending on what information has been entered, chooses one of two divergent paths: the first being address proximity search and the second being organization name search.
It is proving a difficult problem to solve, so any input out there, besides .* (okay, okay I deserved that) would be much appreciated.
Check out geocoder (http://www.rubygeocoder.com/). It will get lat/long from text input. What you could do for your search is first try to match organization names, and then try to match locations.
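The "organization names first, then locations" idea might look something like this. The sketch is in Python rather than Ruby, and everything in it is a stand-in: ORGANIZATIONS is a hypothetical list, and fake_geocode is a stub in place of a real geocoding call.

```python
from typing import Callable, Optional, Sequence, Tuple

Coordinates = Tuple[float, float]

# Hypothetical data; a real application would query its own database.
ORGANIZATIONS = ["Acme Hardware", "Toronto Food Bank"]


def fake_geocode(text: str) -> Optional[Coordinates]:
    """Stub geocoder: only 'knows' one place. Replace with a real service."""
    return (43.65, -79.38) if "yonge" in text.lower() else None


def route_search(
    query: str,
    organizations: Sequence[str],
    geocode: Callable[[str], Optional[Coordinates]],
):
    """Try an organization-name match first, then fall back to a location."""
    needle = query.strip().lower()
    if not needle:
        return ("none", None)
    matches = [name for name in organizations if needle in name.lower()]
    if matches:
        return ("organization", matches)
    coords = geocode(query)
    if coords is not None:
        return ("proximity", coords)
    return ("none", None)


print(route_search("acme", ORGANIZATIONS, fake_geocode))
print(route_search("2457 Yonge St", ORGANIZATIONS, fake_geocode))
```

The order matters: a substring match on names is cheap and local, while geocoding usually costs a network request, so it makes sense to try it second.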
Luckily, Google figured out how to do proximity searches a while ago.

How to detect vulnerable/personal information in CVs programmatically (by means of syntax analysis/parsing etc...)

To make matters more specific:
How to detect people names (seems like simple case of named entity extraction?)
How to detect addresses: my best guess - find postcode (regexes); country and town names and take some text around them.
As for phones, emails - they could be probably caught by various regexes + preprocessing
Don't care about education/working experience at this point
Reasoning:
In order to build a fulltext index on resumes all vulnerable information should be stripped out from them.
P.S. any 3rd party APIs/services won't do as a solution.
The problem you're interested in is information extraction from semi-structured sources. http://en.wikipedia.org/wiki/Information_extraction
I think you should download a couple of research papers in this area to get a sense of what can be done and what can't.
I feel it can't be done by a machine.
Every other resume will have a different format and layout.
The best you can do is to design an internal format and manually copy every resume content in there. Or ask candidates to fill out your form (not many will bother).
I think that the problem should be broken up into two search domains:
Finding information relating to proper names
Finding information that is formulaic
Firstly, the information relating to proper names could probably be best found by searching for items that are either grammatically important or significant. For example, English capitalizes only the first word of the sentence and proper nouns. For the grammatical rules you could look for all of the words that have the first letter capitalized and check each one against a database that contains the word and the type [i.e. Bob - Name, Elon - Place, England - Place].
Secondly: Information that is formulaic. This is more about the email addresses, phone numbers, and physical addresses. All of these have specific formats that don't change. Use a regex and use an algorithm to detect the quality of the matches.
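For the formulaic part, a minimal Python sketch of regex-based redaction could look like this. Both patterns are deliberately loose illustrations, not complete email or phone grammars, and the placeholder tokens are made up.

```python
import re

# Loose illustrative patterns; real-world formats vary far more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace anything that looks like an email or phone number."""
    text = EMAIL_RE.sub("[EMAIL]", text)   # emails first, so digits inside
    return PHONE_RE.sub("[PHONE]", text)   # them are never seen as phones


print(redact("Contact: jane.doe@example.com, +1 (416) 555-0199"))
# prints: Contact: [EMAIL], [PHONE]
```

This is where the quality check comes in: the phone pattern also matches a year range such as "2010-2015", so matches need to be scored or filtered before anything is stripped.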
Watch out:
The grammatical rules change based on language. German capitalizes EVERY noun. It might be best to detect the language of the document prior to applying your rules. Also, another issue with this [and my resume sometimes] is how it is designed. If the resume was designed with something other than a text editor [designer tools] the text may not line up, or be in a bitmap format.
TL;DR Version: NLP techniques can help you a lot.

Algorithms recognizing physical address on a webpage

What are the best algorithms for recognizing structured data on an HTML page?
For example Google will recognize the address of home/company in an email, and offers a map to this address.
A named-entity extraction framework such as GATE has at least tackled the information extraction problem for locations, assisted by a gazetteer of known places to help resolve common issues. Unless the pages were machine generated from a common source, you're going to find regular expressions a bit weak for the job.
If you have the markup proper, and not just the text from the page, I second the Beautiful Soup suggestion above. In particular, the address tag should provide the lowest of low-hanging fruit. Also look into the adr microformat. I'd only fall back to regexes if the first two didn't pull enough info or I didn't have the necessary data to look for the first two.
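Here is a sketch of pulling the text out of address tags. It uses html.parser from the Python standard library instead of Beautiful Soup, purely so that it runs without third-party packages. The class and function names are invented for the example.

```python
from html.parser import HTMLParser
from typing import List


class AddressTagExtractor(HTMLParser):
    """Collect the text content of every <address> element."""

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0               # > 0 while inside an <address> element
        self._chunks: List[str] = []
        self.addresses: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "address":
            self._depth += 1
        elif tag == "br" and self._depth:
            self._chunks.append(" ")  # treat a line break as a space

    def handle_endtag(self, tag):
        if tag == "address" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Join the pieces and collapse runs of whitespace.
                self.addresses.append(" ".join("".join(self._chunks).split()))
                self._chunks = []

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_addresses(html: str) -> List[str]:
    parser = AddressTagExtractor()
    parser.feed(html)
    parser.close()
    return parser.addresses


page = "<p>Visit us:</p><address>2457 Yonge St.<br>Toronto, ON</address>"
print(extract_addresses(page))
# prints: ['2457 Yonge St. Toronto, ON']
```

Keep in mind that many pages use the address tag for contact details in general, so its contents still need checking.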
If you also have to handle international addresses, you're in for a world of headaches; international address formats are amazingly varied.
I'd guess that Google takes a two step approach to the problem (at least that's what I would do). First they use some fairly general search pattern to pick out everything that could be an address, and then they use their map database to look up that string and see if they get any matches. If they do it's probably an address if they don't it probably isn't. If you can use a map database in your code that will probably make your life easier.
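A toy version of that two-step idea in Python is below. The pattern only covers simple "number + capitalized words + street suffix" strings, and KNOWN_STREETS is a hypothetical stand-in for a real map database. None of this is a claim about how Google actually does it.

```python
import re
from typing import List

# Step 1: a fairly general pattern that over-collects candidates.
CANDIDATE_RE = re.compile(
    r"\b(?P<number>\d+)\s+"
    r"(?P<street>(?:[A-Z][a-z]+\s+)+"
    r"(?:Street|St|Avenue|Ave|Boulevard|Blvd|Road|Rd))\b"
)

# Step 2 data: hypothetical stand-in for a map database lookup.
KNOWN_STREETS = {"yonge st", "main street"}


def find_candidates(text: str) -> List[str]:
    """Everything that merely looks like a street address."""
    return [m.group(0) for m in CANDIDATE_RE.finditer(text)]


def find_addresses(text: str) -> List[str]:
    """Candidates whose street name is confirmed by the lookup."""
    confirmed = []
    for m in CANDIDATE_RE.finditer(text):
        street = " ".join(m.group("street").lower().split())
        if street in KNOWN_STREETS:
            confirmed.append(m.group(0))
    return confirmed


text = "Meet at 2457 Yonge St or at 99 Imaginary Road."
print(find_candidates(text))  # prints: ['2457 Yonge St', '99 Imaginary Road']
print(find_addresses(text))   # prints: ['2457 Yonge St']
```

The pattern is allowed to be sloppy because the lookup does the real filtering. That is the point of splitting the work into two steps.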
Unless you can limit the geographic location of the addresses, I'm guessing that it's pretty much impossible to identify a string as an address just by parsing it, simply due to the huge variation of address formats used around the world.
Do not use regular expressions to parse the HTML itself. Use an existing HTML parser; in Python, for example, I strongly recommend BeautifulSoup. You can still apply a regular expression to the text of the HTML elements BeautifulSoup grabs.
If you do it with your own regexes, you not only have to worry about finding the data you require, you also have to worry about things like invalid HTML and lots of other very non-obvious problems you'll stumble over.
What you're asking is really quite a hard problem if you want to get it perfect. While a simple regexp will get it mostly right most of the time, writing one that will get it exactly right every time is fiendishly hard. There are plenty of strange corner cases, and in several cases there is no single unambiguous answer. Most web sites that I've seen do a pretty bad job handling all but the simplest URLs.
If you want to go down the regexp route your best bet is probably to check out the sourcecode of
http://metacpan.org/pod/Regexp::Common::URI::http
Again, regular expressions should do the trick.
Because of the wide variety of addresses, you can only guess if a string is an address or not by an expression like "(number), (name) Street|Boulevard|Main", etc
You can consider looking into some Firefox extensions that aim to map addresses found in text, to see how they work.
You can check this USA extraction example http://code.google.com/p/graph-expression/wiki/USAAddressExtraction
It depends on your requirements.
For email and contact details, regex is more than enough.
For addresses, regex alone will not help. Think about NLP (NER) and POS tagging.
For finding people-related information, you can't do anything without NER.
If you need information like paragraphs, get the contents by using tags.

Resources