Algorithms recognizing physical address on a webpage - algorithm

What are the best algorithms for recognizing structured data on an HTML page?
For example Google will recognize the address of home/company in an email, and offers a map to this address.

A named-entity extraction framework such as GATE has at least tackled the information extraction problem for locations, assisted by a gazetteer of known places to help resolve common issues. Unless the pages were machine generated from a common source, you're going to find regular expressions a bit weak for the job.

If you have the markup proper—and not just the text from the page—I second the Beautiful Soup suggestion above. In particular, the address tag should provide the lowest of low-hanging fruit. Also look into the adr microformat. I'd only falll back to regexes if the first two didn't pull enough info or I didn't have the necessary data to look for the first two.

If you also have to handle international addresses, you're in for a world of headaches; international address formats are amazingly varied.

I'd guess that Google takes a two step approach to the problem (at least that's what I would do). First they use some fairly general search pattern to pick out everything that could be an address, and then they use their map database to look up that string and see if they get any matches. If they do it's probably an address if they don't it probably isn't. If you can use a map database in your code that will probably make your life easier.
Unless you can limit the geographic location of the addresses, I'm guessing that it's pretty much impossible to identify a string as an address just by parsing it, simply due to the huge variation of address formats used around the world.

Do not use regular expressions. Use an existing HTML parser, for example in Python I strongly recommend BeautifulSoup. Even if you use a regular expression to parse the HTML elements BeautifulSoup grabs.
If you do it with your own regexs, you not only have to worry about finding the data you require, you have to worry about things like invalid HTML, and lots of other very non-obvious problems you'll stumble over..

What you're asking is really quite a hard problem if you want to get it perfect. While a simple regexp will get it mostly right most of them time, writing one that will get it exactly right everytime is fiendishly hard. There are plenty of strange corner cases and in several cases there is no single unambiguous answer. Most web sites that I've seen to a pretty bad job handling all but the simplest URLs.
If you want to go down the regexp route your best bet is probably to check out the sourcecode of
http://metacpan.org/pod/Regexp::Common::URI::http

Again, regular expressions should do the trick.
Because of the wide variety of addresses, you can only guess if a string is an address or not by an expression like "(number), (name) Street|Boulevard|Main", etc
You can consider looking into some firefox extensions which aim to map addresses found in text to see how they work

You can check this USA extraction example http://code.google.com/p/graph-expression/wiki/USAAddressExtraction

It depends upon your requirement.
for email and contact details regex is more than enough.
For addresses regex alone will not help. Think about NLP(NER) & POS tagging.
For finding people related information you cant do anything without NER.
If you need information like paragraphs get the contents by using tags.

Related

Internationalisation - displaying gendered adjectives

I'm currently working on an internationalisation project for a large web application - initially we're just implementing French but more languages will follow in time. One of the issues we've come across is how to display adjectives.
Let's take "Active" as an example. When we received translations back from the company we're using, they returned "Actif(ve)", as English "Active" translates to masculine "Actif" or feminine "Active". We're unsure of how to display this, and wondered if there are any well established conventions in the web development world.
As far as I see it there are three possible scenarios:
We know at development time which noun a given adjective is referring to. In this case we can determine and use the correct gender.
We're referring to a user, either directly ("you") or in the third person. Short of making every user have a gender, I don't see a better approach than displaying both, i.e. "Actif(ve)"
We are displaying the adjective in isolation, not knowing which noun it's referring to. For example in a table of data, some rows might be dealing with a masculine entity, some feminine.
Scenarios 2 and 3 seem to be the toughest ones. Does anyone have any experience handling these issues? Any tips would be appreciated!
This is complex, because we cannot imagine all the cases, and there is risk to go in "opinion based" answer, so I keep it short and generic.
Usually I prefer to give context in translation (for translator), e.g. providing template: _("active {user_name}" (so also the ordering will be correct if languages want different ordering).
Then you may need to change code and template into _("active {first_name_feminine}") and _("active {first_name_masculine}") (and possibly more for duals, trials, plurals, collectives, honorific, etc.). Note: check that the translator will not mangle the {} and the string inside. Usually you need specific export/import scripts. Or I add a note inside the string, and I quickly translate into English removing the note to the translator). Also this can be automated (be creative on using special Unicode characters which should not be used in normal text, to delimit such text).
But if you cannot know the gender, the Actif(ve) may be the polite version used in such language. You need a native speaker test, and changes back and forth.

Parsing HTML in AppleScript

What's a good way to parse HTML in AppleScript?
I haven't dabbled in AppleScript in quite some time, and even when I did it was very minimal and uninvolved, so I don't really think naturally in the language quite yet. But I need to do some string manipulation and parse some HTML (basically some simple screen scraping).
Naturally, I'd like to avoid common pitfalls of HTML parsing. However, this is a temporary script and doesn't need to be particularly robust or supportable. I really just need to scrape specific substrings (from a known starting substring to the next known character) into a file.
I've done plenty of string manipulation in C# and similar languages, but AppleScript is an interesting change of pace to say the least. Can somebody point me to some good resources (Google searches on this subject seem to have a high noise-to-signal ratio), or help me out with some sample code snippets?
The ultimate goal of what I'm doing is to take a pre-determined list of pages, open each one in Safari (I'm doing everything through tell application "Safari"), parse out links which fit a certain pattern, and store all of those links in a file. Then go through that file, open each of those links, parse out more links which fit another pattern, and store all of those links in a file.
(The site is actually owned by someone we're working with, so don't worry about me violating any terms of service or anything like that. But for reasons outside the scope of this question, I'm doing some page scraping in AppleScript.)
I can't say enough good things about Matt Neuburg's AppleScript: the Definitive Guide. Without a doubt the most complete documentation of AppleScript ever done. Matt's also one of my favorite tech writers.
I would also check out this article. It contains a tutorial on how to do this; the example provided there parses HTML data from only one source, but I think it's worth looking at.

What is a good approach for extracting keywords from user-submitted text?

I'm building a site that allows users to make sense of a debate by graphically representing arguments for and against a particular issue. (Wrangl)
I'd like to categorise these debates so they are more easily found and connected. I don't want to irritate the person creating the debate by asking them to add tags and categories before they see any benefit, so I'm looking at a way of automatically extracting keywords.
What's a good approach for taking the debate's title and description (and possibly the content of the arguments themselves once there are some) to pull out, say, ten strong keywords that could be used as metadata to connect similar debates together, or even as the content of the "meta" keywords tag in the head of the HTML page where the debate is viewable. Eg. Datamapper vs ActiveRecord
The site is coded in Ruby with Sinatra, using DataMapper for data storage. I'm ideally looking for something which will work on Heroku (I don't have a way of writing files to disk dynamically), and I'd consider a web service, an API or ideally a Ruby gem.
Maybe you can use TextAnalyzer.
I understand that you're wanting to find an easy way of achieving this, I've recently dived into the world of NLP (Natural Language Processing) and Text-mining and its a daunting process of which most went far above my head.
Although i managed to code some functionality that resembles what you're looking for, though I did it in PHP. What i would suggest, that if you want it tailored to your project (Wrangl) then do it yourself.
Using the Porter stemming algorithm which I'm sure there will be Ruby code for.
Ruby Porter stemmer
You can try the salsaAPI to automatically extract keywords and categorize the debates!

Why do people use plain english as translation placeholders?

This may be a stupid question, but here goes.
I've seen several projects using some translation library (e.g. gettext) working with plain english placeholders. So for example:
_("Please enter your name");
instead of abstract placeholders (which has always been my instinctive preference)
_("error_please_enter_name");
I have seen various recommendations on SO to work with the former method, but I don't understand why. What I don't get is what do you do if you need to change the english wording? Because if the actual text is used as the key for all existing translations, you would have to edit all the translations, too, and change each key. Or don't you?
Isn't that awfully cumbersome? Why is this the industry standard?
It's definitely not proper normalization to do it this way. Are there massive advantages to this method that I'm not seeing?
Yes, you have to alter the existing translation files, and that is a good thing.
If you change the English wording, the translations probably need to change, too. Even if they don't, you need someone who speaks the other language to check.
You prep a new version, and part of the QA process is checking the translations. If the English wording changed and nobody checked the translation, it'll stick out like a sore thumb and it'll get fixed.
The main language is already existent: you don't need to translate it.
Translators have better context with a real sentence than vague placeholders.
The placeholders are just the keys, it's still possible to change the original language by creating a translation for it. Because when the translation doesn't exists, it uses the placeholder as the translated text.
We've been using abstract placeholders for a while and it was pretty annoying having to write everything twice when creating a new function. When English is the placeholder, you just write the code in English, you have meaningful output from the start and don't have to think about naming placeholders.
So my reason would be less work for the developers.
I like your second approach. When translating texts you always have the problem of homonyms. Like 'open' can mean a state of a window but also the verb to perform the action. In other languages these homonyms may not exist. That's why you should be able to add meaning to your placeholders. Best approach is to put this meaning in your text library. If this is not possible on the platform the framework you use, it might be a good idea to define a 'development language'. This language will add meaning to the text entries like: 'action_open' and 'state_open'. you will off course have to put extra effort i translating this language to plain english (or the language you develop for). I have put this philosophy in some large projects and in the long run this saves some time (and headaches).
The best way in my opinion is keeping meaning separate so if you develop your own translation library or the one you use supports it you can do something like this:
_(i18n("Please enter your name", "error_please_enter_name"));
Where:
i18n(text, meaning)
Interesting question. I assume the main reason is that you don't have to care about translation or localization files during development as the main language is in the code itself.
Well it probably is just that it's easier to read, and so easier to translate. I'm of the opinion that your way is best for scalability, but it does just require that extra bit of effort, which some developers might not consider worth it... and for some projects, it probably isn't.
There's a fallback hierarchy, from most specific locale to the unlocalised version in the source code.
So French in France might have the following fallback route:
fr_FR
fr
Unlocalised. Source code.
As a result, having proper English sentences in the source code ensures that if a particular translation is not provided for in step (1) or (2), you will at least get a proper understandable sentence than random programmer garbage like “error_file_not_found”.
Plus, what do you do if it is a format string: “Sorry but the %s does not exist” ? Worse still: “Written %s entries to %s, total size: %d” ?
Quite old question but one additional reason I haven't seen in the answers yet:
You could end up with more placeholders than necessary, thus more work for translators and possible inconsistent translations. However, good editors like Poedit or Gtranslator can probably help with that.
To stick with your example:
The text "Please enter your name" could appear in a different context in a different template (that the developer is most likely not aware of and shouldn't need to be). E.g. it could be used not as an error but as a prompt like a placeholder of an input field.
If you use
_("Please enter your name");
it would be reusable, the developer can be unaware of the already existing key for an error message and would just use the same text intuitively.
However, if you used
_("error_please_enter_name");
in a previous template, developers wouldn't necessarily be aware of it and would make up a second key (most likely according to a predefined wording scheme to not end up in complete chaos), e.g.
_("prompt_please_enter_name");
which then has to be translated again.
So I think that doesn't scale very well. A pre-agreed wording scheme of suffixes/prefixes e.g. for contexts can never be as precise as the text itself I think (either too verbose or too general, beforehand you don't know and afterwards it's difficult to change) and is more work for the developer that's not worth it IMHO.
Does anybody agree/disagree?

How to detect vulnerable/personal information in CVs programmatically (by means of syntax analysis/parsing etc...)

To make matter more specific:
How to detect people names (seems like simple case of named entity extraction?)
How to detect addresses: my best guess - find postcode (regexes); country and town names and take some text around them.
As for phones, emails - they could be probably caught by various regexes + preprocessing
Don't care about education/working experience at this point
Reasoning:
In order to build a fulltext index on resumes all vulnerable information should be stripped out from them.
P.S. any 3rd party APIs/services won't do as a solution.
The problem you're interested in is information extraction from semi structured sources. http://en.wikipedia.org/wiki/Information_extraction
I think you should download a couple of research papers in this area to get a sense of what can be done and what can't.
I feel it can't be done by a machine.
Every other resume will have a different format and layout.
The best you can do is to design an internal format and manually copy every resume content in there. Or ask candidates to fill out your form (not many will bother).
I think that the problem should be broken up into two search domains:
Finding information relating to proper names
Finding information that is formulaic
Firstly the information relating to proper names could probably be best found by searching for items that are either grammatically important or significant. I.e. English capitalizes only the first word of the sentence and proper nouns. For the gramatical rules you could look for all of the words that have the first letter of the word capitalized and check it against a database that contains the word and the type [i.e. Bob - Name, Elon - Place, England - Place].
Secondly: Information that is formulaic. This is more about the email addresses, phone numbers, and physical addresses. All of these have a specific formats that don't change. Use a regex and use an algorithm to detect the quality of the matches.
Watch out:
The grammatical rules change based on language. German capitalizes EVERY noun. It might be best to detect the language of the document prior to applying your rules. Also, another issue with this [and my resume sometimes] is how it is designed. If the resume was designed with something other than a text editor [designer tools] the text may not line up, or be in a bitmap format.
TL;DR Version: NLP techniques can help you a lot.

Resources