Searching location in a sentence - algorithm

I'm working on a location extraction algorithm but haven't achieved anything considerable yet. For example, in this sentence:
Riders on the B and Q lines will get some relief from construction as stations reopen, and a major project will soon begin at the Dyckman Street station.
"Dyckman Street" is location information. How we extract this information from a given sentence. (I tried to extract the words from a sentence starting with a Capital letter and search it against a db having city names, but it doesn't work always).
From where i can find an algorithm to extract this information?
Thanks..

I remember having seen this library when I was playing with Named Entity Recognition.
This Google search might be a useful source of information as well.
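As a starting point (this is not the library the answer refers to, just one common option), here is a minimal Python sketch of what a Named Entity Recognition pass looks like with spaCy. The model name "en_core_web_sm" is an assumption about your setup; install it with `python -m spacy download en_core_web_sm`.

```python
# Minimal NER sketch: pull location-like entities out of a sentence with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_locations(sentence: str) -> list[str]:
    doc = nlp(sentence)
    # GPE = countries/cities/states, LOC = other locations, FAC = facilities
    return [ent.text for ent in doc.ents if ent.label_ in {"GPE", "LOC", "FAC"}]

print(extract_locations(
    "Riders on the B and Q lines will get some relief from construction as "
    "stations reopen, and a major project will soon begin at the Dyckman Street station."
))
```

The advantage over the capital-letter-plus-city-database approach is that the model uses the surrounding context, so multi-word names like "Dyckman Street" can be picked up as a single entity.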

There are also a number of web services designed to parse geo-locations from text. For example Yahoo's PlaceMaker service at http://developer.yahoo.com/geo/placemaker/

Related

Optimize Google Places API Query for Prominent Parks, Mountains, Conservation Areas

First post on Stack Overflow.
I am using the Google API to sort images taken while traveling into organized folders, append tags and rename files with relevant information. I have my code working well but am not always happy with the results. I want to be able to focus my query results on major tourist attractions such as National Parks, Ski Resorts, Beaches, etc. The problem I am finding is that the prominence "rankby" variable and the "radius" are not giving satisfactory results. Here is a typical query for Zion National Park.
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=37.269486111111,-112.948141666667&rankby=prominence&radius=50000&type=natural_feature,tourist_attraction,point_of_interest&keyword=&key=MYAPIKEY
The most prominent result is Springdale, which is the town where you enter the park. Zion National Park is listed much further down in the results. What my code does is take the lat and lon extracted from EXIF data and make a Google API Nearby Search request to find the Place ID for where the photo was taken. It then makes another API request for Place Details, using the place_id provided by the previous step, to cut down on the information I need to parse.
https://maps.googleapis.com/maps/api/place/details/json?place_id=ChIJ8R5RCzaNyoARegi3rqVkstk&fields=name,address_component&key=MYAPIKEY
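For reference, here is a minimal Python sketch of that two-step flow (Nearby Search to get a place_id, then Place Details), using the requests library. MYAPIKEY is a placeholder, error handling is omitted, and the parameters simply mirror the URLs above.

```python
import requests

API_KEY = "MYAPIKEY"  # placeholder

def nearby_place_id(lat, lon, radius=50000):
    # Step 1: Nearby Search, ranked by prominence (same parameters as the URL above)
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/place/nearbysearch/json",
        params={
            "location": f"{lat},{lon}",
            "rankby": "prominence",
            "radius": radius,
            "type": "natural_feature,tourist_attraction,point_of_interest",
            "key": API_KEY,
        },
    )
    results = resp.json().get("results", [])
    return results[0]["place_id"] if results else None

def place_details(place_id):
    # Step 2: Place Details, restricted to the fields actually needed
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/place/details/json",
        params={"place_id": place_id, "fields": "name,address_component", "key": API_KEY},
    )
    return resp.json().get("result", {})

pid = nearby_place_id(37.269486111111, -112.948141666667)
if pid:
    print(place_details(pid).get("name"))
```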
I can force the nearby search to return a National Park by searching against "National Park" in the keywords variable but that limits my project to only being able to provide National Park results since the keywords field can only accept one string.
I would like part of my query to be able to return the most prominent tourist attraction at the general level, i.e. Zion National Park, Yosemite National Park, etc., so I can sort images into the general name folders, and another part of the query to provide the exact location, i.e. I am on this trail or at this lookout. The problem is that the Google API sees these specific locations ("Trail", "Lookout") as tourist attractions, parks, establishments, etc. as well, so it chooses those first.
What I need help with is figuring out whether there is a better way to structure my query to return the high-level name of the major park. From my understanding, the types field only searches on the first type even if there are more in the list, and the keywords field can only accept one string as well, making it impossible for one phrase to capture all major destinations at a high level.
Perhaps it needs to be done with more queries but I am trying to limit the number of queries to stay inside the free quota. Maybe it will just take a long time to fully sort my files.
I have read through and implemented the Google API structure. I'm hoping someone can provide a more detailed query structure, or a method to parse out truly prominent locations rather than relying on Google's interpretation of prominence, which can be affected by user ratings, etc. and is not always accurate.
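One possible workaround (my own suggestion, not something from the question): request a full page of Nearby Search results and post-filter them locally, preferring entries whose types list includes "park" or whose name contains "National Park". The field names match the Nearby Search JSON; the matching rule itself is only illustrative.

```python
# Post-filter sketch: prefer an obvious park from a page of Nearby Search
# results instead of trusting the prominence order.
# `results` is the "results" list from the Nearby Search response.
def pick_major_park(results):
    for r in results:
        if "park" in r.get("types", []) or "National Park" in r.get("name", ""):
            return r
    return results[0] if results else None  # fall back to Google's top result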

Bing/Google/Flickr API: how would you find an image to go along with each of 150,000 Japanese sentences?

I'm doing a part-of-speech & morphological analysis project for Japanese sentences. Each sentence will have its own webpage. To make this page more visual, I want to show one picture which is somehow related to the sentence. For example, for the sentence "私は学生です" ("I'm a student"), relevant pictures would be pictures of a school, a Japanese textbook, students, etc. What I have: part-of-speech tagging for every word. My approach now: use 2-3 nouns from every sentence and retrieve the first image from the search results using the Bing Images API. Note: all the sentence processing up to this point was done in Java.
Have a couple of questions though:
1) Which is better (richer corpus & more powerful search) for searching nouns in Japanese: Google Images API, Bing Images API, Flickr API, etc.?
2) How do you select the most important noun from the sentence to use as the query for the image search engine, without doing complicated topic modeling, etc.?
Thanks!
Japanese WordNet has links to OpenClipart pictures. That could be another relevant source. They describe it in their paper called "Enhancing the Japanese WordNet".
I thought you would start by choosing any noun before は、が and を and giving these priority - probably in that order.
But that assumes that your part-of-speech tagging is good enough to get は=subject identified properly (as I guess you know that は is not always the subject marker).
I looked at a bunch of sample sentences here with this technique in mind and found it works as well as could be expected, except where none of those particles are used, which is fairly rare.
There are also sentences like this one, where you might have to consider looking for で and the noun before it when there is no を or は, because the word 人 (people) here really doesn't tell you anything about what's being said; without parsing the context properly, you don't even know whether the noun means a person or people.
毎年 交通事故で 多くの人が 死にます
(many people die in traffic accidents every year)
But basically, couldn't you implement a priority/fallback type system like this?
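Not from the answer itself, but here is a minimal Python sketch of that priority/fallback idea. It assumes the sentence is already tokenized into (surface, part-of-speech) pairs by your tagger; the tag names "noun" and "particle" are placeholders for whatever your tagger actually emits.

```python
# Particle-priority sketch: pick the noun directly before は, then が, を, で.
PARTICLE_PRIORITY = ["は", "が", "を", "で"]

def pick_query_noun(tokens):
    """tokens: list of (surface, pos) tuples in sentence order."""
    for particle in PARTICLE_PRIORITY:
        for i in range(1, len(tokens)):
            surface, pos = tokens[i]
            prev_surface, prev_pos = tokens[i - 1]
            if pos == "particle" and surface == particle and prev_pos == "noun":
                return prev_surface
    # fallback: first noun in the sentence, if any
    return next((s for s, p in tokens if p == "noun"), None)

tokens = [("毎年", "noun"), ("交通事故", "noun"), ("で", "particle"),
          ("多く", "noun"), ("の", "particle"), ("人", "noun"),
          ("が", "particle"), ("死に", "verb"), ("ます", "aux")]
# With this priority the example returns 人; as noted above, ranking で higher
# when there is no は/を would return 交通事故 instead.
print(pick_query_noun(tokens))
```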
BTW I hope your sentences all use kanji, or when you see はし (in one of the sentences linked to) you won't know whether to show a bridge or chopsticks - and showing the wrong one will probably not be good.

How to detect vulnerable/personal information in CVs programmatically (by means of syntax analysis/parsing etc...)

To make the matter more specific:
How to detect people names (seems like simple case of named entity extraction?)
How to detect addresses: my best guess is to find postcodes (regexes) plus country and town names, and take some text around them.
As for phones and emails - they could probably be caught by various regexes plus preprocessing.
Don't care about education/working experience at this point
Reasoning:
In order to build a full-text index on resumes, all vulnerable information should be stripped out of them.
P.S. any 3rd party APIs/services won't do as a solution.
The problem you're interested in is information extraction from semi-structured sources. http://en.wikipedia.org/wiki/Information_extraction
I think you should download a couple of research papers in this area to get a sense of what can be done and what can't.
I feel it can't be done by a machine.
Every other resume will have a different format and layout.
The best you can do is to design an internal format and manually copy every resume's content into it. Or ask candidates to fill out your form (not many will bother).
I think that the problem should be broken up into two search domains:
Finding information relating to proper names
Finding information that is formulaic
Firstly, information relating to proper names could probably best be found by searching for items that are grammatically significant, i.e. English capitalizes only the first word of a sentence and proper nouns. For the grammatical rules you could look for all of the words whose first letter is capitalized and check each against a database that contains the word and its type [e.g. Bob - Name, Elon - Place, England - Place].
Secondly: information that is formulaic. This covers email addresses, phone numbers, and physical addresses. All of these have specific formats that don't change. Use regexes and an algorithm to assess the quality of the matches.
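To make the two domains concrete, here is a rough Python sketch (my own, not production-grade): regexes for the formulaic items, plus a capitalized-word check against a small name gazetteer standing in for the database mentioned above.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
UK_POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")

KNOWN_NAMES = {"Bob", "Alice", "John"}  # placeholder for a real gazetteer

def find_sensitive_spans(text):
    spans = []
    spans += [("email", m.group()) for m in EMAIL_RE.finditer(text)]
    spans += [("phone", m.group()) for m in PHONE_RE.finditer(text)]
    spans += [("postcode", m.group()) for m in UK_POSTCODE_RE.finditer(text)]
    # capitalized words checked against the name gazetteer
    for word in re.findall(r"\b[A-Z][a-z]+\b", text):
        if word in KNOWN_NAMES:
            spans.append(("name", word))
    return spans

print(find_sensitive_spans(
    "Contact Bob at bob.smith@example.com or +44 20 7946 0958, SW1A 1AA"
))
```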
Watch out:
The grammatical rules change based on language: German capitalizes EVERY noun, so it might be best to detect the language of the document prior to applying your rules. Also, another issue with this [as with my own resume sometimes] is how the resume is designed. If it was created with something other than a text editor [designer tools], the text may not line up, or it may even be in a bitmap format.
TL;DR Version: NLP techniques can help you a lot.

How does Google know that if I type in redflower.jpg I mean Red Flower?

I'm curious what programming term or methodology is used when Google shows you the "did you mean" link for a word that is made up of multiple words.
For example, if I type in "redflower.jpg", it knows to break that up into "Red Flower".
Is there a common paradigm for doing that sort of operation? Would a Lucene search give you that?
thanks!
If Google does not see a lot of matching results for redflower.jpg, it might then try to cut the word into multiple words until it finds a lot of matching results.
It might also recognize the image extension (.jpg) and then try to find images with a similar name.
If I had to make an algorithm like this, I would use a huge EXISTING database (either a dictionary or a search engine) and then try what I said at the beginning of my post.
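Here is a minimal Python sketch of that "cut the word into dictionary words" idea, with a tiny word set standing in for the huge existing database; a real implementation would also score competing splits (e.g. by word frequency).

```python
# Stand-in for a real dictionary or search-engine vocabulary.
WORDS = {"red", "flower", "flowers"}

def segment(term, words=WORDS):
    """Return one split of `term` into dictionary words, or None."""
    if term in words:
        return [term]
    for i in range(len(term) - 1, 0, -1):   # prefer longer first words
        head, tail = term[:i], term[i:]
        if head in words:
            rest = segment(tail, words)
            if rest is not None:
                return [head] + rest
    return None

query = "redflower.jpg"
name = query.rsplit(".", 1)[0]   # strip the recognized extension
print(segment(name))             # -> ['red', 'flower']
```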
Perhaps they look at what other people do after they have searched for redflower.jpg? Maybe a number of people searched for "redflower.jpg", didn't click on any links, and then searched for "Red Flower" and found some results worth clicking on.
Of course they would have to take into account that the queries are similar (contain matching strings), otherwise some strange results might appear.

Programmatically find common European street names

I am in the middle of designing a web form for German and French users. Within this form, the users would have to type street names several times.
I want to minimize the annoyance to the user, and offer autocomplete feature based on common French and German street names.
Any idea where I can find a royalty-free list?
Would your users have to type the same street name multiple times? Because you could easily prevent this by coding something that prefilled the fields.
Another option could be to use your user database as a resource. Query it for all the available street names entered by your existing users and use that to generate suggestions.
Of course this would only work if you have a considerable number of users.
[EDIT] You could have a look at OpenStreetMap with their Planet.osm dumps (or have a look here for a dump containing data for just Europe). That is basically the OSM database with all the map information they have, including street names. It's all in an XML format, and streets seem to be stored as Ways. There are tools (e.g. Osmosis) to extract the data and put it into a database, or you could write something to plough through the data and filter out the street names for your database.
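As a rough sketch of what "ploughing through the data" could look like in Python, this streams an OSM XML extract and collects the names of ways tagged as highways. The file name is a placeholder, and for the compressed/PBF dumps you would first need to convert or use a dedicated parser.

```python
import xml.etree.ElementTree as ET

def street_names(osm_xml_path):
    names = set()
    for _, elem in ET.iterparse(osm_xml_path, events=("end",)):
        if elem.tag == "way":
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            # roads carry a "highway" tag; the street name is in "name"
            if "highway" in tags and "name" in tags:
                names.add(tags["name"])
            elem.clear()  # keep memory bounded on big dumps
    return names

# names = street_names("europe.osm")
```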
Start with http://en.wikipedia.org/wiki/Category:Streets_in_Germany and http://en.wikipedia.org/wiki/Category:Streets_in_France. You may want to verify the Wikipedia copyright isn't more protective than would be suitable for your needs.
Edit (merged from my own comment): Of course, to answer the "programmatically" part of your question: figure out how to spider and scrape those Wikipedia category pages. The polite thing to do would be to cache it, rather than hitting it every time you need to get the street list; refreshing once every month or so should be sufficient, since the information is unlikely to change significantly.
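As an alternative to scraping the category HTML (my substitution for that step, with the caching advice above still applying), the MediaWiki API can list category members directly:

```python
import requests

def category_members(category, lang="en"):
    url = f"https://{lang}.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",
        "format": "json",
    }
    titles, cont = [], {}
    while True:
        data = requests.get(url, params={**params, **cont}).json()
        titles += [m["title"] for m in data["query"]["categorymembers"]]
        if "continue" not in data:
            return titles
        cont = data["continue"]  # follow API continuation for long categories

# streets = category_members("Category:Streets_in_Germany")
```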
You could start by pulling names via the Google API (just find the lat/long outer bounds - e.g. of Paris - and work towards the center) - but since Google limits API use, it would probably take a very long time.
I had once contacted City of Bratislava about the street names list and they sent it to me as XLS. Maybe you could try doing that for your preferred cities.
I like Tom van Enckevort's suggestion, but I would be a little more specific than just pointing at the Planet.osm links, because most of them require the use of some tool to deal with the supported formats (PBF, OSM XML, etc.).
In fact, take a look at the following link
http://download.gisgraphy.com/openstreetmap/
The files there are all in .txt format and if it's only the street names that you want to use, just extract the second field (name) and you are done.
As an FYI, I didn't have any use for the French files in my project, but mining the German files resulted (after normalization) in a little more than 380K unique entries (~6 MB in size).
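A small Python sketch of that field extraction, assuming the dump files are tab-separated with the street name in the second column (check the dump's documentation to confirm the layout):

```python
def unique_street_names(path):
    names = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 1 and fields[1]:
                names.add(fields[1].strip())   # second field = name
    return sorted(names)

# print(len(unique_street_names("DE.txt")))
```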
#dusoft might be onto something - maybe someone at a government level can help? I don't think that a simple list of street names can be copyrighted, nor that any royalties could be charged for it. If that is the case, maybe you could even scrape some mapping data from something like a TomTom?
The "Deutsche Post" offers a list with all street names in Germany:
http://www.deutschepost.de/dpag?xmlFile=link1015590_3877
They don't mention the price, but I reckon it's not for free.
