I am in the middle of designing a web form for German and French users. Within this form, the users would have to type street names several times.
I want to minimize the annoyance to the user, and offer an autocomplete feature based on common French and German street names.
Any idea where I can find a royalty-free list?
Would your users have to type the same street name multiple times? Because you could easily prevent this by coding something that prefilled the fields.
Another option could be to use your user database as a resource. Query it for all the available street names entered by your existing users and use that to generate suggestions.
Of course this would only work if you have a considerable number of users.
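To make the "use your own user database" idea concrete, here is a minimal sketch with sqlite3. The `addresses` table and its `street` column are made up for illustration; your schema will differ.

```python
import sqlite3

def suggest_streets(conn, prefix, limit=10):
    """Return the most frequently entered street names that start with prefix."""
    # Escape LIKE wildcards so the user's input is matched literally.
    escaped = (prefix.replace("\\", "\\\\")
                     .replace("%", "\\%")
                     .replace("_", "\\_"))
    rows = conn.execute(
        "SELECT street, COUNT(*) AS uses FROM addresses "
        "WHERE street LIKE ? ESCAPE '\\' "
        "GROUP BY street ORDER BY uses DESC, street LIMIT ?",
        (escaped + "%", limit),
    )
    return [street for street, _ in rows]

# Hypothetical data standing in for real user input.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (street TEXT)")
conn.executemany(
    "INSERT INTO addresses VALUES (?)",
    [("Hauptstraße",), ("Hauptstraße",), ("Hauptplatz",), ("Rue de Rivoli",)],
)
print(suggest_streets(conn, "Haupt"))
```

Ordering by how often a name was entered puts the most common streets at the top of the suggestion list.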
[EDIT] You could have a look at OpenStreetMap with their Planet.osm dumps (or have a look here for a dump containing data for just Europe). That is basically the OSM database with all the map information they have, including street names. It's all in an XML format and streets seem to be stored as Ways. There are tools (e.g. Osmosis) to extract the data and put it into a database, or you could write something to plough through the data and filter out the street names for your database.
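If you go the "write something yourself" route, a rough sketch in Python could look like the following. It assumes the usual OSM XML layout, where a way carries `<tag k="..." v="..."/>` children, and it treats any way with both a `highway` and a `name` tag as a street. Whether that filter is right for your data is an assumption worth checking.

```python
import io
import xml.etree.ElementTree as ET

def street_names(osm_file):
    """Collect the distinct names of ways tagged as roads in an OSM XML file."""
    names = set()
    # iterparse streams the file, so the whole dump never has to fit in memory.
    for _, elem in ET.iterparse(osm_file, events=("end",)):
        if elem.tag == "way":
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            if "highway" in tags and "name" in tags:
                names.add(tags["name"])
        if elem.tag in ("node", "way", "relation"):
            elem.clear()  # drop children that have already been processed
    return names

# Tiny hand-written sample in place of a real dump.
sample = io.BytesIO(
    b'<osm>'
    b'<way id="2"><nd ref="1"/><tag k="highway" v="residential"/>'
    b'<tag k="name" v="Rue de Rivoli"/></way>'
    b'<way id="3"><tag k="building" v="yes"/><tag k="name" v="Rathaus"/></way>'
    b'</osm>'
)
print(street_names(sample))
```

For a full planet or Europe dump you would pass a file path instead of the in-memory sample, and probably still want a dedicated tool for speed.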
Start with http://en.wikipedia.org/wiki/Category:Streets_in_Germany and http://en.wikipedia.org/wiki/Category:Streets_in_France. You may want to verify the Wikipedia copyright isn't more protective than would be suitable for your needs.
Edit (merged from my own comment): Of course, to answer the "programmatically" part of your question: figure out how to spider and scrape those Wikipedia category pages. The polite thing to do would be to cache it, rather than hitting it every time you need to get the street list; refreshing once every month or so should be sufficient, since the information is unlikely to change significantly.
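A sketch of the caching idea, under a few assumptions: instead of scraping the HTML it asks the MediaWiki API for the category members (`list=categorymembers`), it keeps the result in a local JSON file, and it only refetches when that file is older than 30 days. The cache file name and the User-Agent string are placeholders, and the sketch does not follow continuation pages or descend into subcategories.

```python
import json
import os
import time
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"
MAX_AGE = 30 * 24 * 3600  # roughly one month, in seconds

def fetch_category(category):
    """Fetch up to 500 page titles from one Wikipedia category."""
    params = urllib.parse.urlencode({
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",
        "format": "json",
    })
    req = urllib.request.Request(
        API + "?" + params,
        headers={"User-Agent": "street-list-example/0.1"},  # placeholder
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return [member["title"] for member in data["query"]["categorymembers"]]

def cached_category(category, cache_file, fetch=fetch_category, max_age=MAX_AGE):
    """Return the cached titles, refetching only when the cache is stale."""
    if (os.path.exists(cache_file)
            and time.time() - os.path.getmtime(cache_file) < max_age):
        with open(cache_file, encoding="utf-8") as f:
            return json.load(f)
    titles = fetch(category)
    with open(cache_file, "w", encoding="utf-8") as f:
        json.dump(titles, f, ensure_ascii=False)
    return titles
```

Calling `cached_category("Category:Streets_in_Germany", "streets_de.json")` would hit Wikipedia at most once a month; every other call is served from the file.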
You could start by pulling names via the Google API (e.g. find the lat/long outer bounds of Paris and work toward the center), but since Google limits API use, it would probably take a very long time.
I had once contacted City of Bratislava about the street names list and they sent it to me as XLS. Maybe you could try doing that for your preferred cities.
I like Tom van Enckevort's suggestion, but I would be a little more specific than just looking inside the Planet.osm links, because most of them require some tool to deal with the supported formats (pbf, osm xml, etc.).
In fact, take a look at the following link
http://download.gisgraphy.com/openstreetmap/
The files there are all in .txt format and if it's only the street names that you want to use, just extract the second field (name) and you are done.
As an fyi, I didn't have any use for the French files in my project, but mining the German files resulted (after normalization) in a little more than 380K unique entries (~6 MB in size)
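Extracting the second field can be sketched like this. The tab delimiter and the normalization (collapsing whitespace, dropping empty names) are assumptions on my part; check a few lines of the actual files before relying on either.

```python
import csv

def unique_names(lines, delimiter="\t", field=1):
    """Return the distinct, whitespace-normalized values of one column."""
    names = set()
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) > field:
            name = " ".join(row[field].split())  # collapse stray whitespace
            if name:
                names.add(name)
    return names

# Made-up rows; a real run would pass an open file object instead.
sample = [
    "1\tHauptstraße\tDE\n",
    "2\tHauptstraße \tDE\n",
    "3\t\tDE\n",
    "4\tRue de Rivoli\tFR\n",
]
print(sorted(unique_names(sample)))
```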
#dusoft might be onto something - maybe someone at a government level can help? I don't think a simple list of street names can be copyrighted, or that royalties can be charged for it. If that is the case, maybe you could even scrape some mapping data from something like a TomTom?
The "Deutsche Post" offers a list with all street names in Germany:
http://www.deutschepost.de/dpag?xmlFile=link1015590_3877
They don't mention the price, but I reckon it's not for free.
First post on Stackoverflow.
I am using the Google API to sort images taken while traveling into organized folders, append tags and rename files with relevant information. I have my code working well but am not always happy with the results. I want to be able to focus my query results on major tourist attractions such as National Parks, Ski Resorts, Beaches, etc. The problem I am finding is that the prominence "rankby" variable and the "radius" are not giving satisfactory results. Here is a typical query for Zion National Park.
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=37.269486111111,-112.948141666667&rankby=prominence&radius=50000&type=natural_feature,tourist_attraction,point_of_interest&keyword=&key=MYAPIKEY
The most prominent result is Springdale, which is the town where you enter the park. Zion National Park is listed much further down in the results. What my code does is use the LAT and LON extracted using EXIF and does a Google API nearby search request to find the Place ID for where the photo was taken. It then does another API request for Place Details using the place_id provided by the previous step to cut down on the information I need to parse.
https://maps.googleapis.com/maps/api/place/details/json?place_id=ChIJ8R5RCzaNyoARegi3rqVkstk&fields=name,address_component&key=MYAPIKEY
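For reference, the two-step flow can be sketched as a pair of URL builders. It only reuses the endpoints and parameters from the two URLs above; the function names are made up, and no request is actually sent.

```python
from urllib.parse import urlencode

BASE = "https://maps.googleapis.com/maps/api/place"

def nearby_search_url(lat, lon, api_key, radius=50000,
                      place_type="tourist_attraction", keyword=""):
    """Build the Nearby Search URL for a photo's EXIF coordinates."""
    params = {
        "location": f"{lat},{lon}",
        "rankby": "prominence",
        "radius": radius,
        "type": place_type,
        "keyword": keyword,
        "key": api_key,
    }
    # safe="," keeps the commas readable instead of percent-encoding them.
    return f"{BASE}/nearbysearch/json?{urlencode(params, safe=',')}"

def place_details_url(place_id, api_key, fields=("name", "address_component")):
    """Build the Place Details URL, limited to the fields that get parsed."""
    params = {"place_id": place_id, "fields": ",".join(fields), "key": api_key}
    return f"{BASE}/details/json?{urlencode(params, safe=',')}"

print(nearby_search_url(37.5, -112.5, "MYAPIKEY"))
print(place_details_url("ChIJ8R5RCzaNyoARegi3rqVkstk", "MYAPIKEY"))
```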
I can force the nearby search to return a National Park by searching against "National Park" in the keywords variable but that limits my project to only being able to provide National Park results since the keywords field can only accept one string.
I would like a part of my query to be able to return the most prominent tourist attraction at the general level, i.e. Zion National Park, Yosemite National Park, etc., so I can sort images into the general name folders, and another part of the query to provide the exact location, i.e. I am on this trail or at this lookout. The problem is that the Google API sees these specific locations ("Trail", "Lookout") as tourist attractions, parks, establishments, etc. as well, so it chooses those first.
What I need help with is trying to figure out if there is a better way to structure my query to return the high-level name of the major park. From my understanding, the types field only searches on the first type even if there are more in the list, and the keywords field can only accept one string as well, making it impossible for one phase to capture all major destinations at a high level.
Perhaps it needs to be done with more queries but I am trying to limit the number of queries to stay inside the free quota. Maybe it will just take a long time to fully sort my files.
I have read through and implemented the Google API structure. I am hoping someone can provide a more detailed query structure or a method to parse out truly prominent locations rather than Google's interpretation of prominence, which can be affected by user ratings, etc. and is not always accurate.
I'm using the Places API to search for places and quickly copy some customer information into the front-end of my application.
But sometimes, the name gets translated to English, which is not desirable for the user. How can I avoid the translation of certain places?
For instance: Krankenhaus Göttlicher Heiland will be translated to: Hospital of the Divine Savior. It kinda sounds cool, but it's not what the user wants... :)
I know it defaults to my browser settings and that you can add a language parameter, but I want to turn the translation off completely and get the original native names.
Any idea?
I wrote a Ruby script that appended "data" to the beginning of every word of the English dictionary, and then filtered out various strings using different parameters, and now I want to use a site like namecheap or gandi.net in order to take each of these strings and insert them into the domain name availability checker in order to determine which ones are available.
It is my understanding that this will involve making a POST HTTP request of some kind, as well as grabbing the element in question, but I don't really understand the dynamics of what to read about in order to do this kind of thing.
I imagine that after a few requests I will be limited, but as a learning exercise I am still curious as to how I would go about doing this.
I inspected the element (on namecheap) to see what the tag looked like, to find any uniquely identifiable class/id names that I could use to grab that specific part of the source, and found that inside a fieldset tag, there was a line of HTML that I can't seem to paste here, so here is a picture:
Thanks in advance for any guidance in helping me learn about web scripting!
Assuming that I know nothing about anything and that I'm starting in programming TODAY, what would you say I need to learn in order to start working with Natural Language Processing?
I've been struggling with some string parsing methods, but so far it is just annoying me and making me create ugly code. I'm looking for some fresh ideas on how to build something like the Remember The Milk API to parse user input, so that I can offer fast data entry based on simple one-line phrases instead of fields.
EDIT: RTM is a to-do list system. In order to enter a task you don't need to fill in each field (task name, due date, location, etc.). You can simply type a phrase like "Dentist appointment monday at 2PM in WhateverPlace" and it will parse it and fill all the fields for you.
I don't have any kind of technical constraints since it's going to be a personal project but I'm more familiar with .NET world. Actually, I'm not sure this is a matter of language but if it's necessary I'm more than willing to learn a new language to do it.
My project is related to personal finances so the phrases are more like "Spent 10USD on Coffee last night with my girlfriend" and it would fill location, amount of $$$, tags and other stuff.
Thanks a lot for any kind of directions that you might give me!
This does not appear to require full NLP. Simple pattern-based information extraction will probably suffice. The basic idea is to tokenize the text, then recognize/classify certain keywords, and finally recognize patterns/phrases.
In your example, tokenizing gives you "Dentist", "appointment", "monday", "at", "2PM", "in", "WhateverPlace". Your tool will recognize that "monday" is a day of the week, "2PM" is a time, etc. Finally, you can find patterns like [at] [TIME] and [in] [Place] and use those to fill in the fields.
A framework like GATE may help, but even that may be a larger hammer than you really need.
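The tokenize / classify / match steps can be sketched in plain Python without any framework. This is a toy covering only the dentist example; the keyword list, the time pattern, and the field names are my own assumptions.

```python
import re

DAYS = {"monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"}
TIME_RE = re.compile(r"^\d{1,2}(:\d{2})?(am|pm)$", re.IGNORECASE)

def classify(token):
    """Label a single token as DAY, TIME, or plain WORD."""
    if token.lower() in DAYS:
        return "DAY"
    if TIME_RE.match(token):
        return "TIME"
    return "WORD"

def parse_task(text):
    """Fill day/time/place from [DAY], [at] [TIME], and [in] [Place] patterns."""
    tokens = text.split()
    fields = {"task": None, "day": None, "time": None, "place": None}
    first_marker = len(tokens)  # index where the task name stops
    for i, token in enumerate(tokens):
        label = classify(token)
        prev = tokens[i - 1].lower() if i > 0 else ""
        if label == "DAY" and fields["day"] is None:
            fields["day"] = token
            first_marker = min(first_marker, i)
        elif label == "TIME" and prev == "at":
            fields["time"] = token
            first_marker = min(first_marker, i - 1)
        elif label == "WORD" and prev == "in":
            fields["place"] = token
            first_marker = min(first_marker, i - 1)
    fields["task"] = " ".join(tokens[:first_marker])
    return fields

print(parse_task("Dentist appointment monday at 2PM in WhateverPlace"))
```

Everything before the first recognized pattern is taken as the task name, which is a simplification that breaks as soon as the phrase order changes.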
Have a look at NLTK; it's a good resource for beginner programmers interested in NLP.
http://www.nltk.org/
It is written in Python, which is one of the easier programming languages.
Now that I understand your problem, here is my solution:
You can develop a kind of restricted vocabulary in which all amounts must end with a $ sign and any time must be in the form 00:00 and/or end with AM/PM. For detecting items, you can use a list of objects from an ontology such as Open Cyc. Open Cyc can provide you with a list of objects such as beer, coffee, bread, milk, etc., which will help you detect objects in the short phrase. Still, it would be a very fuzzy approach.
To make matters more specific:
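A small sketch of what such a restricted vocabulary might look like in code. The hard-coded item set is only a stand-in for a list exported from an ontology such as Open Cyc, and the exact patterns are assumptions.

```python
import re

# Amounts must end with "$"; times must look like 00:00, optionally with AM/PM.
AMOUNT_RE = re.compile(r"\b(\d+(?:\.\d{1,2})?)\$")
TIME_RE = re.compile(r"\b\d{2}:\d{2}(?:\s?(?:AM|PM))?\b", re.IGNORECASE)

# Stand-in for an object list taken from an ontology.
KNOWN_ITEMS = {"beer", "coffee", "bread", "milk"}

def parse_expense(text):
    """Pull the amount, time, and known items out of a short phrase."""
    amount = AMOUNT_RE.search(text)
    time = TIME_RE.search(text)
    words = re.findall(r"[A-Za-z]+", text.lower())
    return {
        "amount": float(amount.group(1)) if amount else None,
        "time": time.group(0) if time else None,
        "items": [w for w in words if w in KNOWN_ITEMS],
    }

print(parse_expense("Spent 10$ on coffee at 09:30 PM"))
```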
How to detect people names (seems like simple case of named entity extraction?)
How to detect addresses: my best guess - find postcode (regexes); country and town names and take some text around them.
As for phones, emails - they could be probably caught by various regexes + preprocessing
Don't care about education/working experience at this point
Reasoning:
In order to build a full-text index on resumes, all sensitive information should be stripped out of them.
P.S. any 3rd party APIs/services won't do as a solution.
The problem you're interested in is information extraction from semi structured sources. http://en.wikipedia.org/wiki/Information_extraction
I think you should download a couple of research papers in this area to get a sense of what can be done and what can't.
I feel it can't be done by a machine.
Every other resume will have a different format and layout.
The best you can do is to design an internal format and manually copy every resume content in there. Or ask candidates to fill out your form (not many will bother).
I think that the problem should be broken up into two search domains:
Finding information relating to proper names
Finding information that is formulaic
Firstly, the information relating to proper names could probably best be found by searching for items that are grammatically significant. For example, English capitalizes only the first word of a sentence and proper nouns. Using that rule, you could look for all words whose first letter is capitalized and check them against a database that contains the word and its type [i.e. Bob - Name, Elon - Place, England - Place].
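As a rough illustration of the capitalization check, assuming English text and a tiny in-memory dictionary in place of a real database:

```python
import re

# Toy lookup table; a real one would be a much larger database.
GAZETTEER = {"Bob": "Name", "Elon": "Place", "England": "Place"}

def tag_proper_nouns(text):
    """Return (word, type) pairs for capitalized words that look like proper nouns."""
    found = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[^\W\d_]+", sentence)
        for i, word in enumerate(words):
            if not word[0].isupper():
                continue
            if word in GAZETTEER:
                found.append((word, GAZETTEER[word]))
            elif i > 0:
                # Capitalized mid-sentence but not in the lookup table.
                found.append((word, "Unknown"))
    return found

print(tag_proper_nouns("Bob moved to England. He works in Sales."))
```

The first word of a sentence is only reported when the lookup table knows it, since its capital letter carries no information on its own.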
Secondly: information that is formulaic. This is more about email addresses, phone numbers, and physical addresses. All of these have specific formats that don't change. Use a regex, and use an algorithm to assess the quality of the matches.
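A minimal sketch of the regex part, assuming the goal is to blank out emails and phone numbers before indexing. Both patterns are deliberately loose, and the digit-count check is only a crude stand-in for a real quality score; it will still misfire on things like date ranges.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")

def redact(text):
    """Replace email addresses and plausible phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)

    def phone_sub(match):
        # Crude quality check: keep the match unless the digit count is plausible.
        digits = re.sub(r"\D", "", match.group(0))
        return "[PHONE]" if 7 <= len(digits) <= 15 else match.group(0)

    return PHONE_RE.sub(phone_sub, text)

print(redact("Contact: jane.doe@example.com, +49 30 1234567."))
```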
Watch out:
The grammatical rules change based on language. German capitalizes EVERY noun. It might be best to detect the language of the document prior to applying your rules. Also, another issue with this [and my resume sometimes] is how it is designed. If the resume was designed with something other than a text editor [designer tools] the text may not line up, or be in a bitmap format.
TL;DR Version: NLP techniques can help you a lot.