Generating vector data (points) for OpenLayers Cluster

In my web application I am going to use the OpenLayers.Strategy.AnimatedCluster strategy, because I need to visualize a large number of point features. Here is a very good example of what it looks like. In both of the examples at that link, the data (point features) are either generated or taken from a GeoJSON file.
So, can anybody provide a file containing 100,000+ (ideally 500,000+) features (world cities, for instance), or explain how I can generate them so that they are spread all over the world (not concentrated in Spain, as in the first example at the link above)?
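For reference, a minimal Python sketch of one way to generate such a file, by scattering random points across the globe and writing them out as a GeoJSON FeatureCollection (the property names are placeholders):

import json
import random

# Sketch: generate n random point features spread over the whole world and
# write them as a GeoJSON FeatureCollection that OpenLayers can load.
def random_world_points(n):
    features = []
    for i in range(n):
        lon = random.uniform(-180.0, 180.0)
        lat = random.uniform(-85.0, 85.0)  # keep within the Web Mercator range
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": "point-%d" % i},  # placeholder attribute
        })
    return {"type": "FeatureCollection", "features": features}

with open("world_points.geojson", "w") as f:
    json.dump(random_world_points(100000), f)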

Use a geolocation database to supply the data you need; GeoLite, for example.
If 400K+ locations is enough, download their CSV city list.
If you want more, you might want to give the Nominatim downloads a try, but they are quite bulky (more than 25 GB) and parsing them is not as simple as a CSV file.
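If you go the GeoLite route, converting its city CSV to GeoJSON is straightforward. A hedged Python sketch follows; the file name and column names are assumptions, so check the header of the file you actually download:

import csv
import json

features = []
with open("GeoLiteCity-Location.csv", encoding="latin-1", newline="") as f:  # hypothetical file name
    for row in csv.DictReader(f):
        try:
            lon = float(row["longitude"])  # assumed column names
            lat = float(row["latitude"])
        except (KeyError, ValueError):
            continue  # skip rows without usable coordinates
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"city": row.get("city", "")},
        })

with open("cities.geojson", "w") as f:
    json.dump({"type": "FeatureCollection", "features": features}, f)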

Related

Information Retrieval: Get place name by image

I am starting development of a piece of software in which, from an image of a tourist spot (for example, St. Peter's Basilica, the Colosseum, etc.), I should retrieve the name of the spot (plus related information). In addition to the image, I will have the picture's coordinates (embedded as metadata). I know I can rely on the Google Images API with reverse image search, where I give my image as input and receive a large set of similar images in response.
However, what I would like advice on is this: having all those similar images, which approach can I take to retrieve the correct name of the place shown in the photo?
A second approach I am considering is to build my own dataset in my database and apply my own heuristic (filtering images by their location and then comparing against the resulting subset). Suggestions and advice are welcome, and thanks in advance.
An idea is to use the captions of the images (if available) as a query, retrieve a list of candidates and make use of a structured knowledge base to deduce the location name.
The situation is a lot trickier if there are no captions associated with the images, in which case you may use the fc7-layer output of a pre-trained convolutional net and query ImageNet to retrieve a ranked list of related images. Since those images have captions, you could again use them to get the location name.
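To make the fc7 idea concrete, here is a hedged Python sketch using a pre-trained VGG16 from torchvision (the model, layer choice, and file names are illustrative assumptions, not a prescription). It extracts a 4096-dimensional descriptor per image and ranks a small set of candidate images by cosine similarity to the query photo:

import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import models, transforms

# Pre-trained VGG16; dropping the final classifier layer leaves the
# fc7-style 4096-dimensional output.
model = models.vgg16(weights="IMAGENET1K_V1")
model.classifier = torch.nn.Sequential(*list(model.classifier.children())[:-1])
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def fc7_features(path):
    image = Image.open(path).convert("RGB")
    with torch.no_grad():
        return model(preprocess(image).unsqueeze(0)).squeeze(0)

# Hypothetical file names: rank the candidate images against the query photo.
query = fc7_features("query_photo.jpg")
candidates = ["candidate_1.jpg", "candidate_2.jpg", "candidate_3.jpg"]
ranked = sorted(candidates,
                key=lambda p: float(F.cosine_similarity(query, fc7_features(p), dim=0)),
                reverse=True)
print(ranked)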

Is there an OSM XAPI tag/value list?

I'm new to OSM querying, but would like to query vector data for a large area, so I need to limit the results by specifying tags in the request.
http://www.informationfreeway.org/api/0.6/way[tag=value][bbox=x,y,z,j]
I'd like to filter for specific tags/values when querying for a way, but I don't know which tags/values exist. Is there a list of the most common ones?
You are approaching your problem from the wrong direction. The number of different tags is almost unlimited. According to taginfo there are currently 75 380 856 different tags. I'm pretty sure you are not interested in most of them. Likewise you are probably not even interested in many of the most common tags.
What data do you want to query?
The OSM wiki should be your starting point for compiling a list of tags you are interested in. For a generic overview, take a look at the map features page. Are you interested in streets? Then look at the highway key. Routing? Then take a look at the routing wiki page.
Always remember that these lists aren't complete. People can use any tag they like (but should of course use well-established tags whenever possible).
Also consider using Overpass API instead of XAPI. Overpass API is much more powerful.
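To illustrate, a small Python sketch against the public Overpass API instance; the bounding box and the highway=residential filter are arbitrary examples:

import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"  # public instance

# Overpass QL: residential ways inside a small bounding box (south, west, north, east).
query = """
[out:json][timeout:60];
way["highway"="residential"](48.85,2.29,48.90,2.36);
out tags center;
"""

response = requests.post(OVERPASS_URL, data={"data": query}, timeout=90)
response.raise_for_status()
elements = response.json().get("elements", [])

for way in elements[:10]:
    print(way["id"], way.get("tags", {}).get("name", "<unnamed>"))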

Import data from URL

The St. Louis Federal Reserve Bank has a great set of data available on a variety of their web pages, such as:
http://research.stlouisfed.org/fred2/series/OILPRICE/downloaddata?cid=32217
http://www.federalreserve.gov/releases/h10/summary/default.htm
http://research.stlouisfed.org/fred2/series/DGS20
The data sets get updated, some as often as daily. I tend to be interested in the daily data (see the settings in the URLs above).
I'd like to import these kinds of price or rate data streams (accessible as CSV or Excel files at the above URLs) directly into Mathematica.
I've looked at the documentation on Import[], but I find scant documentation (actually none) on how to go about something like this.
It looks like I need to navigate to the pages, send some data to select specific files and formats, trigger the download, then access the downloaded data from my own machine. Even better if I could access the data directly from the sites.
I had hoped Wolfram Alpha might make this sort of thing easy, but I haven't had any success.
FinancialData[] would seem natural for this, but I don't see any way to do it. FinancialData has lots of features, but none seems to cover this case.
Does anyone have any experience with this or can someone point me in the right direction?
You can Import directly from a URL. For example, the data from federalreserve.gov can be obtained and visualized as follows.
url = "http://www.federalreserve.gov/datadownload/Output.aspx?";
url = url<>"rel=H10&series=a660e724c705cea4b7bd1d1b85789862&lastObs=&";
url = url<>"from=&to=&filetype=csv&label=include&layout=seriescolumn";
data = Import[url, "CSV"];
DateListPlot[data[[7 ;;]], Joined -> True]
I broke up url for convenience, since it's so long. I had to examine the contents of data before I knew exactly how to plot it, a step that is typically necessary. I'm sure that the data from stlouisfed.org can be obtained in a similar way, but it requires an API key to access.
As Mark said, you can get the data directly from a URL. Your oil data can be imported from a different URL than the one you had:
http://research.stlouisfed.org/fred2/data/OILPRICE.txt
With that URL, you can do this:
oil = Import["http://research.stlouisfed.org/fred2/data/OILPRICE.txt",
"Table", "HeaderLines" -> 12, "DateStringFormat" -> {"Year", "Month", "Day"}];
DateListPlot[oil, Joined -> True, PlotRange -> All]
Note that "HeaderLines"->12 option strips off the header text in the first 12 lines (you have to count the header lines to know how many to remove). I've also specified the date format.
To find that URL, do as you did before, but click on a data series and then choose View Data from the menu on the left when you see the chart.
The documentation has a short example on extracting data out of a webpage:
http://reference.wolfram.com/mathematica/howto/CleanUpDataImportedFromAWebsite.html
Of course, what actually needs to be done will vary significantly from page to page.
There is a discussion of how to do this with your API key here:
http://library.wolfram.com/infocenter/MathSource/7583/
The function is based on the API documentation. I haven't looked at the code for a couple of years and, from memory, I put it together rather quickly, but I have used it regularly for over two years without problems. For example, I have used it to pull monthly, non-seasonally-adjusted retail sales from early 1992 to now.
Wolfram|Alpha also uses FRED data, so you could use that as an alternative to direct import, but it is trickier to get the query right. I prefer to use FRED directly. Also, from memory, the data is only available on Alpha the day after release, which is not what you would typically want.

Classify documents with tags

I have a huge number of documents (mainly PDFs and DOCs) that I want to classify so that I can search over them according to certain tags. These tags could either be my own (I assign the tags to the documents) or extracted from the text.
I've just seen a post related to this (Classify data using Apache Mahout), but perhaps there is something even more simple.
Mahout might be overkill for your problem, but you can get a fairly quick, easy solution by using OpenNLP.
http://opennlp.sourceforge.net/api/index.html
Specifically, look at the opennlp.tools.doccat package. Essentially, you have to go through and manually tag a small(ish) set of the items for each category you desire. If they are really distinct, you can get away with a small sample size.
You can use the DocumentCategorizerME.train() static method to train a model from a collection of documents, where each document requires a category tag and the text block to train on. Then you can initialize a DocumentCategorizerME with the trained model and begin classifying the rest of your documents.
Once you do this, you can (I think) write the model to a file so you don't ever have to do that again.
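The doccat API itself is Java; purely as an illustrative analogue of the same train/classify/persist workflow (not the OpenNLP API), here is a Python sketch using scikit-learn, with made-up tags and example texts:

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A small hand-tagged training set, as described above; in practice these
# would be text extracted from your PDFs/DOCs.
texts = ["quarterly revenue and balance sheet", "invoice for office supplies",
         "employment contract and salary terms", "non-disclosure agreement draft"]
tags = ["finance", "finance", "legal", "legal"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, tags)

# Persist the trained model so the manual tagging step never has to be repeated.
joblib.dump(model, "doc_tagger.joblib")

# Later: reload the model and classify the remaining documents.
model = joblib.load("doc_tagger.joblib")
print(model.predict(["please find attached the signed contract"]))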
This post on extracting keywords and classifying webpages is related and may be helpful. In your case it sounds like you can use tags in lieu of the keyword extraction piece (although you may want to use both in combination). Weka is easy to use; I would definitely recommend giving it a look.

Programmatically find common European street names

I am in the middle of designing a web form for German and French users. Within this form, the users would have to type street names several times.
I want to minimize the annoyance to the user and offer an autocomplete feature based on common French and German street names.
Any idea where I can find a royalty-free list?
Would your users have to type the same street name multiple times? Because you could easily prevent this by coding something that prefilled the fields.
Another option could be to use your user database as a resource. Query it for all the available street names entered by your existing users and use that to generate suggestions.
Of course this would only work if you have a considerable number of users.
[EDIT] You could have a look at OpenStreetMap with their Planet.osm dumps (or have a look here for a dump containing data for just Europe). That is basically the OSM database with all the map information they have, including street names. It's all in an XML format, and streets seem to be stored as Ways. There are tools (e.g. Osmosis) to extract the data and put it into a database, or you could write something to plough through the data and filter out the street names for your database.
Start with http://en.wikipedia.org/wiki/Category:Streets_in_Germany and http://en.wikipedia.org/wiki/Category:Streets_in_France. You may want to verify the Wikipedia copyright isn't more protective than would be suitable for your needs.
Edit (merged from my own comment): Of course, to answer the "programmatically" part of your question: figure out how to spider and scrape those Wikipedia category pages. The polite thing to do would be to cache the results rather than hitting Wikipedia every time you need the street list; refreshing once a month or so should be sufficient, since the information is unlikely to change significantly.
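One polite way to do that without scraping HTML is the MediaWiki API's categorymembers list. A hedged Python sketch follows; the category title is taken from the links above, and pagination plus a small delay are included to stay polite:

import time
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def category_members(category, limit=500):
    """Yield page titles in a Wikipedia category, following continuation tokens."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": limit,
        "format": "json",
    }
    while True:
        data = requests.get(API_URL, params=params, timeout=30).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])
        time.sleep(1)  # be polite to the API

streets = list(category_members("Category:Streets_in_Germany"))
print(len(streets), streets[:5])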
You could start by pulling names via the Google API (just find, for example, the lat/long outer bounds of Paris and work towards the center), but since Google limits API use, it would probably take a very long time.
I once contacted the City of Bratislava about a street names list and they sent it to me as an XLS file. Maybe you could try doing that for your preferred cities.
I like Tom van Enckevort's suggestion, but I would be a little more specific than just looking inside the Planet.osm links, because most of them require some tool to deal with the supported formats (PBF, OSM XML, etc.).
In fact, take a look at the following link:
http://download.gisgraphy.com/openstreetmap/
The files there are all in .txt format and if it's only the street names that you want to use, just extract the second field (name) and you are done.
As an FYI, I didn't have any use for the French files in my project, but mining the German files resulted (after normalization) in a little more than 380K unique entries (~6 MB in size).
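A hedged Python sketch of that extraction; it assumes the dump is tab-separated with the street name in the second field, as described above, so adjust the delimiter or column index after inspecting the actual file:

import csv

def unique_street_names(path, delimiter="\t", name_index=1):
    # Collect the second field (assumed to be the street name) of every row,
    # deduplicating as we go.
    names = set()
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter=delimiter):
            if len(row) > name_index and row[name_index].strip():
                names.add(row[name_index].strip())
    return sorted(names)

streets = unique_street_names("streets_germany.txt")  # hypothetical file name
print(len(streets), streets[:10])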
@dusoft might be onto something: maybe someone at a government level can help? I don't think a simple list of street names can be copyrighted, nor can any royalties be charged for it. If that is the case, maybe you could even scrape some mapping data from something like TomTom?
The "Deutsche Post" offers a list with all street names in Germany:
http://www.deutschepost.de/dpag?xmlFile=link1015590_3877
They don't mention the price, but I reckon it's not for free.
