Google Places API: Nationwide Search based on category

I want to build a database of location coordinates for a particular category (say movie_theatre) for a whole country (say India), NOT just nearby places. Suppose there are 5000 such places. Here is what I am planning to use:
https://maps.googleapis.com/maps/api/place/textsearch/json?query=movie+theatre+near+india&sensor=false&key=My_key
Is there a better approach to do this? Also, I cannot use the Radar Search, as it is limited to a radius of 50 km.
Thanks

In the specific case of movie theaters, since they are located where people are settled, it makes more sense to search in the areas where the population lives, namely cities, towns, and villages.
To do this, get a substantial list of cities, towns, and villages from the national census bureau (in the case of India, see here and here) and then run a search for each.
Also, see this answer.
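A minimal sketch of that per-city loop against the Text Search endpoint (Python with the requests package; the cities.txt file, the 2-second sleep for page-token activation, and the lack of error handling are assumptions, not part of the question):

    import time
    import requests

    API_KEY = "My_key"  # your Places API key
    ENDPOINT = "https://maps.googleapis.com/maps/api/place/textsearch/json"

    def theatres_in(city):
        """Yield (name, lat, lng) for movie theatres found in one city."""
        params = {"query": "movie theatre in " + city, "key": API_KEY}
        while True:
            data = requests.get(ENDPOINT, params=params).json()
            for place in data.get("results", []):
                loc = place["geometry"]["location"]
                yield place["name"], loc["lat"], loc["lng"]
            token = data.get("next_page_token")
            if not token:
                break
            time.sleep(2)  # the next_page_token takes a moment to become valid
            params = {"pagetoken": token, "key": API_KEY}

    # cities.txt: one city/town/village per line, compiled from census data
    with open("cities.txt") as f:
        for city in f:
            for row in theatres_in(city.strip()):
                print(row)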

Related

Algorithm for Matching Hospital Names

I work at a health care company and I have trouble with the hospitalization report data. The data comes from various sources: Excel reports, plain text files, and in some cases paper. I managed to get all the data into one Excel file, but I am running into a problem: each person spelled and referred to the same hospital differently.
For example, for New York Presbyterian Hospital I have seen more than 10 variations:
New York Presbyterian Hospital
NY Presbyterian Hospital
Presbyterian Hospital
Presb Hospital
PresbHosp
New_York_Presb_Hosp
NYPresbHosp
Columbia Presbyterian Medical Center
NYP/Columbia University Medical Center
New York Presbyterian Hospital Columbia University Medical
And many more cases where the hospital name is misspelled.
A few of the different systems have string-length limits and cut the name off at random places, or maybe it was copied and pasted incorrectly.
Different nurses refer to the hospital differently.
In effect, I am trying to create a true database that can store all the members' information, but I am running into a wall because each staff member/department names the hospital in a different way. (There is a provider ID unique to each hospital, but most of the reports I received only included the name.) I have over 2000 members with about 100-150 hospitals, but 3 or 4 times that many different names.
I know Levenshtein distance could be used, but in such an extreme case, is there a strategy to build a match? There is too much data to do by hand (time consuming), since this is one of the dozens of reports I am assigned. Any suggestion would be appreciated.
This is a pretty standard and pretty difficult problem. Entire companies exist to solve it for big data.
The usual strategy is to encode what is known about the data domain in a heuristic algorithm to classify the data before putting it in the database.
A standard classification method would be to create a set of pattern strings for each hospital. The examples you gave might go in the pattern set initially.
Then, for each incoming string and each pattern, calculate a metric that is the difference between the string and the pattern; Levenshtein distance is a good starting point. The set containing the least-distant pattern (in this case Columbia Presbyterian) wins. An excessively large least distance means your pattern set is no good. (You get to tweak what "excessive" means.) More than one low distance (you get to define "low", too) means the pattern set has inadvertent overlaps.
Both problems may be handled in various ways, usually involving human intervention either to classify the data or enhance the pattern sets or both.
A second possibility is to use regexes as patterns. Then a match is equivalent to distance zero above, and a non-match is distance infinity. As you might expect, this makes the algorithm less flexible. Yet for some kinds of data - probably not yours though - it's the best choice.
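A minimal sketch of the pattern-set approach described above (Python; difflib stands in for a real Levenshtein implementation, and the thresholds are the tweakable "excessive"/"low" knobs, all assumptions rather than a known-good tuning):

    from difflib import SequenceMatcher

    # One pattern set per hospital; seed it with the variants already seen.
    PATTERNS = {
        "New York Presbyterian": [
            "new york presbyterian hospital",
            "ny presbyterian hospital",
            "presb hospital",
            "columbia presbyterian medical center",
        ],
        # ... more hospitals ...
    }

    def distance(a, b):
        """Edit-distance proxy in [0, 1]; swap in true Levenshtein for real use."""
        return 1.0 - SequenceMatcher(None, a, b).ratio()

    def classify(raw_name, excessive=0.5, low=0.05):
        name = raw_name.lower().strip()
        scored = sorted(
            (min(distance(name, p) for p in pats), hospital)
            for hospital, pats in PATTERNS.items()
        )
        if scored[0][0] > excessive:
            return None  # no pattern set is close enough: route to a human
        if len(scored) > 1 and scored[1][0] - scored[0][0] < low:
            return None  # two pattern sets overlap: route to a human
        return scored[0][1]

    print(classify("NYPresbHosp"))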
You should look for specific patterns that your data is forming. What I have observed is that, out of the strings you've revealed, "Presb" is a substring used in all of them (all the variations of the hospital field that you have been provided with). #M-ohem's comment is a nice approach as well. But for starters, you can put up a regular expression which checks whether an input string contains the pattern "Presb". Learn more.
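For illustration, that first-pass regex check might look like this (Python; the list of names is made up):

    import re

    PRESB = re.compile(r"presb", re.IGNORECASE)

    names = ["NYPresbHosp", "New_York_Presb_Hosp", "Mount Sinai"]
    print([n for n in names if PRESB.search(n)])
    # ['NYPresbHosp', 'New_York_Presb_Hosp']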

Sort POIs by distance from current location

Trover is an awesome app: it shows you a stream of discoveries (POIs) people have uploaded, sorted by distance from any location you specify (usually your current location). The further you scroll through the feed, the farther away the displayed discoveries are. An indicator tells you quite accurately how far away the currently shown discoveries are (see the screenshots on their website).
This is different from most other location-based apps, which deliver their results (POIs) based on fixed regions (e.g. give me all pizzerias within a 10 km radius), something that can be implemented using a single spatial data structure (or an SQL engine supporting spatial data types). Delivering results the way Trover does is considerably harder:
You can query POIs for arbitrary locations. Give Trover a location in the far east of Russia and it will deliver discoveries where the first one is 2000 km away, with distances continuously increasing from there.
The result list of POIs is not limited by some spatial range. If you scroll long enough through the feed you will probably see discoveries which are on the other side of the globe.
The above points require a semi-strict ordering of their POIs for any location. The fact that you can scroll down and reload more discoveries implies that they can deliver specific sections of the sorted data (e.g. give me the next 20 discoveries that are at least 100km away from my current location).
It's fast; the fetching and distance indications are instant. The discoveries must be pre-sorted. I don't know how many discoveries they have in their DB, but it must be more than you would want to sort ad hoc.
I find these characteristics quite remarkable and wonder how this is implemented. Any suggestions what kind of data-structure, algorithms or caching might be used?
I don't get the question. What do you want an answer to?
Edit:
They might use a graph database where an edge represents the distance between two nodes. That way you can get the distance from the relationships of nearby POIs. You would calculate the distance and create edges to nearby nodes. To get the distance from an arbitrary point you just do a great-circle distance calculation; for another node you just add up the edge values, as they represent distance (this covers the walking, biking, or driving case). The sum may not be the shortest path, but it gives the kind of relative indication they seem to use.
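Whether or not Trover does it this way, the "next 20 discoveries by distance" behavior can also be sketched as an incremental nearest-neighbour query over a prebuilt spatial index; here with scipy's cKDTree on projected coordinates (the random data and the straight-line metric are assumptions; real lat/long data would need a great-circle metric):

    import numpy as np
    from scipy.spatial import cKDTree

    poi_coords = np.random.rand(100_000, 2)  # stand-in for real POI locations
    tree = cKDTree(poi_coords)               # built once, ahead of time

    def page_of_pois(location, offset, page_size=20):
        """POIs ranked offset..offset+page_size by distance from location."""
        dists, idx = tree.query(location, k=offset + page_size)
        return list(zip(idx[offset:], dists[offset:]))

    print(page_of_pois((0.5, 0.5), offset=0))   # first screenful
    print(page_of_pois((0.5, 0.5), offset=20))  # loaded when the user scrolls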

Algorithms to find stuff a user would like based on other users' likes

I'm thinking of writing an app to classify movies in an HTPC based on what the family members like.
I don't know statistics or AI, but the stuff here looks very juicy. I wouldn't know where to start, though.
Here's what I want to accomplish:
Compose a set of samples from each users likes, rating each sample attribute separately. For example, maybe a user likes western movies a lot, so the western genre would carry a bit more weight for that user (and so on for other attributes, like actors, director, etc).
A user can get suggestions based on the likes of the other users. For example, if both user A and B like Spielberg (connection between the users), and user B loves Batman Begins, but user A loathes Katie Holmes, weigh the movie for user A accordingly (again, each attribute separately, for example, maybe user A doesn't like action movies so much, so bring the rating down a bit, and since Katie Holmes isn't the main star, don't take that into account as much as the other attributes).
Basically, compare sets from user A to similar sets from user B, and come up with a rating for user A.
I have a crude idea about how to implement this, but I'm certain some bright minds have already thought of a far better solution already, so... any suggestions?
Actually, after some quick research, it seems a Bayesian filter would work. If so, would this be the better approach? Would it be as simple as just "normalizing" the movie data, training a classifier for each user, and then classifying each movie?
If your suggestion includes some brain-melting concepts (I'm not experienced in these subjects, especially AI), I'd appreciate it if you also included a list of basics for me to research before diving into the meaty stuff.
Thanks!
Matthew Podwysocki had some interesting articles on this stuff
http://codebetter.com/blogs/matthew.podwysocki/archive/2009/03/30/functional-programming-and-collective-intelligence.aspx
http://codebetter.com/blogs/matthew.podwysocki/archive/2009/04/01/functional-programming-and-collective-intelligence-ii.aspx
http://weblogs.asp.net/podwysocki/archive/2009/04/07/functional-programming-and-collective-intelligence-iii.aspx
This is similar to this question, where the OP wanted to build a recommendation system. In a nutshell, we are given a set of training data consisting of users' ratings of movies (a 1-5 star rating, for example) and a set of attributes for each movie (year, genre, actors, ...). We want to build a recommender that outputs a possible rating for unseen movies. So the input data looks like:
user  movie  year  genre  ... | rating
--------------------------------------
1     1      2006  action     | 5
3     2      2008  drama      | 3.5
...
and for an unrated movie X:
10    20     2009  drama      | ?
we want to predict a rating. Doing this for all unseen movies then sorting by predicted movie rating and outputting the top 10 gives you a recommendation system.
The simplest approach is to use a k-nearest neighbor algorithm. Among the rated movies, search for the "closest" ones to movie X, and combine their ratings to produce a prediction.
This approach has the advantage of being very simple and easy to implement from scratch.
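A toy version of that nearest-neighbour predictor (Python/numpy; the feature encoding and tiny dataset are made up, and real features would need normalization so the year doesn't dominate the distance):

    import numpy as np

    # (year, is_action, is_drama) -> rating; stand-in for real training data
    rated = {
        (2006, 1, 0): 5.0,
        (2008, 0, 1): 3.5,
        (2005, 1, 0): 4.0,
    }

    def predict(movie, k=2):
        feats = np.array(list(rated))
        ratings = np.array(list(rated.values()))
        dists = np.linalg.norm(feats - np.array(movie), axis=1)
        return ratings[np.argsort(dists)[:k]].mean()

    print(predict((2009, 0, 1)))  # predicted rating for an unseen 2009 drama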
Other, more sophisticated approaches exist. For example, you can build a decision tree or fit a set of rules on the training data. You can also use Bayesian networks, artificial neural networks, and support vector machines, among many others... Going through each of these won't be easy for someone without the proper background.
Still, I expect you would be using an external tool/library. Now, since you seem to be familiar with Bayesian networks, a simple naive Bayes net could in fact be very powerful. One advantage is that it allows for prediction under missing data.
The main idea would be somewhat the same; take the input data you have, train a model, then use it to predict the class of new instances.
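A minimal train-then-predict sketch with scikit-learn's naive Bayes (the library choice, the feature encoding, and the tiny dataset are all assumptions; any of the tools below would do the same job without code):

    from sklearn.naive_bayes import GaussianNB

    X_train = [[2006, 0], [2008, 1], [2005, 0], [2007, 1]]  # year, genre id
    y_train = [5, 3, 4, 3]                                  # star rating as class
    model = GaussianNB().fit(X_train, y_train)

    print(model.predict([[2009, 1]]))  # predicted rating class for a new movie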
If you want to play around with different algorithms in a simple, intuitive package which requires no programming, I suggest you take a look at Weka (my 1st choice), Orange, or RapidMiner. The most difficult part would be preparing the dataset in the required format. The rest is as easy as choosing an algorithm and applying it (all in a few clicks!).
For someone not looking to go into too much detail, I would recommend going with the nearest-neighbor method, as it is intuitive and easy to implement. Still, the option of using Weka (or one of the other tools) is worth looking into.
There are a few algorithms that are good for this:
ARTMAP: groups via probability against each other (this isn't fast, but it's the best thing for your problem IMO).
ARTMAP holds a group of common attributes and determines the likelihood of similarity via percentages.
ARTMAP
KMeans: This separates the vectors by the distance they are from each other.
KMeans: Wikipedia
PCA: will separate the average of all the values from the varying bits. This is what you would use to do face detection and background subtraction in computer vision.
PCA
The K-nearest neighbor algorithm may be right up your alley.
Check out some of the work of the top teams for the Netflix Prize.

Regional Proximity UI

I'm developing a UI (AJAX-enabled; LAMP server) which will allow a user to designate regions in which a company operates. A "region" in this case may be a state (if dealing with the US) a province (Canada), or entire country (everyone else).
As there are 195 countries in the world, I would like to avoid a multi-select box or list of checkboxes. In the workflow leading to this particular screen, the user will have already entered the full address of the company, so I have a starting region to work from.
Since the majority of companies only operate out of their own region, and those covering multiple regions tend not to branch out too far, I am considering displaying the list of regions gradually based on proximity. I realize at some point (I'm using 3 passes for now) the full list will need to be displayed; I'm just trying to delay the user from reaching that point as it's a definite edge case.
Here is a PNG mockup that explains this concept a bit more clearly. (196kb)
Questions:
What suggestions do you have for the actual form interaction? This has not been presented to representative end users yet, but I'm open to all suggestions during the prototyping stage.
Do you think 'rolling up' US states and/or Canadian provinces between transitions will negatively affect the user's spatial memory?
More clearly: after the 3rd pass, the company will operate in every US state - so convert those 50 inputs into one.
Are there any existing applications that have utilized this approach to use as a baseline or demo?
And, since I know my developer will want to know - what would be the easiest way to store each region's proximity? Lat/long of the center? Lat/long of each corner of a 'bounding box' (more accurate)? I'm assuming we will end up writing some proximity calculations based on the lat/long of the company's actual address.
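However the proximity ends up being stored, the centroid approach mentioned above is easy to prototype: one lat/long per region, ranked by great-circle distance from the company's address (a sketch; the centroids below are rough illustrative values, and a bounding box would indeed be more accurate):

    from math import asin, cos, radians, sin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance between two lat/long points, in km."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = (sin((lat2 - lat1) / 2) ** 2
             + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371 * asin(sqrt(a))

    # Illustrative centroids only
    regions = {"Pennsylvania": (40.9, -77.8), "Ohio": (40.4, -82.8),
               "France": (46.6, 2.4)}
    home = (40.44, -79.99)  # the company's geocoded address

    ranked = sorted(regions, key=lambda r: haversine_km(*home, *regions[r]))
    print(ranked)  # nearest regions first; chunk these into the display passes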
Are you expecting users to read the map in order to know which list of checkboxes to go to? If your users have that level of geographic ability, then it's less work for them to select the regions directly from the map, rather than making a map-to-proximity-level cognitive transfer followed by a proximity-level-to-region transfer.
If some users do not have that level of geographic expertise (you may be surprised how many Americans cannot find their own state on a US map), then I'd try, perhaps in addition to the map, no more than two lists: one proximal (the default) with regions close to the home address, and one exhaustive. I can't see users with weak geographic abilities being able to handle multiple arbitrary levels of proximity. People who can't read maps well are not going to be able to estimate the proximity level of one region to another. So the idea is to try a proximal list, and if that doesn't work, then forget about proximity and go exhaustive; don't send your users wandering among proximity levels looking for Idaho ("I swear it's near Indiana").
By default, show the proximal list with regions likely to satisfy most of your users based on research of your likely clients. A “more” button displays the exhaustive list. Both lists should be sorted alphabetically, except first subdivide the exhaustive list into States of the US, Provinces & Territories of Canada, and Country (which includes the US (all) and Canada (all)).
You can provide some command buttons to select multiple regions (e.g., "All 48 contiguous US states", "All of South America"), allowing users to de-select some regions afterward. For this reason, I wouldn't roll anything up until the user commits the input.
As an example of someone using a map plus list (all in HTML, no less), see http://justaddwater.dk/2007/12/21/map-with-positions-in-css/
I am not really clear on what you are trying to achieve with the current UI (are you looking for branch offices? other companies? etc.).
I am not a big fan of using pure geographical proximity to define regions. For example, if a company operates in NYC, it could have an office in NJ which might as well be as far away as the moon. On the other hand, for a company in Anchorage, an office in Vancouver could still be within the region. Unfortunately, state boundaries are fairly meaningless too. For example, I live in western PA and can tell you that while Pittsburgh and Philly are in the same state, they could be different countries for all that matters, and most companies have offices in each.
If your project is LAMP-based, why not just let the user click a point on the map and, based on that, ask what they mean (e.g., nearest city, entire county, entire state, entire country?). If you then need to define the entire region, you could perhaps use some sort of grab tool to click or delineate all the other regions that should be part of it.
Either way, present your offices as pushpins on the map, and then maybe have a list on the side, the way standard Google Maps handles searches.
It may be a lot of work, but if it's an important form, users may prefer that over manual text entry or selections from a list.

How does "Find Nearest Locations" work?

Nowadays most restaurants and other businesses have a "Find Locations" feature on their websites which lists the nearest locations for a given address/ZIP. How is this implemented? Matching the zip code against the DB is a simple, no-brainer way to do it, but it may not always work; for example, there may be a branch closer to the given location that is in a different zip. One approach that comes to mind is to convert the given zip code/address into map coordinates and list any branches falling within a pre-defined radius. I welcome your thoughts on how this would have been implemented. If possible, provide more detailed implementation details, like any web services used, etc.
A lot of geospatial frameworks will help you out with this. In the geospatial world, a zip code is just a "polygon", which is just an area on a map with clear boundaries (not a polygon in the math sense). In SQL Server 2008 spatial, for example, you can create a new polygon based on your original polygon. So you can dynamically create a polygon that is your zip code extended by a certain distance at every point; it takes the funky shape of the zip code into account. With an address, it's easy, because you just create a polygon which is a circle around the one point. You can then run queries that give you all points within the new polygon created by either method.
A lot of these sites are basically just doing this. They give you all points within a 5-mile extended polygon, then maybe a 10-mile extended polygon, and so on and so forth. They are not actually calculating distance. Most map stuff on the web is not sophisticated at all.
You can see some basic examples here to get the general idea of what I'm talking about.
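The same buffer-the-polygon idea can be sketched outside SQL Server with the shapely library (an assumption; the square "zip code" below is a toy shape):

    from shapely.geometry import Point, Polygon

    zipcode = Polygon([(0, 0), (0, 1), (1, 1), (1, 0)])  # toy zip-code shape
    extended = zipcode.buffer(0.05)   # the zip code, extended at every point
    branch = Point(1.02, 0.5)         # a branch just outside the original zip

    print(zipcode.contains(branch))   # False
    print(extended.contains(branch))  # True: caught by the extended polygon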
There is a standard zipcode/location database available. Here is one version in Access format that includes the lat/long of each zipcode as well as other information. You can then use the PostgreSQL GIS extensions to do searches on the locations, for example.
(assuming, of course, that you extract the Access DB and insert it into a friendlier database like PostgreSQL)
First, you geocode the address, translating it into (usually) a latitude and longitude. Then you do a nearest-neighbour query on your database for points of interest.
Most spatial indexes don't directly support nearest-neighbour queries, so the usual approach here is to query on a bounding box of a reasonable size with the geocoded point at the center, then sort the results in memory to pick the closest ones.
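A sketch of that bounding-box-then-sort pattern (Python with sqlite3 for illustration; the locations table, the column names, and the box size are assumptions):

    import sqlite3
    from math import asin, cos, radians, sin, sqrt

    def haversine_miles(lat1, lon1, lat2, lon2):
        """Great-circle distance between two lat/long points, in miles."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = (sin((lat2 - lat1) / 2) ** 2
             + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
        return 2 * 3959 * asin(sqrt(a))

    def nearest(conn, lat, lng, box_deg=0.5, limit=10):
        # 1. cheap, index-friendly bounding-box query around the geocoded point
        rows = conn.execute(
            "SELECT id, lat, lng FROM locations"
            " WHERE lat BETWEEN ? AND ? AND lng BETWEEN ? AND ?",
            (lat - box_deg, lat + box_deg, lng - box_deg, lng + box_deg),
        ).fetchall()
        # 2. exact distances and an in-memory sort to pick the closest
        return sorted(rows, key=lambda r: haversine_miles(lat, lng, r[1], r[2]))[:limit]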
Just like you said. Convert an address/ZIP into a 2D world coordinate and compare it to other known locations. Pick the nearest. :) I think some DB's (Oracle, MSSQL 2008) even offer some functions that can help, but I've never used them.
I think it is pretty universal. They take the address or zip code and turn it into a "map coordinate" (differs depending on implementation, probably a lat/long), and then, using the "map coordinates" of the things in the database, it is easy to calculate a distance.
Note that some poor implementations convert the zip code into a coordinate representing the center of the zip-code area, which sometimes gives bad results.
Your thoughts on how to do it are how I would probably do it. You can geocode the coordinates for the zip and then do calculations based on that. I know SQL Server 2008 has some new functionality to help with queries based on these geocoded lon/lat coordinates.
There are actual geometric algorithms and/or data structures that support lower-O(...) nearest-location queries on point, line, and/or region data.
See this book as an example of information on some of them, like: Voronoi diagrams, quadtrees, etc.
However I think the other answers here are right in most cases that you find in software today:
geocode (a single point in) the search area
bounding box query to get an initial ballpark
in memory sorting/selecting
I had a database table that I would recompile every 6 months. It contained 3 columns and about 40k rows, and I used it for a few clients in Australia; it was very lightweight to run a query against. This is quite quick if you're just looking to get something off the ground for a client:
Postal Code From
Postal Code To
Distance
SELECT Store_ID, Store_AccountName, Store_PostalCode, Store_Address,
       Store_Suburb, Store_Phone, Store_State, Code_Distance
FROM Store,
     (SELECT Code_To AS Code_To, Code_Distance
        FROM Code WHERE Code_From = #PostalCode
      UNION ALL
      SELECT Code_From AS Code_To, Code_Distance
        FROM Code WHERE Code_To = #PostalCode
      UNION ALL
      SELECT #PostalCode AS Code_To, 0 AS Code_Distance) AS Code
WHERE Store_PostalCode = Code_To
  AND Code_Distance <= #Distance
ORDER BY Code_Distance
There is plenty of optimization you could do to speed up this query!
