First of all, I must say that I was very hesitant to post the following, as I am afraid of getting downvotes. However, I've spent days thinking about a solution and I haven't found one. My last hope is to get some answers in this post.
The Problem
Say that you have a large database of drivers connected in real time to your backend, and you are fetching their lat/long every 5 seconds and posting it back to the backend so that you can update each driver's location in real time. Suppose we want to make use of the drivers and their positions to let a particular user find the closest connected driver (like Uber, Lyft, etc.).
The question:
How is it possible to dispatch requests to these drivers? (I only want you to share your thoughts and ideas.)
What you are looking for is called "geospatial search".
If you are looking for algorithms to implement, take a look at Nearest Neighbour Search.
The most famous is the k-Nearest Neighbours algorithm.
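To make the idea concrete, here is a minimal sketch in Python: a brute-force nearest-driver lookup using the haversine formula. The `drivers` dict and function names are made up for illustration; at scale you would replace the linear scan with a spatial index (k-d tree, geohash, etc.).

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def k_nearest_drivers(user_lat, user_lon, drivers, k=5):
    """Return the k drivers closest to the user.

    `drivers` is a dict of {driver_id: (lat, lon)} refreshed every few seconds.
    Brute force is O(n); a k-d tree or geohash index would replace this at scale.
    """
    scored = [
        (haversine_km(user_lat, user_lon, lat, lon), driver_id)
        for driver_id, (lat, lon) in drivers.items()
    ]
    return sorted(scored)[:k]

# Example usage with made-up positions
drivers = {"d1": (48.8566, 2.3522), "d2": (48.8600, 2.3300), "d3": (48.9000, 2.4000)}
print(k_nearest_drivers(48.8570, 2.3500, drivers, k=2))
```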
If you are only interested in using an existing implementation and building your application on top of it, then there are existing databases and search applications which provide geospatial search capability.
Check Apache Solr, which provides geospatial search capabilities: https://cwiki.apache.org/confluence/display/solr/Spatial+Search
You just need to feed your drivers' live locations into Solr and query with the current location of the user. Solr will take care of finding the nearest drivers, and you will get a search result matching your criteria.
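As a rough sketch only (assumptions flagged in the comments), indexing positions and running a spatial query from Python with the pysolr client might look like this; the core name `drivers` and the `location` field are placeholders, and the geofilt/geodist syntax is the one described on the Solr page linked above.

```python
import pysolr  # third-party client; pip install pysolr

# Core name "drivers" and the field name "location" are assumptions for this sketch.
solr = pysolr.Solr("http://localhost:8983/solr/drivers", timeout=10)

# Index (or update) a driver's latest position; "lat,lon" string for a lat/long field type.
solr.add([{"id": "driver-42", "location": "48.8566,2.3522"}], commit=True)

# Find drivers within 5 km of the user, nearest first.
user_pos = "48.8570,2.3500"
results = solr.search(
    "*:*",
    fq="{!geofilt pt=%s sfield=location d=5}" % user_pos,
    sort="geodist(location,%s) asc" % user_pos,
    rows=10,
)
for doc in results:
    print(doc["id"])
```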
You can use this as a starting point to build your app with location-based searches. In practice, Uber, Lyft and other major services have their own in-house applications with custom implementations.
Related
I'm a developer on a service-vehicle dispatching web app. It's written in .NET 4+, MVC4, using SQL Server.
There are 2000+ locations stored in the database as geography data types. Assuming we send resources from location A to location B, the drive time / distance etc. needs to be displayed at some point. If I calculate the distance with SQL Server's STDistance, it will only give me the "as the crow flies" distance. So the system will need to hit a geospatial service like Bing, Google, or ESRI and get the actual drive time or suggested routes. The problem is that this is a core function and will happen a lot.
Should I pre-populate a lookup table with pre-calculated distances or average drive times? The downside is that, even without adding more locations, that's 4 million records to search every time the information is needed.
On top of this, most of the time the destination is not one of our stored geospatial coordinates and can instead be an address or a long/lat point anywhere on the continent, which makes pre-calculating impossible.
I'm trying to avoid the performance issues of having to hit some geoservices endpoint constantly.
Any suggestions on how best to approach this?
-thanks!
Having looked at these problems before, I'd say you are unlikely to be able to store them all.
It is usually against almost all of the routing providers' TOS for you to cache the results. You can sometimes negotiate this ability, but it costs a lot.
Given that there is not a fixed set of points you are searching against, doing one calculation gives you little information for the next calculation.
I would say you could store the route for a pair once it has been selected, so you can show that route again if needed. Once the transaction is done, I would remove the route from your DB.
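Something like the following Python sketch is what I have in mind; the routing call is stubbed out (in your app it would be the Bing/Google/ESRI request), and the class and field names are made up:

```python
import time

class RouteCache:
    """Keep a fetched route only while its dispatch is active.

    The routing call itself is stubbed; in the real app it would hit
    Bing/Google/ESRI. Keys are lat/long pairs rounded to ~11 m so
    near-identical coordinates reuse the same entry.
    """

    def __init__(self, fetch_route):
        self._fetch_route = fetch_route      # callable(origin, dest) -> route dict
        self._routes = {}                    # key -> (route, stored_at)

    @staticmethod
    def _key(origin, dest):
        return (round(origin[0], 4), round(origin[1], 4),
                round(dest[0], 4), round(dest[1], 4))

    def get(self, origin, dest):
        key = self._key(origin, dest)
        if key not in self._routes:
            self._routes[key] = (self._fetch_route(origin, dest), time.time())
        return self._routes[key][0]

    def release(self, origin, dest):
        """Call when the dispatch/transaction is finished."""
        self._routes.pop(self._key(origin, dest), None)

# Example with a stubbed routing service
cache = RouteCache(lambda o, d: {"drive_minutes": 17.5, "km": 12.3})
route = cache.get((45.50, -73.57), (45.53, -73.62))   # first call hits the service
route = cache.get((45.50, -73.57), (45.53, -73.62))   # second call comes from the cache
cache.release((45.50, -73.57), (45.53, -73.62))       # dispatch done, drop the route
```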
If you really want to cache all of this or have more control over it, you can use pgRouting (with PostgreSQL) and then obtain street data, though I doubt it is worth your effort.
I'm developing a web application that will heavily depend on its ability to make suggestions on items based on users with similar preferences. A friend of mine told me that what I'm looking for - mathematically - is some cluster analysis algorithm. On the other hand, here on SO, I was told that Neo4j (or some other graph DB) was the kind of DB that I should approach for this task (the preferences one).
I started studying both of these tools, and I'm having some doubts.
For Cluster Analysis purposes it looks to me that a standard SQL DB would still be the perfect choice, while Neo4j would be better suited for a Neural Network kind of approach (although still perfectly fit for the task).
Am I missing something? Am I trying to use the wrong tools combination?
I would love to hear some ideas on the subject.
Thanks for sharing
This depends on your data. Neo4j is capable of providing even complex recommendations in real time for one particular node - let's say you want to recommend some product to a user; this can be handled within a graph DB in real time (see the Cypher sketch at the end of this answer),
whereas using some clustering system is the best way to do recommendations for all users at once (and then maybe save them somewhere so you wouldn't need to calculate them again).
The computational difference:
Neo4j has no initialization cost and can give you one recommendation in an acceptable time.
Clustering needs more time for initialization (not seconds but most likely minutes/hours) and is better suited to calculating recommendations for the whole dataset. In fact, taking strictly the time of one calculation for a specific user, clustering can do it faster than Neo4j, but the big restriction is the initialization - thus it is not good for real-time applications.
The practical difference:
If you have mostly static data and it is OK for you to do recommendations once in a while, then do clustering with SQL.
If you have dynamic data that is updated with each interaction and you always need to provide the newest recommendations, then use Neo4j.
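For the graph side, here is a hedged sketch of a simple "users who liked X also liked Y" query using the official Neo4j Python driver; the User/Item labels and the LIKES relationship are assumptions for illustration, not anything from your schema:

```python
from neo4j import GraphDatabase  # official driver; pip install neo4j

# Labels (User, Item) and the LIKES relationship are assumptions for this sketch.
RECOMMEND = """
MATCH (me:User {id: $user_id})-[:LIKES]->(:Item)<-[:LIKES]-(peer:User)-[:LIKES]->(rec:Item)
WHERE NOT (me)-[:LIKES]->(rec)
RETURN rec.name AS item, count(DISTINCT peer) AS score
ORDER BY score DESC
LIMIT $limit
"""

def recommend(uri, user_id, limit=10):
    """Real-time recommendations for one user via shared LIKES relationships."""
    driver = GraphDatabase.driver(uri, auth=("neo4j", "password"))
    try:
        with driver.session() as session:
            result = session.run(RECOMMEND, user_id=user_id, limit=limit)
            return [(record["item"], record["score"]) for record in result]
    finally:
        driver.close()

print(recommend("bolt://localhost:7687", user_id="u123"))
```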
I am currently working on various topics related to recommendation and clustering with Neo4j.
I'm not exactly sure what you're looking for, but depending on how you model your data in the graph, you can easily work out clustering algorithms based on counting links to various types of nodes.
If you plan your nodes and relationships correctly, you can then identify groups of nodes that share the most common links to a set of categories.
Let me introduce Reco4J (http://www.reco4j.org). It is an open-source framework that provides recommendations based on a graph database source. It uses Neo4j as its graph database management system.
Have a look at it and contact us if you are interested in support.
It is in a really early release but we are working hard to provide extended documentation and new interesting features.
Cheers,
Alessandro
Bing's search hits are quite impressive. Has Microsoft not let anyone in behind the scenes of their search technology? I tried http://www.discoverbing.com but couldn't find the answer to my question.
Microsoft has historically used a neural network ranking function. The neural network combines the hundreds of ranking-related variables that a URL has associated with it (see their paper). They would typically score more than 100 docs using a detailed ranker; each query node needs to score its top documents in isolation and return them to the aggregator. Ranking is actually very complex, and scoring algorithms are typically multi-leveled.
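As a purely illustrative sketch (this is not Microsoft's ranker; the features and weights are invented), multi-level scoring usually looks something like this: a cheap first pass selects candidates, and a more detailed scorer reranks only the top N:

```python
# Toy illustration of multi-level ranking: a cheap first-pass score selects
# candidates, then a more expensive "detailed ranker" rescores only the top N.
# The features and the tiny hand-written "neural" scorer are made up for the sketch.
import math

def cheap_score(doc, query_terms):
    """First pass: simple term-overlap score, run over every candidate."""
    return sum(doc["text"].lower().count(t) for t in query_terms)

def detailed_score(doc, query_terms):
    """Second pass: a made-up weighted combination of a few features."""
    features = [
        cheap_score(doc, query_terms),   # term matches
        doc["inlinks"],                  # link popularity
        1.0 / (1 + doc["length"]),       # brevity
    ]
    weights = [0.6, 0.3, 2.0]            # pretend these were learned
    z = sum(w * f for w, f in zip(weights, features))
    return 1 / (1 + math.exp(-z))        # sigmoid, as a single "neuron"

def rank(docs, query, first_pass=100, final=10):
    terms = query.lower().split()
    candidates = sorted(docs, key=lambda d: cheap_score(d, terms), reverse=True)[:first_pass]
    reranked = sorted(candidates, key=lambda d: detailed_score(d, terms), reverse=True)
    return reranked[:final]

docs = [
    {"url": "a.com", "text": "bing search ranking", "inlinks": 12, "length": 300},
    {"url": "b.com", "text": "search engine basics", "inlinks": 4, "length": 900},
]
print([d["url"] for d in rank(docs, "bing ranking")])
```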
For compute jobs, factor generation, etc., Microsoft Search uses SCOPE, which I believe is built on top of Dryad but does not use DryadLINQ. SCOPE is basically a SQL-like language on top of a cluster.
Actually, Microsoft is far more open about their search technology than Google is; see Microsoft Research Asia and Microsoft Research Silicon Valley.
This is second-hand information, but I understand they use inverted indexes for finding the top 100 or so results, and then they use a set of neural networks to narrow it down several times to the top 10, the top 3, and then the first one.
They do this because they reason that the first hit is what makes a user think the search engine works or not. If you search for CNN and you don't get CNN.com as the first hit, users think the engine doesn't work.
Again, this is second-hand knowledge. I heard this from a friend who worked at MS for a while on their search team.
I am a graph/network enthusiast and this is just for my curiosity :)
I am trying to model the StackOverflow community as a graph/network. Assume that the people in the SO community are nodes and that the answers given to any question establish a relationship between these nodes. The relationship can be assumed to be directed (link from answer -> question) or undirected. The graph could be weighted, with the node weights representing the number of up/down votes (normalized on a scale of 0 to 1).
What kind of graph/network does one end up with at any given snapshot of time? Is it scale-free? Is it a small-world? The graph is continuously evolving over a period of time, and I would like to understand its structure and dynamics.
Is there a way I can retrieve this relationship data - maybe via SO APIs, or could someone from SO help me out with (sample) data?
Clarification edit:
Scale-free network: a network whose degree distribution asymptotically follows a power law.
Small-world network: a network whose sub-networks are characterized by connections between almost any two nodes within them, and where most pairs of nodes are connected by at least one short path.
To the second part of your question:
Is there a way I can retrieve this relationship data - maybe via SO APIs, or could someone from SO help me out with (sample) data?
Try these questions instead. There are a lot of plans to implement an API to access SO data. Some things are in flux, but there are possibilities to screen-scrape the data or access it via JSON (AFAIK).
Is there a guide to accessing StackOverflow data programmatically?
What would you want to see in a StackOverflow API?
Are there plans for a StackOverflow API?
Try it out. Good luck!
What kind of graph/network does one end up with at any given snapshot of time? Is it scale-free? Is it a small-world? The graph is continuously evolving over a period of time, and I would like to understand its structure and dynamics.
It takes only a few links between remote clusters to turn a random network into a small world one, so it's quite likely to be small world.
As to whether it's scale-free, that would require there to be a few posters with lots of answers and many with only one or two. I seem to recall Jeff saying in one of the podcasts that there were lots of users with only one question; you might be better off asking the question there rather than here, as he will have the data.
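If you do get hold of the data, a quick way to probe both questions is something like the following networkx sketch (the edge list is made up): the degree histogram speaks to the scale-free question, and the clustering coefficient plus average path length speak to the small-world one.

```python
# A sketch of how one might test the small-world / scale-free questions once the
# answerer -> asker relationship data is in hand; the edge list here is made up.
import collections
import networkx as nx

edges = [("alice", "bob"), ("carol", "bob"), ("alice", "dave"),
         ("erin", "alice"), ("bob", "carol"), ("dave", "erin")]
g = nx.Graph(edges)   # undirected variant; use nx.DiGraph for the answer -> question direction

# Scale-free hint: does the degree distribution have a heavy tail?
degree_counts = collections.Counter(dict(g.degree()).values())
print("degree histogram:", sorted(degree_counts.items()))

# Small-world hint: high clustering plus short average path length
# compared with a random graph of the same size.
print("clustering coefficient:", nx.average_clustering(g))
if nx.is_connected(g):
    print("average shortest path:", nx.average_shortest_path_length(g))
```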
What searching algorithm/concept is used in Google?
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Indexing
If you want to get down to basics:
Google uses an inverted index of the Internet. What this means is that Google has an index of all pages it's crawled based on the terms in each page. For instance the term Google maps to this page, the Google home page, and the Wikipedia article for Google, amongst others.
Thus, when you go to Google and type "Google" into the search box, Google checks its index of all terms available on the Internet and finds the entry for the term "Google" and with it the list of all pages that have that term referenced in it.
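A toy sketch of that inverted-index idea (the documents and IDs are made up):

```python
# A minimal sketch of an inverted index: map each term to the set of documents
# that contain it, then answer a query by intersecting the sets.
from collections import defaultdict

docs = {
    "google-home": "google search engine home page",
    "wikipedia-google": "google is a search company article",
    "cnn": "breaking news site",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def lookup(query):
    """Return documents containing every query term (AND semantics)."""
    term_sets = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

print(lookup("google search"))   # -> {'google-home', 'wikipedia-google'}
```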
For veteran users:
Google's index goes beyond your simple inverted index, however. This is why Google is the best. Google's crawlers (spiders) are smart. Very smart. Beyond just keeping track of the terms that are on any given web page, they also keep track of words that are on related pages and link those to the given document.
In other words, if a page has the term Google in it and the page has a link to or is linked from another web page, the other page may be referenced in the index under the term Google as well. All this and more go into why a given page is returned for a given query.
If you want to go into why pages are ordered the way they are in your search results, that gets into even more interesting stuff.
Ranking
To get down to basics:
Perhaps one of the most basic algorithms a search engine can use to sort your results is known as term frequency-inverse document frequency (tf-idf). Simply put, this means that your results will be ordered by the relative importance of your search terms in the document. In other words, a document that has 10 pages and lists the word Google once is not nearly as important as a document that has 1 page and lists the word Google ten times.
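Here is a tiny illustrative tf-idf scorer along those lines (the corpus is made up; real engines use smoothed variants of the formula):

```python
# A tiny tf-idf scorer: terms that are frequent in a document but rare across
# the collection score highest. The corpus here is invented for illustration.
import math
from collections import Counter

corpus = {
    "short-google": "google google google",
    "long-generic": ("search page " * 50 + "google"),
    "news": "breaking news headlines",
    "weather": "weather forecast tomorrow",
}
tokenized = {doc_id: text.split() for doc_id, text in corpus.items()}

def tf_idf(term, doc_id):
    tokens = tokenized[doc_id]
    tf = Counter(tokens)[term] / len(tokens)                       # term frequency
    df = sum(1 for toks in tokenized.values() if term in toks)     # document frequency
    idf = math.log(len(tokenized) / (1 + df))                      # inverse document frequency
    return tf * idf

for doc_id in corpus:
    print(doc_id, round(tf_idf("google", doc_id), 4))
# The short page that is mostly about "google" scores far above the long page
# that mentions it once, matching the description above.
```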
For veteran users:
Again, Google does quite a bit more than your basic search engine when it comes to ranking results. Google has implemented the aforementioned, patented PageRank algorithm. In short, PageRank enhances the tf-idf algorithm by taking into account the popularity/importance of a given page. At this point, popularity/importance may be judged by any number of factors that Google just won't tell us. However, at the most basic level, Google can tell that one page is more important than another because loads and loads of other pages link to it.
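For intuition only, here is a bare-bones PageRank power iteration over a made-up link graph; Google's production ranking obviously involves far more than this:

```python
# A bare-bones PageRank via power iteration: a page linked to by many
# (important) pages ends up with a higher score. The link graph is made up.
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}; returns {page: score}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            targets = outgoing or pages          # dangling pages spread evenly
            share = rank[page] / len(targets)
            for target in targets:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```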
Google's patented PigeonRank™
Wow, they initially posted this 7 years ago from Wednesday ...
PageRank is a link analysis algorithm used by Google for the search engine, but the patent was assigned to Stanford University.
I think "The Anatomy of a Large-Scale Hypertextual Web Search Engine" is a little outdated.
Here is a recent talk about scalability: Challenges in Building Large-Scale Information Retrieval Systems
Inverted indexes and MapReduce are the basics of most search engines (I believe). You create an index on the content and run queries against that index to determine relevance. Google, however, does much more than just a simple index of where each word occurs: they also track how many times it appears, where it appears, where it appears in relation to other words, the ordering, etc. Another simple concept that's used is "stop words", which may include things like "and", "the", and so on (basically "simple" words that occur often and generally are not the focus of a query). In addition, they employ things like PageRank (mentioned by TStamper) to order pages by relevance and importance.
MapReduce is basically taking one job and dividing it into smaller jobs and letting those smaller jobs run on many systems (partly for scalability and partly for speed). If I recall correctly, Google was able to make use of "average" computers to distribute jobs to instead of server-grade computers. Since the processing capability of a single computer is reaching a peak, many technologies are heading towards cloud computing, where a job is done by many physical machines.
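A toy single-machine illustration of that map/shuffle/reduce split (in real MapReduce each phase runs across many machines):

```python
# Toy word count in the MapReduce style: map each chunk to (word, 1) pairs,
# shuffle by key, then reduce by summing. In a real cluster the map and reduce
# phases would run in parallel on many machines.
from collections import defaultdict

def map_phase(chunk):
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped):
    grouped = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

chunks = ["the quick brown fox", "the lazy dog", "the fox"]   # pretend each chunk lives on a different node
word_counts = reduce_phase(shuffle([map_phase(c) for c in chunks]))
print(word_counts)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```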
I'm not sure how much searching Google does; it's more accurately crawling. The difference is that they just start at specific points, crawl to anything reachable, and repeat until they hit some sort of dead end.
While being interested in the PageRank algorithm and similar topics, I was disturbed to discover that the introduction of personalized search at the turn of the year (not widely commented on) seems to change quite a lot - see Failure of the Google Gold Standard and Google's Personalized Results.
This question cannot be answered canonically. The algorithms used by Google (and other search engines) are among their most closely guarded secrets and change constantly. Every correct answer can be invalid a month or a year later.
(I know this doesn't really answer the question, but that's the point, there is no possible answer.)