Best solution for filtering - performance

I'm new to graph databases and I'm studying how to create a good data model.
I have to manage 10 million "Contacts" and I would like to filter them by "gender". I created a POC and everything works, but I can't work out whether the best solution is to save the gender as its own vertex or as a field on the contact vertex.
I know that each edge impacts the data size, but I can't find any reference on the performance difference between these two approaches.
Do you know the right approach?

In this use case, I would put gender as a property on the vertex and add an index on that property to get your answer. While having gender as a separate vertex is more correct from a theoretical perspective, it has a few practical issues that lead me to suggest the second approach.
The first model you suggest will introduce a supernode into your graph: a node with a disproportionately high number of incident edges. The Gender vertex will have low selectivity (Male/Female/Unknown), so each gender vertex will have a branching factor in the millions. That level of branching will likely cause all sorts of performance problems and a slow query. Denormalizing the gender onto the contact vertex and adding an index should resolve most of these issues. The only issue likely to remain is the amount of time it will take to return the 3-5 million records you will likely receive.
In the first approach, answering the question "What is a person's gender?" would require traversing from the contact vertex across an edge to the gender vertex, which is slower than just reading the contact vertex. Assuming this is a query you frequently want to answer, this is a consideration you should take into account.
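To make the two options concrete, here is a minimal sketch of the query shapes using the gremlinpython driver. The labels and property names ('contact', 'gender', 'hasGender') and the server endpoint are illustrative assumptions, not taken from the question:

```python
# Sketch of the two query shapes via gremlinpython; the labels
# 'contact', 'gender', 'hasGender' and the server URL are made up.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

g = traversal().withRemote(
    DriverRemoteConnection('ws://localhost:8182/gremlin', 'g'))

# Option 2 (recommended): gender as an indexed property on the contact.
# With an index on 'gender' this is a direct lookup.
males = g.V().has('contact', 'gender', 'male').limit(100).toList()

# Option 1: gender as a separate vertex. Every query funnels through a
# supernode with millions of incident 'hasGender' edges.
males_via_vertex = (g.V().has('gender', 'name', 'male')
                     .in_('hasGender').limit(100).toList())
```

Whichever graph database you use, check how to create the property index (e.g. a composite index in JanusGraph) so the `has()` step does not have to scan all vertices.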

Related

What is the best data structure to store distance sensitive spatial data?

Suppose I have data about meteo station coordinates, historical temperature values, and city-centre coordinates worldwide. Meteo stations are placed at varying distances from the city centres. The task is to determine the average historical city temperature from the meteo-station data.
To solve it, for each city I need to find the set of closest meteo stations within some radius and average their data. The brute-force way is to calculate the distance from each city to each meteo station, but that's too slow for my data, so I thought some tree data structure could help. I tried R-trees to partition meteo stations by coordinates, but there is a problem: this approach lets me find the meteo stations in some tree node, but it gives me no information about neighbouring nodes, which I need to evaluate the radius condition quickly (for example, when a city is very close to an R-tree node's border).
Is there a standard tree data structure that allows fast search for the needed node, but also provides the set of spatial neighbours at the same tree level?
You should probably not be concerned about 'neighbours on the same level' or the like; that information does not necessarily mean much. I think you should probably:
Decide whether you want all weather stations within a given distance (a range query) or the closest k weather stations (a k-nearest query).
Then use the API of the index you are using to find those stations.
Then calculate the distances.
R-trees are okay for that, but they are usually quite slow to load. If loading times are a problem, you may want to try the R+tree, R*tree, or maybe quadtrees (for small datasets) or the PH-tree (for large datasets; my implementation is in Java).
How the data is organized inside the tree should not be a concern; whoever implemented the tree probably implemented the most efficient way of finding the desired neighbours.
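As a concrete sketch of the range-query versus k-nearest distinction, here is how it looks with SciPy's cKDTree; the coordinates are assumed to be already projected to a planar (x, y) system so Euclidean distance makes sense:

```python
# Range query vs. k-nearest query with a k-d tree (SciPy). Assumes
# station/city coordinates are already projected to planar (x, y).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
stations = rng.random((100_000, 2)) * 1000.0   # meteo station positions
temps = rng.random(100_000) * 30.0             # one reading per station
cities = rng.random((5_000, 2)) * 1000.0

tree = cKDTree(stations)

# Range query: all stations within radius r of each city.
r = 25.0
for city, idx in zip(cities, tree.query_ball_point(cities, r)):
    if idx:
        avg = temps[idx].mean()

# Alternative: the k closest stations, regardless of radius.
dists, idx = tree.query(cities, k=5)
avgs = temps[idx].mean(axis=1)
```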
How about using a database, and then querying it to find points close to a particular point? A lot of databases already support geospatial data, which you can index and query:
PostGIS
MongoDB geospatial indexes
MySQL spatial extensions
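For example, a sketch of the PostGIS route from Python; the table ('stations'), columns ('geom', 'temp'), and connection string are hypothetical placeholders:

```python
# Radius search with PostGIS via psycopg2. Table/column names and the
# connection string are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect("dbname=weather")
cur = conn.cursor()

city_lon, city_lat = 2.35, 48.85   # e.g. one city centre

# Average temperature of all stations within 25 km of the city centre.
cur.execute(
    """
    SELECT AVG(temp)
    FROM stations
    WHERE ST_DWithin(
        geom,
        ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
        25000)  -- metres
    """,
    (city_lon, city_lat))
avg_temp = cur.fetchone()[0]
```

This assumes `geom` is a geography column with a spatial (GiST) index, so ST_DWithin can use the index instead of scanning every row.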

Shortest distance/path between household addresses

If you wanted to know the shortest distance/path between two household addresses, which data structure(s) would you use to return the answer efficiently?
Say you are considering the set of all households in the United States (~100 million).
I am struggling to come up with a practical data structure considering the input size is so big. Dijkstra's seems too inefficient, but I'm guessing there is a way to preprocess the paths to make such a query possible. I'm just not sure where to start.
Dijkstra's algorithm, or something very similar, is probably the basis, although you can expect that it's highly optimized. If you put high weights on residential streets and reduce the weight as the roads' capacities increase, you narrow the search space pretty quickly.
You can also expect that there are pre-computed routes between major cities. So if you're in Miami and you want to get to Los Angeles, most of the route is pre-computed. You just need to figure out how to get from the house in Miami to the nearest highway interchange, and from the highway in Los Angeles to the destination.
Consider that the number of ZIP codes is less than 100,000, so it's not unthinkable to have a table that has pre-computed routes from every ZIP code to every other ZIP code. We're only talking 10 billion routes. Stored naively, that'd be a fair amount of data, but it's highly compressible. Consider, for example, if your ZIP code database just contained the route to the nearest major highway. Once you're on the major highways, the amount of data just isn't that large.
Although all of the roads are connected, it's not like you'd treat it as one huge graph. Rather, you have a bunch of smaller graphs--clusters--and you compute the routes between clusters. You'd also have clusters within clusters until the data gets to a manageable size.
At least, that's how I'd go about solving the problem.
The A* algorithm may be used here.
It's essentially an extension of Dijkstra's algorithm, where you add a 'heuristic' to each node's value, which is the estimated distance to the destination.
In this specific case, assuming you have access to the coordinates of each house, you can determine the straight-line distance to the destination as the heuristic.
Beyond this, Jim's suggestions are also good.
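Here is a compact sketch of A* as described, with straight-line distance as the heuristic; the adjacency-dict graph format and coordinate table are assumptions for illustration:

```python
# A* over an adjacency dict, with straight-line distance to the goal as
# the heuristic. Assumed format: graph[node] = [(neighbour, cost), ...]
# and coords[node] = (x, y).
import heapq, itertools, math

def a_star(graph, coords, start, goal):
    def h(n):  # admissible heuristic: straight-line distance to goal
        return math.dist(coords[n], coords[goal])

    tie = itertools.count()               # tie-breaker for the heap
    open_heap = [(h(start), 0.0, next(tie), start)]
    best_g = {start: 0.0}
    came_from = {}
    while open_heap:
        f, g, _, node = heapq.heappop(open_heap)
        if node == goal:                  # reconstruct the path
            path = [node]
            while node in came_from:
                node = came_from[node]
                path.append(node)
            return path[::-1], g
        if g > best_g.get(node, math.inf):
            continue                      # stale heap entry
        for nbr, cost in graph[node]:
            ng = g + cost
            if ng < best_g.get(nbr, math.inf):
                best_g[nbr] = ng
                came_from[nbr] = node
                heapq.heappush(open_heap, (ng + h(nbr), ng, next(tie), nbr))
    return None, math.inf
```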

Marauders dilemma algorithm

I'm making this repost after the earlier one here with more details.
PROBLEM:
The problem consists of a marauder who has to travel to different cities spread over a map. The starting location is known. Each city has a fixed loot associated with it, and the marauder has to travel across terrain of varying nature; by nature of terrain, I mean there is a varied cost of travel between each pair of cities. He has to maximize the booty gained.
What we have done:
We have generated an adjacency matrix (with the booty and path cost in place for each node) and then employed a heuristic analysis. It gave some output which is reasonable.
Now the problem is that each city has a few or more vehicles in it, which can be bought (by paying) and used to travel. What a vehicle actually does is reduce the path cost. Once a vehicle is bought, it remains until the next vehicle is bought. It is up to us to decide whether to buy a vehicle or not, and which one.
I need help at this point. How do we integrate the idea of vehicles into what we already have? Also, any further ideas which may help us maximize the profit are welcome. I can post the code if required. Thanks!
One way to do it would be to add a directed edge, bearing the cost of the vehicle, towards a duplicate copy of the graph with the reduced costs. You can even make the reduction finer than just a percentage if you want to.
The downside is that this will probably increase the size of the graph a lot (as many copies as you have different vehicles, plus the links between them), and if your heuristic is not optimal, you may have to modify it so that it treats the new edges appropriately.
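A sketch of that construction, assuming for brevity that every vehicle can be bought in every city (in the real problem you would only add buying edges where a vehicle is actually available); all names and factors are made up:

```python
# Layered ("duplicate") graph: a state is (city, vehicle). Moving stays
# inside a layer at that vehicle's discounted cost; buying a vehicle is
# an edge between layers costing its price. For brevity this assumes any
# vehicle can be bought in any city; restrict the buying edges otherwise.
def build_layered_graph(base_costs, vehicle_price, discount):
    """base_costs[(a, b)]: travel cost a -> b on foot.
    vehicle_price[v]: cost to buy v; discount[v]: factor in (0, 1]."""
    cities = {a for a, _ in base_costs} | {b for _, b in base_costs}
    layers = [None] + sorted(vehicle_price)      # None = on foot
    edges = {}  # (city, vehicle) -> [((city, vehicle), cost), ...]
    for v in layers:
        factor = 1.0 if v is None else discount[v]
        for (a, b), c in base_costs.items():
            edges.setdefault((a, v), []).append(((b, v), c * factor))
    for v in layers:                             # buying edges
        for w in vehicle_price:
            if w != v:
                for city in cities:
                    edges.setdefault((city, v), []).append(
                        ((city, w), vehicle_price[w]))
    return edges
```

Your existing heuristic search then runs unchanged over this bigger graph, starting from the state `(start_city, None)`.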
It sounds as though beam search would suit this problem. Beam search uses a heuristic function H and a parameter k and works like this:
1. Initialize the set S to the initial game position.
2. Set T to the empty set.
3. For each game position in S, generate all possible successor positions after one move by the marauder. (A move being to loot, to purchase a vehicle, to move to an adjacent city, or whatever else a marauder can do.) Add each such successor position to the set T.
4. For each position p in T, evaluate H(p) for a heuristic function H. (The heuristic function can take into account the amount of loot, the possession of a vehicle, the number of remaining unlooted cities, and whatever else you think is relevant and easy to compute.)
5. If you've run out of search time, return the best-scoring position in T.
6. Otherwise, set S to the best-scoring k positions in T and go back to step 2.
The algorithm works well if you store T in the form of a heap with k elements.
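A sketch of that loop follows; `successors` and `H` are problem-specific stand-ins you would supply, and a depth limit stands in for the search-time budget:

```python
# Beam search as described above. `successors(pos)` should yield every
# position reachable in one marauder move; `H` scores a position.
import heapq

def beam_search(initial, successors, H, k, max_depth):
    S = [initial]
    best = initial
    for _ in range(max_depth):            # stands in for "search time"
        T = []
        for pos in S:
            T.extend(successors(pos))     # loot, buy vehicle, move, ...
        if not T:
            break
        S = heapq.nlargest(k, T, key=H)   # keep only the k best (a heap)
        best = max(best, S[0], key=H)
    return best
```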

What data do I need to implement k nearest neighbor?

I currently have a reddit-clone type website. I'm trying to recommend posts based on the posts that my users have previously liked.
It seems like K nearest neighbor or k means are the best way to do this.
I can't seem to understand how to actually implement this. I've seen some mathematical formulas (such as the one on the k means wikipedia page), but they don't really make sense to me.
Could someone maybe recommend some pseudo code, or places to look so I can get a better feel on how to do this?
K-Nearest Neighbor (aka KNN) is a classification algorithm.
Basically, you take a training group of N items and classify them. How you classify them is completely dependent on your data, and what you think the important classification characteristics of that data are. In your example, this may be category of posts, who posted the item, who upvoted the item, etc.
Once this 'training' data has been classified, you can then evaluate an 'unknown' data point. You determine the 'class' of the unknown point by locating the nearest neighbours to it in the classification system. If you determine the classification from the 3 nearest neighbours, it could then be called a 3-nearest-neighbour algorithm.
How you determine the 'nearest neighbour' depends heavily on how you classify your data. It is very common to plot the data into N-dimensional space, where each of the N dimensions represents one of the classification characteristics you are examining.
A trivial example:
Let's say you have the longitude/latitude coordinates of a location that can be on any landmass anywhere in the world. Let us also assume that you do not have a map, but you do have a very large data set that gives you the longitude/latitude of many different cities in the world, and you also know which country those cities are in.
If I asked you which country a random longitude/latitude point is in, would you be able to figure it out? What would you do to figure it out?
Longitude/latitude data falls naturally into an X,Y graph. So, if you plotted out all the cities onto this graph, and then the unknown point, how would you figure out the country of the unknown? You might start drawing circles around that point, growing increasingly larger until the circle encompasses the 10 nearest cities on the plot. Now, you can look at the countries of those 10 cities. If all 10 are in the USA, then you can say with a fair degree of certainty that your unknown point is also in the USA. But if only 6 cities are in the USA, and the other 4 are in Canada, can you say where your unknown point is? You may still guess USA, but with less certainty.
The toughest part of KNN is figuring out how to classify your data in a way that you can determine 'neighbors' of similar quality, and the distance to those neighbors.
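The city/country example above, sketched with scikit-learn; the four training cities are tiny placeholders for a real database:

```python
# The city/country example with scikit-learn. The training data is a
# tiny placeholder; a real database would have many cities.
from sklearn.neighbors import KNeighborsClassifier

# (longitude, latitude) of known cities and their country labels.
city_coords = [(-74.0, 40.7), (-87.6, 41.9), (-79.4, 43.7), (-123.1, 49.3)]
countries = ['USA', 'USA', 'Canada', 'Canada']

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(city_coords, countries)

# Classify an unknown point by majority vote among its 3 nearest cities.
print(knn.predict([(-80.0, 43.0)]))        # -> ['USA'] (2 of 3 neighbours)
print(knn.predict_proba([(-80.0, 43.0)]))  # the vote proportions
```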
What you described sounds like a recommender system engine, not a clustering algorithm like k-means, which is in essence an unsupervised approach. I don't have a clear idea of what reddit actually uses, but I found some interesting posts by googling around "recommender + reddit", e.g. Reddit, Stumbleupon, Del.icio.us and Hacker News Algorithms Exposed! Anyway, the k-NN algorithm (described in the top ten data mining algorithms, with pseudo-code on Wikipedia) might be used, or other techniques like collaborative filtering (used by Amazon, for example), described in this good tutorial.
k-means clustering, in its simplest form, groups values around central average values (the cluster means). Suppose you have the following values:
1,2,3,4,6,7,8,9,10,11,12,21,22,33,40
Now if I do k-means clustering on them, remembering that k-means has a biasing (means/averaging) mechanism that pulls each value toward its nearest centre, we might get the following:
cluster-1
1,2,3,4,6,7,8,9
cluster-2
10,11,12
cluster-3
21,22
cluster-4
33
cluster-5
40
Remember I just made up these cluster centers (cluster 1-5).
So the next time you do clustering, the numbers would end up around one of these central means (also known as k-centres). The data above is one-dimensional.
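For reference, here is the one-dimensional example run through scikit-learn's KMeans (the clusters it finds need not match the hand-made grouping above):

```python
# The one-dimensional example above, clustered with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

values = np.array([1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 21, 22, 33, 40])
X = values.reshape(-1, 1)                # KMeans expects 2-D input

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
for c in range(5):
    print(c, km.cluster_centers_[c][0], values[km.labels_ == c])
```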
When you perform k-means clustering on large, multi-dimensional data sets (multi-dimensional data here means each point is an array of values, and you may have millions of them, all of the same dimension), you will need something bigger and more scalable. You would first average one array to get a single value, do likewise for the other arrays, and then perform the k-means clustering on those values.
Read one of my questions Here
Hope this helps.
To do k-nearest neighbours you mostly need a notion of distance and a way of finding the k nearest neighbours to a point that you can afford (you probably don't want to search through all your data points one by one). There is a library for approximate nearest neighbours at http://www.cs.umd.edu/~mount/ANN/. It's a very simple classification algorithm: to classify a new point p, find its k nearest neighbours and classify p according to the most popular class amongst those k neighbours.
I guess in your case you could provide somebody with a list of similar posts as soon as you decide what 'nearest' means, then monitor click-throughs and try to learn from them to predict which of those alternatives would be most popular.
If you are interested in finding a particularly good learning algorithm for your purposes, have a look at http://www.cs.waikato.ac.nz/ml/weka/ - it allows you to try out a large number of different algorithms, and also to write your own as plug-ins.
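One cheap notion of 'nearest' for posts is cosine similarity over the users-who-liked-them vectors (item-based collaborative filtering); the likes matrix below is a placeholder:

```python
# Post-to-post similarity from a user-likes matrix (placeholder data).
import numpy as np

# likes[u, p] = 1 if user u liked post p.
likes = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1]], dtype=float)

norms = np.linalg.norm(likes, axis=0)
norms[norms == 0] = 1.0                        # avoid division by zero
sim = (likes.T @ likes) / np.outer(norms, norms)  # cosine similarity

post = 0
neighbours = np.argsort(sim[post])[::-1][1:3]  # 2 most similar posts
print(neighbours)
```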
Here is a very simple example of KNN for the MNIST dataset.
Once you are able to calculate the distance between your documents, the same algorithm would work:
http://shyamalapriya.github.io/digit-recognition-using-k-nearest-neighbors/

A special case of grouping coordinates

I'm trying to write a program to place students in cars for carpooling to an event. I have the addresses for each student and can geocode each address to get coordinates (the addresses are close enough that I can simply use Euclidean distances between coordinates). Some of the students have cars and can drive others. How can I efficiently group students into cars? I know that grouping is usually done using algorithms like k-means, but I can only find algorithms that group N points into M arbitrary-sized groups. My groups are of a specific size and positioning. Where can I start? A simple greedy algorithm will ensure the first cars assigned have minimum pick-up distance, but the average will be high, I imagine.
Say that you are trying to minimize the total distance travelled. Clearly the travelling salesman problem is a special instance of your problem, so your problem is NP-hard. That puts us in the heuristics/approximation-algorithms domain.
The problem also needs some more specification, for example how many students can fit in a given car. Let's say: as many as you want.
How about solving it as a minimum spanning tree rooted at the final destination? Then each student with a car is responsible for collecting all of their children nodes. The total distance travelled is then at most 2x the total length of the spanning tree, which is a 2x bound right there. Of course this is ridiculous, because the nodes next to the root would be driving a mega bus instead of a car.
So then you start playing the packing game where you try to fill the cars greedily.
I know this is not a solution, but this might help you specify the problem better.
This is an old question, but since I found it, others will as well.
Group students together by distance. Find the distance between all pairs of students. Start with the closest students, add them to a group, and continue adding until all students are in groups. If students are beyond a threshold distance, like 50 miles, don't combine them into a group (this will cause a few students to go solo). If students have different sized cars, stop adding to a group when the max car size has been reached among the students in the group (and whichever one you're trying to add).
Finding the optimal (you asked for efficient) solution would require a more defined problem, which it seems you don't have. If you wanted to eliminate solo drivers, though, taking the above solution, special-casing the outliers, and working them into adjacent groups by swapping people around could find a very strong solution.
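A sketch of the greedy threshold grouping described above, assuming a uniform car capacity for simplicity (per-student car sizes would replace the `capacity` check):

```python
# Greedy grouping: repeatedly merge the closest pair of groups, subject
# to a capacity and a distance threshold. Assumes uniform car capacity.
import math

def group_students(coords, capacity=4, max_dist=50.0):
    """coords: {student: (x, y)}. Returns a list of groups."""
    groups = [[s] for s in coords]

    def gap(g1, g2):  # closest student pair across the two groups
        return min(math.dist(coords[a], coords[b]) for a in g1 for b in g2)

    while True:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if len(groups[i]) + len(groups[j]) > capacity:
                    continue
                d = gap(groups[i], groups[j])
                if d <= max_dist and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:       # nothing mergeable: some students go solo
            return groups
        _, i, j = best
        groups[i].extend(groups.pop(j))
```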
