Matching geopoints with Hadoop toolset

I have two datasets, let's say checkins and POIs, and I have to join them based on geo-coordinates: say, if a user was seen within a radius of N km of a POI, I need to join them (in other words, I want to collect all users near each POI for further manipulations). But I have an issue with this geo matching...
Initially I see two different options:
1) implement LSH (locality-sensitive hashing) - looks really complicated and performance might suffer as well
2) split the whole map into regions (a 2D matrix) and then calculate which regions are within N km of each checkin or POI and emit all of them - as a result some deduplication must be applied - so I'm not sure it is a valid algorithm at all
Are there any best practices?

Interesting problem.
I assume you have considered the naive brute-force approach and found it too time-consuming for your purposes. In the brute-force approach you calculate distances between each of n POIs and each of m checkins, leading to time complexity of O(n*m).
The simplest heuristic I can think of that is also applicable in Spark is to avoid a full linear scan of one dataset by grouping its elements into buckets. Something like this:
case class Position(x: Double, y: Double)
val checkins: RDD[Position] = ???
val radius = 10
val checkinBuckets = checkins.groupBy(pos => (pos.x/radius).toInt)
Instead of a full linear scan, one can then search only the corresponding bucket plus the next and the previous one. If necessary, one can create a second level by grouping the buckets to further speed up the lookup. Also, one should take care of details like correct rounding of pos.x/radius, GPS distance calculation, etc.
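To make the idea concrete, here is a rough single-process Python sketch of the 2-D grid-cell variant (option 2 of the question combined with the bucket idea above). The cell size, the degree-to-km conversion and all names are my assumptions, not the answerer's code:

import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def cell(lat, lon, size_km):
    # crude square grid: one degree of latitude is roughly 111 km; near the
    # poles the longitude cells shrink, so the neighbour search must be widened there
    deg = size_km / 111.0
    return (int(lat // deg), int(lon // deg))

def match(checkins, pois, radius_km):
    # checkins, pois: iterables of (id, lat, lon) tuples
    grid = defaultdict(list)
    for cid, lat, lon in checkins:
        grid[cell(lat, lon, radius_km)].append((cid, lat, lon))
    for pid, plat, plon in pois:
        ci, cj = cell(plat, plon, radius_km)
        for di in (-1, 0, 1):              # the POI's cell plus its 8 neighbours
            for dj in (-1, 0, 1):
                for cid, lat, lon in grid.get((ci + di, cj + dj), []):
                    if haversine_km(plat, plon, lat, lon) <= radius_km:
                        yield pid, cid

In MapReduce terms you would emit each checkin under its cell id and each POI under its cell id plus the 8 neighbouring ids, join on the cell id, and apply the exact distance filter on the joined pairs.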
Of course, you can always dive into the various approaches to the nearest neighbour search problem, as proposed by @huitseeker. Also, this paper has a nice intro to NNS.

Related

K-Means Clustering Partitioning

I am using MATLAB and I have a very big .mat file named MeansOfK that contains almost 5,000,000 x N entries. My test data consists of car and non-car samples. My problem is that when I try to apply k-means to MeansOfK, it always runs out of memory.
[idx, ctr] = kmeans(MeansOfK , k, 'Distance', 'sqEuclidean');
My options are:
1. Use a divide-and-conquer technique, wherein I partition the car and non-car data into smaller partitions and feed them into k-means.
2. Separate the car and non-car classes and run k-means on each class.
The final output would be the combined car and non-car classes from the k-means process.
So my questions are:
Is what I will be doing feasible?
Will it affect the output of my k-means if I partition the file rather than doing it as a whole?
Suggestions and answers are always appreciated :)
Thanks
What you can do is leverage the Johnson-Lindenstrauss lemma: you embed your dataset into a lower-dimensional space and then do the k-means computation on the smaller dataset. For instance, if your data matrix is A you can do:
% N is the original dimension (the number of columns of A) and s is the reduced dimension
S = randn(N, s) / sqrt(s);
C = A * S;
% now you can do your kmeans computation on the projected data C
[idx, ctr] = kmeans(C, k, 'Distance', 'sqEuclidean');
Basically you can use the idx and ctr results for the original dataset, which will give you a (1+epsilon) approximation. You can also reach better results based on work by Dan Feldman, which basically says that you can compute an SVD of your data and project onto the top k/epsilon singular values to compute the k-means result and get a (1+epsilon) approximation.
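For completeness, a rough scikit-learn version of the same random-projection trick (a sketch only; the shapes and parameter values are made up, and only the class and function names are sklearn's):

import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.cluster import KMeans

A = np.random.rand(100000, 50)          # stand-in for your big data matrix
k, s = 8, 10

C = GaussianRandomProjection(n_components=s).fit_transform(A)   # roughly A * S / sqrt(s)
km = KMeans(n_clusters=k, n_init=10).fit(C)
idx = km.labels_                         # cluster assignment per original row
# centroids back in the original space: average the original rows of each cluster
ctr = np.array([A[idx == j].mean(axis=0) for j in range(k)])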
UPDATE
Based on the comments I'd like to suggest leveraging the coreset approach, again based on the paper by Dan Feldman et al., Turning Big Data Into Tiny Data. The technique provides the capability to reduce a large volume of data to a much smaller one with a provable guarantee of a (1+epsilon) approximation to the optimal k-means solution. Moreover, you can proceed with a streaming coreset construction, which allows you to maintain an O(log n * epsilon) approximation while streaming your data (section 10, figure 3), e.g. in your case partitioned into smaller chunks. Eventually you can run the k-means computation on the resulting coreset.
You might also consider taking a look at my recent publication for more details on how to handle your case. There you can also find a reference to my GitHub account if you'd like to use it.
I would say your only real option, if increasing memory is not possible, is to partition the data into smaller sets. When I ran a big data project using collaborative filtering algorithms we dealt with sets as large as 700 million+, and whenever we maxed out the memory we needed to partition the data into smaller sets and run the algorithms on them separately.
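If partitioning is the route you take, one practical stand-in (in Python/scikit-learn terms rather than MATLAB, and not the coreset construction from the other answer) is mini-batch k-means, which consumes the data chunk by chunk so the full matrix never has to sit in memory. A hedged sketch with made-up sizes:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

k = 10
mbk = MiniBatchKMeans(n_clusters=k, batch_size=10000, n_init=3)

# stream the data in chunks instead of loading the full matrix at once
for chunk in np.array_split(np.random.rand(500000, 20), 50):
    mbk.partial_fit(chunk)

ctr = mbk.cluster_centers_
# labels for any chunk can then be obtained with mbk.predict(chunk)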

KMeans evaluation metric not converging. Is this normal behavior or not?

I'm working on a problem that necessitates running KMeans separately on ~125 different datasets. Therefore, I'm looking to mathematically calculate the 'optimal' K for each respective dataset. However, the evaluation metric continues decreasing with higher K values.
For a sample dataset, there are 50K rows and 8 columns. Using sklearn's calinski-harabaz score, I'm iterating through different K values to find the optimum / minimum score. However, my code reached k=5,600 and the calinski-harabaz score was still decreasing!
Something weird seems to be happening. Does the metric not work well? Could my data be flawed (see my question about normalizing rows after PCA)? Is there another/better way to mathematically converge on the 'optimal' K? Or should I force myself to manually pick a constant K across all datasets?
Any additional perspectives would be helpful. Thanks!
I don't know anything about the calinski-harabaz score, but some score metrics will be monotonically increasing or decreasing with respect to increasing K. For instance, the mean squared error for linear regression will always decrease each time a new feature is added to the model, so other scores that add penalties for an increasing number of features have been developed.
There is a very good answer here that covers CH scores well. A simple method that generally works well for these monotone scoring metrics is to plot K vs the score and choose the K where the score is no longer improving 'much'. This is very subjective but can still give good results.
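For what it's worth, a small sketch of that score-vs-K curve with scikit-learn; the data and the K range are made up, and recent releases spell the metric calinski_harabasz_score (older ones used calinski_harabaz_score):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = np.random.rand(5000, 8)             # stand-in for one of your datasets
scores = {}
for k in range(2, 31):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)
# plot scores.keys() vs scores.values() and pick the K where the curve stops
# improving 'much' (for CH, higher is better)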
SUMMARY
The metric decreases with each increase of K; this strongly suggests that you do not have a natural clustering in the data set.
DISCUSSION
CH scores depend on the ratio between intra- and inter-cluster densities. For a relatively smooth distribution of points, each increase in K will give you clusters that are slightly more dense, with slightly lower density between them. Try a lattice of points: vary the radius and do the computations by hand; you'll see how that works. At the extreme end, K = n: each point is its own cluster, with infinite density, and 0 density between clusters.
OTHER METRICS
Perhaps the simplest metric is sum-of-squares, which is already part of the clustering computations. Sum the squares of distances from the centroid, divide by n-1 (n=cluster population), and then add/average those over all clusters.
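A tiny numpy sketch of that sum-of-squares metric as described above (the function name and array shapes are my assumptions):

import numpy as np

def avg_within_cluster_ss(X, labels, centroids):
    # X: (n_samples, n_features); labels/centroids from any clustering run
    scores = []
    for k, c in enumerate(centroids):
        members = X[labels == k]
        if len(members) > 1:
            # sum of squared distances from the centroid, divided by n-1
            scores.append(((members - c) ** 2).sum() / (len(members) - 1))
    return float(np.mean(scores))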
I'm looking for a particular paper that discusses metrics for this very problem; if I can find the reference, I'll update this answer.
N.B. With any metric you choose (as with CH), a failure to find a local minimum suggests that the data really don't have a natural clustering.
WHAT TO DO NEXT?
Render your data in some form you can visualize. If you see a natural clustering, look at the characteristics; how is it that you can see it, but the algebra (metrics) cannot? Formulate a metric that highlights the differences you perceive.
I know, this is an effort similar to the problem you're trying to automate. Welcome to research. :-)
The problem with my question is that the 'best' Calinski-Harabaz score is the maximum, whereas my question assumed the 'best' was the minimum. It is computed by analyzing the ratio of between-cluster dispersion vs. within-cluster dispersion, the former/numerator you want to maximize, the latter/denominator you want to minimize. As it turned out, in this dataset, the 'best' CH score was with 2 clusters (the minimum available for comparison). I actually ran with K=1, and this produced good results as well. As Prune suggested, there appears to be no natural grouping within the dataset.

clustering algorithm for objects which have multiple feature time series information

I am looking for a clustering algorithm which can handle multiple time series of information for each object.
For example, for company "A" we have time series of 3 features (e.g. income, sales, inventory).
In the same way, company "B" also has time series of the same features, and so on.
Then, how can we cluster the set of companies?
Is there some wise way to handle this?
A lot of clustering algorithms ask you to provide some measure of the similarity or distance between two points. It is really up to you to decide what features are important and what the distance really is. One way forwards would be to use the correlation between two time series. This gives you a similarity. If you have to convert this to a distance I would use sqrt(1-r), where r is the correlation, because if you look e.g. at the equation at the bottom of http://www.analytictech.com/mb876/handouts/distance_and_correlation.htm you can see that this is proportional to a distance if you have points in n-dimensional space. If you have three different time series (income, sales, inventory) I would use the sum of the three distances worked out from the correlations between the two time series of the same type.
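A rough sketch of that recipe (correlation per feature, sqrt(1-r) as a distance, summed over the three features, then handed to a standard clustering routine); the array shapes and parameter values here are assumptions:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# series[c][f] = the time series of feature f (income, sales, inventory) for company c
n_companies, n_features, T = 20, 3, 60
series = np.random.rand(n_companies, n_features, T)

D = np.zeros((n_companies, n_companies))
for a in range(n_companies):
    for b in range(a + 1, n_companies):
        d = sum(np.sqrt(max(0.0, 1 - np.corrcoef(series[a, f], series[b, f])[0, 1]))
                for f in range(n_features))
        D[a, b] = D[b, a] = d

# hand the precomputed distances to hierarchical clustering and cut into, say, 4 clusters
labels = fcluster(linkage(squareform(D), method="average"), t=4, criterion="maxclust")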
Another option, especially if the time series are not very long, would be to regard a time series of length n as a point in n-dimensional space and feed this into the clustering algorithm, or use http://en.wikipedia.org/wiki/Principal_component_analysis to reduce the n dimensions down to 1 by looking at the most significant components (while you are doing this, it never hurts to plot the points using the least significant components and investigate points that stand out from the others. Points where the data is in error sometimes stand out here).

Optimal placement of objects wrt pairwise similarity weights

OK, this is an abstract algorithmic challenge, and it will remain abstract since where I am going to use it is top secret.
Suppose we have a set of objects O = {o_1, ..., o_N} and a symmetric similarity matrix S where s_ij is the pairwise correlation of objects o_i and o_j.
Assume also that we have a one-dimensional space with discrete positions where objects may be placed (like having N boxes in a row, or chairs for people).
Given a certain placement, we measure the cost of moving from the position of one object to that of another as the number of boxes we need to pass until we reach our target, multiplied by their pairwise object similarity. Moving from a position to the box right after or before it has zero cost.
Imagine an example where for three objects we have the following similarity matrix:
        1.0  0.5  0.8
S =     0.5  1.0  0.1
        0.8  0.1  1.0
Then, the best ordering of objects in the three boxes is obviously:
[o_3] [o_1] [o_2]
The cost of this ordering is the sum of costs (counting boxes) for moving from one object to all the others. So here we have a cost only for the distance between o_2 and o_3, equal to 1 box * 0.1 sim = 0.1; the same cost is obtained by the mirrored ordering:
[o_2] [o_1] [o_3]
On the other hand:
[o_1] [o_2] [o_3]
would have cost = cost(o_1-->o_3) = 1box * 0.8sim = 0.8.
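To pin the cost down, here is a tiny script of mine that just encodes it and brute-forces the 3-object example (only feasible for very small N; objects are zero-indexed here):

from itertools import permutations

S = [[1.0, 0.5, 0.8],
     [0.5, 1.0, 0.1],
     [0.8, 0.1, 1.0]]

def cost(order, S):
    # order[p] = object placed at position p; adjacent objects cost nothing
    pos = {obj: p for p, obj in enumerate(order)}
    n = len(order)
    return sum((abs(pos[i] - pos[j]) - 1) * S[i][j]
               for i in range(n) for j in range(i + 1, n))

best = min(permutations(range(len(S))), key=lambda order: cost(order, S))
print(best, cost(best, S))   # (1, 0, 2) with cost 0.1, i.e. [o_2] [o_1] [o_3]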
The target is to determine a placement of the N objects in the available positions in a way that we minimize the above mentioned overall cost for all possible pairs of objects!
An analogue is to imagine that we have a table with chairs side by side in one row only (like the boxes) and you need to seat N people on the chairs. Those people have some relation, that is, let's say, how probable it is that one of them wants to speak to another. Speaking means standing up, passing by a number of chairs and talking to the person sitting there. When two people sit on successive chairs they don't need to move in order to talk to each other.
So how can we seat those people so that the overall distance-cost between pairs of people is minimized? This means that during the night the total distance walked by the guests is close to the minimum.
Greedy search is... ok forget it!
I am interested in hearing whether there is a standard formulation of such a problem for which I could find some literature, and also about different search approaches (e.g. dynamic programming, tabu search, simulated annealing, etc. from the combinatorial optimization field).
Looking forward to hearing your ideas.
PS. My question has something in common with this thread: Algorithm for ordering a list of Objects, but I think here it is better posed as a problem and is probably slightly different.
That sounds like an instance of the Quadratic Assignment Problem. The special feature is that the locations are placed on one line only, but I don't think this makes it easier to solve. The QAP in general is NP-hard. Unless I misinterpreted your problem, you can't find an optimal algorithm that solves it in polynomial time without proving P=NP at the same time.
If the instances are small you can use exact methods such as branch and bound. You can also use tabu search or other metaheuristics if the problem is more difficult. We have an implementation of the QAP and some metaheuristics in HeuristicLab. You can configure the problem in the GUI, just paste the similarity and the distance matrix into the appropriate parameters. Try starting with the robust Taboo Search. It's an older, but still quite well working algorithm. Taillard also has the C code for it on his website if you want to implement it for yourself. Our implementation is based on that code.
There has been a lot of publications done on the QAP. More modern algorithms combine genetic search abilities with local search heuristics (e. g. Genetic Local Search from Stützle IIRC).
Here's a variation of the already posted method. I don't think this one is optimal, but it may be a start.
Create a list of all the pairs in descending cost order.
While the list is not empty:
    Pop the head item from the list.
    If neither element is in an existing group, create a new group containing the pair.
    If one element is in an existing group, add the other element to whichever end puts it closer to the group member.
    If both elements are in existing groups, combine the groups so as to minimize the distance between the pair.
Group combining may require reversal of the order within a group, and the data structure should be designed to support that.
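A rough Python rendering of this procedure, under my own reading of it: "cost order" is taken as similarity order, since that is what makes a separation expensive, and the group handling is simplified, so this is a sketch rather than the answerer's exact data structure:

def greedy_arrange(S):
    n = len(S)
    # all pairs, most similar first: these are the ones we want adjacent
    pairs = sorted(((S[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
                   reverse=True)
    groups = []   # ordered chains of already-placed elements
    where = {}    # element index -> index of its chain in `groups`

    for _, i, j in pairs:
        gi, gj = where.get(i), where.get(j)
        if gi is None and gj is None:
            groups.append([i, j])                 # start a new chain with the pair
            where[i] = where[j] = len(groups) - 1
        elif gi is None or gj is None:
            free, anchor = (i, j) if gi is None else (j, i)
            g = groups[where[anchor]]
            # attach to whichever end is closer to the already-placed element
            if g.index(anchor) < len(g) - 1 - g.index(anchor):
                g.insert(0, free)
            else:
                g.append(free)
            where[free] = where[anchor]
        elif gi != gj:
            ga, gb = groups[gi], groups[gj]
            # flip the chains so i ends up near the tail of ga and j near the
            # head of gb, keeping the pair reasonably close after concatenation
            if ga.index(i) < len(ga) / 2:
                ga.reverse()
            if gb.index(j) > len(gb) / 2:
                gb.reverse()
            ga.extend(gb)
            for e in gb:
                where[e] = gi
            groups[gj] = []

    placed = [e for g in groups for e in g]
    return placed + [e for e in range(n) if e not in where]

On the 3x3 example from the question this returns [1, 0, 2], i.e. [o_2] [o_1] [o_3], the mirror of the optimal ordering, with the same cost of 0.1.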
Let me help the thread (my own) with a simplistic ordering approach.
1. Sort the upper triangle of the similarity matrix in descending order.
2. Start with the pair of objects having the highest similarity weight and place them in the center positions.
3. The next object may be put on the left or the right side of them. Each time, select the object that, when put to the left or right, has the highest cost with respect to the pre-placed objects. Repeat Step 3 until all objects are placed.
The selection in Step 3 is made because if you leave this object to be placed later, its cost will again be the greatest among the remaining ones, and even larger (it will be farther from the pre-placed objects). So the costly placements should be done as early as possible.
This is too simple and of course does not discover a good solution.
Another approach is to
1. start with a complete ordering generated somehow (random or from another algorithm)
2. try to improve it using "swaps" of object pairs.
I believe local minima would be a huge deterrent.
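For what it's worth, a minimal sketch of that swap-based improvement (random pairwise exchanges, accepting only improvements). Local minima are indeed the weak spot, so restarting from several initial orderings helps; all names here are mine:

import random

def arrangement_cost(order, S):
    pos = {obj: p for p, obj in enumerate(order)}
    n = len(order)
    return sum((abs(pos[i] - pos[j]) - 1) * S[i][j]
               for i in range(n) for j in range(i + 1, n))

def swap_improve(order, S, iters=10000):
    order = list(order)
    best = arrangement_cost(order, S)
    for _ in range(iters):
        a, b = random.sample(range(len(order)), 2)
        order[a], order[b] = order[b], order[a]          # try a swap
        c = arrangement_cost(order, S)
        if c < best:
            best = c                                     # keep the improvement
        else:
            order[a], order[b] = order[b], order[a]      # undo it
    return order, best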

Unsupervised clustering with unknown number of clusters

I have a large set of vectors in 3 dimensions. I need to cluster these based on Euclidean distance such that all the vectors in any particular cluster have a Euclidean distance between each other less than a threshold "T".
I do not know how many clusters exist. At the end, there may be individual vectors that are not part of any cluster because their Euclidean distance to every other vector in the space is not less than "T".
What existing algorithms / approach should be used here?
You can use hierarchical clustering. It is a rather basic approach, so there are lots of implementations available. It is for example included in Python's scipy.
See for example the following script:
import matplotlib.pyplot as plt
import numpy
import scipy.cluster.hierarchy as hcluster
# generate 3 clusters of around 100 points each, plus one orphan point
N=100
data = numpy.random.randn(3*N,2)
data[:N] += 5
data[-N:] += 10
data[-1:] -= 20
# clustering
thresh = 1.5
clusters = hcluster.fclusterdata(data, thresh, criterion="distance")
# plotting
plt.scatter(*numpy.transpose(data), c=clusters)
plt.axis("equal")
title = "threshold: %f, number of clusters: %d" % (thresh, len(set(clusters)))
plt.title(title)
plt.show()
This produces a result similar to the following image.
The threshold given as a parameter is a distance value on the basis of which the decision is made whether points/clusters will be merged into another cluster. The distance metric being used can also be specified.
Note that there are various methods for computing the intra-/inter-cluster similarity, e.g. distance between the closest points, distance between the furthest points, distance to the cluster centers and so on. Some of these methods are also supported by scipy's hierarchical clustering module (single/complete/average... linkage). According to your post I think you would want to use complete linkage.
Note that this approach also allows small (single point) clusters if they don't meet the similarity criterion of the other clusters, i.e. the distance threshold.
There are other algorithms that will perform better, which will become relevant in situations with lots of data points. As other answers/comments suggest you might also want to have a look at the DBSCAN algorithm:
https://en.wikipedia.org/wiki/DBSCAN
http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN
For a nice overview on these and other clustering algorithms, also have a look at this demo page (of Python's scikit-learn library):
http://scikit-learn.org/stable/modules/clustering.html
Image copied from that place:
As you can see, each algorithm makes some assumptions about the number and shape of the clusters that need to be taken into account. Be it implicit assumptions imposed by the algorithm or explicit assumptions specified by parameterization.
The answer by moooeeeep recommended using hierarchical clustering. I wanted to elaborate on how to choose the threshold of the clustering.
One way is to compute clusterings based on different thresholds t1, t2, t3,... and then compute a metric for the "quality" of the clustering. The premise is that the quality of a clustering with the optimal number of clusters will have the maximum value of the quality metric.
An example of a good quality metric I've used in the past is Calinski-Harabasz. Briefly: you compute the average inter-cluster distances and divide them by the within-cluster distances. The optimal clustering assignment will have clusters that are separated from each other the most, and clusters that are "tightest".
By the way, you don't have to use hierarchical clustering. You can also use something like k-means, precompute it for each k, and then pick the k that has the highest Calinski-Harabasz score.
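A short sketch of that sweep, here over hierarchical-clustering thresholds rather than k (scipy plus scikit-learn; the data and the threshold range are made up):

import numpy as np
import scipy.cluster.hierarchy as hcluster
from sklearn.metrics import calinski_harabasz_score

data = np.random.randn(300, 3)
best = None
for thresh in np.linspace(0.5, 5.0, 10):
    labels = hcluster.fclusterdata(data, thresh, criterion="distance")
    n_clusters = len(set(labels))
    if not 2 <= n_clusters <= len(data) - 1:
        continue                     # the score needs a non-trivial clustering
    score = calinski_harabasz_score(data, labels)
    if best is None or score > best[0]:
        best = (score, thresh, n_clusters)
print(best)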
Let me know if you need more references, and I'll scour my hard disk for some papers.
Check out the DBSCAN algorithm. It clusters based on the local density of vectors, i.e. they must not be more than some ε distance apart, and it can determine the number of clusters automatically. It also considers outliers, i.e. points with an insufficient number of ε-neighbors, to not be part of a cluster. The Wikipedia page links to a few implementations.
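A minimal scikit-learn sketch of this (the data is made up; note that eps constrains the distance between neighbours, not between all pairs within a cluster, so clusters can still grow wider than T through chaining):

import numpy as np
from sklearn.cluster import DBSCAN

T = 1.0                              # the distance threshold from the question
vectors = np.random.rand(1000, 3)    # your 3-D vectors here
labels = DBSCAN(eps=T, min_samples=2).fit_predict(vectors)
# points labelled -1 are the outliers that end up in no cluster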
Use OPTICS, which works well with large data sets.
OPTICS (Ordering Points To Identify the Clustering Structure) is closely related to DBSCAN: it finds core samples of high density and expands clusters from them. Unlike DBSCAN, it keeps the cluster hierarchy for a variable neighborhood radius, and it is better suited for use on large datasets than the current sklearn implementation of DBSCAN.
from sklearn.cluster import OPTICS
db = OPTICS(eps=3, min_samples=30).fit(X)   # X is your array of vectors
Fine tune eps, min_samples as per your requirement.
I want to add to moooeeeep's answer on using hierarchical clustering. That solution worked for me, though picking the threshold value felt quite "random". By referring to other sources and testing it myself, I got a better method, where the threshold can easily be picked from the dendrogram:
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt

ori_array = ["Your_list_here"]  # your (n_samples, n_features) data here
# build the Ward linkage once and reuse it for both the dendrogram and the clustering
ward_array = hierarchy.ward(pdist(ori_array))

hierarchy.dendrogram(ward_array)
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()
You will see the dendrogram plot. Then, by drawing a horizontal line, let's say at distance = 1, the number of intersections with the dendrogram will be your desired number of clusters. So here I choose threshold = 1, which gives 4 clusters.
threshold = 1
clusters_list = hierarchy.fcluster(ward_array, threshold, criterion="distance")
print("Clustering list: {}".format(clusters_list))
Now each value in clusters_list will be an assigned cluster-id for the corresponding point in ori_array.
You may have no solution: it is the case when the distance between any two distinct input data points is always greater than T. If you want to compute the number of clusters only from the input data, you may look at MCG, a hierarchical clustering method with an automatic stop criterion: see the free seminar paper at https://hal.archives-ouvertes.fr/hal-02124947/document (contains bibliographic references).
I needed a way to "fuzzy sort" lines from OCR output, when the output is sometimes out of order, but within blocks, the lines are usually in order. In this case, the items to sort are dictionaries, which describe words at a location 'x','y' and with size 'w','h'. The general clustering algorithms seemed like overkill and I needed to maintain the order of the items during the sort. Here, I can set the tolerance tol to be about 1/4 the line spacing, and this is called with the field being 'y'.
import statistics

def fuzzy_lod_sort(lod, field, tol):
    # fuzzy sort lod into bins within +/- tol
    # maintain original order.

    # first determine the bins.
    val_list = [d[field] for d in lod]
    vals_sorted = sorted(val_list)

    bins_lol = []
    i = 0
    for j, v in enumerate(vals_sorted):
        if not j:
            bins_lol.append([v])
            continue

        cur_bin_avg = statistics.mean(bins_lol[i])
        if abs(cur_bin_avg - v) <= tol:
            bins_lol[i].append(v)
            continue

        i += 1
        bins_lol.append([v])

    # now sort into the bins, maintaining the original order.
    # the bins will be the center of the range of 'y'.
    bins = [statistics.mean(binlist) for binlist in bins_lol]

    # initialize the list of bins
    lolod = []
    for _ in range(len(bins)):
        lolod.append([])

    for d in lod:
        bin_idx = closest_bin_idx(bins, d[field])
        lolod[bin_idx].append(d)

    # now join the bins.
    result_lod = []
    for lod in lolod:
        result_lod.extend(lod)

    return result_lod

def closest_bin(bins, val):
    return min(bins, key=lambda bin: abs(bin - val))

def closest_bin_idx(bins, val):
    return bins.index(closest_bin(bins, val))
The trouble is that the 'y' coordinates of the OCR output are based on the outline around each word, and a later word in the same line might have a 'y' coordinate that is lower than an earlier word's. So a full sort by 'y' does not work. This is much like a clustering algorithm, but the intention is a bit different: I am not interested in the statistics of the data points, but in exactly which cluster each one is placed in, and it is also important to maintain the original order.
Maybe there is some way to fuzzy sort using the sorting built-ins, and it might be an alternative to the clustering options for 1-D problems.
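A small hypothetical usage example for the code above (the word dictionaries and the tolerance are made up):

words = [
    {'x': 10, 'y': 102, 'w': 40, 'h': 12, 'text': 'Hello'},
    {'x': 60, 'y':  98, 'w': 50, 'h': 12, 'text': 'world'},
    {'x': 10, 'y': 131, 'w': 45, 'h': 12, 'text': 'second'},
    {'x': 62, 'y': 128, 'w': 30, 'h': 12, 'text': 'line'},
]
# groups the two lines despite the jitter in 'y', keeping the original order within each line
for d in fuzzy_lod_sort(words, 'y', tol=8):
    print(d['text'])    # Hello, world, second, line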
