I am using MATLAB and I have a very large .mat file named MeansOfK that contains a matrix of roughly 5,000,000 x N. My test data consists of car and non-car samples. My problem is that whenever I try to run k-means on MeansOfK, it runs out of memory.
[idx, ctr] = kmeans(MeansOfK , k, 'Distance', 'sqEuclidean');
My options are:
1. Use a divide-and-conquer technique: partition the car and non-car data into smaller chunks and feed each chunk to k-means.
2. Separate the car and non-car classes and run k-means on each class; the final output would be the combined car and non-car classes from the k-means process.
So my questions are:
Is what I am planning to do feasible?
Will partitioning the file, rather than processing it as a whole, affect the output of my k-means?
Suggestions and answers are always appreciated :)
Thanks
What you can do is leverage the Johnson-Lindenstrauss lemma: embed your dataset into a lower-dimensional space and then do the k-means computation on the smaller dataset. For instance, if your data matrix is A, you can do:
% N is the number of columns of A (the original dimension) and s is the reduced dimension
S = randn(N, s) / sqrt(s);
C = A * S;
% now you can do your kmeans computation on C
[idx, ctr] = kmeans(C, k, 'Distance', 'sqEuclidean');
Basically, you can use the idx and ctr results for the original dataset, which will give you a (1+epsilon) approximation. You can also reach better results based on the work of Dan Feldman, which basically says that you can compute an SVD of your data and project onto the top k/epsilon singular values to compute the k-means result and get a (1+epsilon) approximation.
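The question is in MATLAB, but here is a rough Python/NumPy sketch of that SVD-projection idea; the function name, the epsilon value, and the use of scikit-learn's KMeans are my own illustration, not part of the answer:

import numpy as np
from sklearn.cluster import KMeans

def approx_kmeans_via_svd(A, k, eps=0.5):
    """Project A onto its top ceil(k/eps) right singular directions, then run k-means."""
    j = min(int(np.ceil(k / eps)), A.shape[1])
    # For a matrix with millions of rows, a truncated/randomized SVD
    # (e.g. sklearn.decomposition.TruncatedSVD) would be more practical than a full SVD.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    A_proj = A @ Vt[:j].T                        # reduced representation of the data
    km = KMeans(n_clusters=k, n_init=10).fit(A_proj)
    return km.labels_, km.cluster_centers_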
UPDATE
Based on the comment, I'd like to suggest the coresets approach, again based on the paper by Dan Feldman et al., Turning Big Data Into Tiny Data. The technique provides the ability to reduce a large volume of data into a smaller one, with a provable guarantee of a (1+epsilon) approximation to the optimal k-means solution. Moreover, you can proceed with a streaming coreset construction, which will allow you to maintain an O(log n * epsilon) approximation while streaming your data (section 10, figure 3), e.g. in your case partitioning it into smaller chunks. Eventually you can run the k-means computation on the resulting coreset.
You might also consider taking a look at my recent publication for more details on how to handle your case. There you can also find a reference to my GitHub account if you'd like to use the implementation.
I would say your only real option, if increasing memory is not possible, is to partition the data into smaller sets. When I ran a big-data project using collaborative filtering algorithms, we dealt with sets as large as 700 million+ entries, and whenever we maxed out the memory it meant we needed to partition the data into smaller sets and run the algorithms on them separately.
Related
I am trying to understand PCA and K-Means algorithms in order to extract some relevant features from a set of features.
I don't know which branch of computer science studies these topics; it seems there aren't good resources on the internet, just some papers that I don't understand well. An example of such a paper: http://www.ifp.illinois.edu/~qitian/e_paper/icip02/icip02.pdf
I have CSV files of people's walks composed as follows:
TIME, X, Y, Z; these values are registered by the accelerometer.
What I did
I transformed the dataset into a table in Python.
I used tsfresh, a Python library, to extract a vector of features from each walk; there are a lot of them, 2k+ features per walk.
I have to use PFA, Principal Feature Analysis, to select the relevant features from the set of feature vectors.
To do the last point, I have to reduce the dimension of the set of walk features with PCA (PCA will make the data different from the original, because it transforms the data using the eigenvectors and eigenvalues of the covariance matrix of the original data). Here I have the first question:
How should the input of PCA look? Should the rows be the walks and the columns the features, or vice versa, i.e. the rows are the features and the columns are the people's walks?
After I have reduced this data, I should use the K-Means algorithm on the reduced 'features' data. How should the input to K-Means look? And what is the purpose of using this algorithm? All I know is that this algorithm is used to 'cluster' data, so each cluster contains some 'points' grouped by some rule. What I did and what I think is:
If the input I use for PCA has rows as walks and columns as features, then for K-Means I should swap the columns with the rows, because this way each point is a feature (but this is not the original data with the features, it's just the reduced one, so I'm not sure). Then, for each cluster, I check with the Euclidean distance which point is closest to the centroid and select that feature. So how many clusters should I declare? If I declare as many clusters as there are features, I will always extract the same number of features. And how can I tell that a point in the reduced data corresponds to a particular feature in the original set of features?
I know that what I am saying may not be correct, but I am trying to understand it. Can some of you help me? Am I on the right track? Thanks!
For PCA, make sure you separate the understanding of the method the algorithm uses (eigenvectors and such) from the result. The result is a linear mapping from the original space A to a space A' whose dimension (the number of features, in your case) is possibly less than that of the original space A.
So the first feature/element in space A' is a linear combination of the features of A.
The row/column depends on implementation, but if you use scikit PCA the columns are the features.
You can feed the PCA output, the A' space, to K-means, and it will cluster the points in a space of (usually) reduced dimension.
Each point will be part of a cluster, and the idea is that if you calculated K-Means on A, you would probably end up with the same/similar clusters as with A'. Computationally, A' is a lot cheaper. You now have a clustering on A' and on A, since points that are similar in A' are also similar in A.
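To make the input shapes concrete, here is a small scikit-learn sketch; the array name, the number of components, and the number of clusters are illustrative placeholders, not values from the answer. Rows are samples (walks) and columns are features, for both PCA and KMeans:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# X: one row per walk, one column per tsfresh feature (n_walks x n_features)
X = np.random.rand(100, 2000)          # placeholder data for the sketch

pca = PCA(n_components=50)             # maps A to A' with 50 components (illustrative)
X_reduced = pca.fit_transform(X)       # shape: n_walks x 50

km = KMeans(n_clusters=5, n_init=10)   # cluster the walks in the reduced space
labels = km.fit_predict(X_reduced)     # one cluster label per walk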
The number of clusters is difficult to answer; if you don't know anything, look up the elbow method. But say you want to get a sense of the different types of things you have: I would argue for 3~8 and not too many. Compare the 2-3 points closest to each center, and you have something consumable. The number of features can be larger than the number of clusters, or the other way around. E.g. if we want to know the densest areas in some region (2D), you can easily have 50 clusters, to get a sense of where 50 cities could be. Here the number of clusters is way higher than the space dimension, and it makes sense.
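If it helps, one common way to apply the elbow method with scikit-learn is to plot the inertia over a range of k and look for the bend; this is a minimal sketch with placeholder data and an arbitrary range of k:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X_reduced = np.random.rand(100, 50)     # stand-in for the PCA output above

inertias = []
ks = range(2, 12)                       # candidate cluster counts (illustrative)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10).fit(X_reduced)
    inertias.append(km.inertia_)        # within-cluster sum of squared distances

plt.plot(list(ks), inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('inertia')
plt.show()                              # the "elbow" suggests a reasonable k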
I've been trying to figure out how to efficiently calculate the covariance in a moving window, i.e. moving from a set of values (x[0], y[0])..(x[n-1], y[n-1]) to a new set of values (x[1], y[1])..(x[n], y[n]). In other words, the value (x[0], y[0]) gets replaced by the value (x[n], y[n]). For performance reasons I need to calculate the covariance incrementally, in the sense that I'd like to express the new covariance Cov(x[1]..x[n], y[1]..y[n]) in terms of the previous covariance Cov(x[0]..x[n-1], y[0]..y[n-1]).
Starting off with the naive formula for covariance as described here:
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Covariance
All I can come up with is:
Cov(x[1]..x[n], y[1]..y[n]) =
Cov(x[0]..x[n-1], y[0]..y[n-1]) +
(x[n]*y[n] - x[0]*y[0]) / n -
AVG(x[1]..x[n]) * AVG(y[1]..y[n]) +
AVG(x[0]..x[n-1]) * AVG(y[0]..y[n-1])
I'm sorry about the notation, I hope it's more or less clear what I'm trying to express.
However, I'm not sure if this is sufficiently numerically stable. Dealing with large values I might run into arithmetic overflows or other (for example cancellation) issues.
Is there a better way to do this?
Thanks for any help.
It looks like you are trying some form of "add the new value and subtract the old one". You are correct to worry: this method is not numerically stable. Keeping sums this way is subject to drift, but the real killer is the fact that at each step you are subtracting a large number from another large number to get what is likely a very small number.
One improvement would be to maintain your sums (of x_i, y_i, and x_i*y_i) independently, and recompute the naive formula from them at each step. Your running sums would still drift, and the naive formula is still numerically unstable, but at least you would only have one step of numerical instability.
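As a rough illustration of that improvement (a sketch of my own, not from the answer; note that the cancellation problem of the naive formula remains):

from collections import deque

class WindowedCovNaive:
    """Sliding-window covariance from running sums of x, y and x*y (naive formula)."""
    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()
        self.sx = self.sy = self.sxy = 0.0

    def push(self, x, y):
        self.window.append((x, y))
        self.sx += x; self.sy += y; self.sxy += x * y
        if len(self.window) > self.window_size:   # drop the oldest pair
            ox, oy = self.window.popleft()
            self.sx -= ox; self.sy -= oy; self.sxy -= ox * oy

    def cov(self):
        m = len(self.window)
        return self.sxy / m - (self.sx / m) * (self.sy / m)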
A stable way to solve this problem would be to implement a formula for (stably) merging statistical sets, and evaluate your overall covariance using a merge tree. Moving your window would update one of your leaves, requiring an update of each node from that leaf to the root. For a window of size n, this method would take O(log n) time per update instead of the O(1) naive computation, but the result would be stable and accurate. Also, if you don't need the statistics for each incremental step, you can update the tree once per each output sample instead of once per input sample. If you have k input samples per output sample, this reduces the cost per input sample to O(1 + (log n)/k).
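The merge step can use the pairwise update from the Wikipedia page linked in the question. A sketch, where a statistics node is a tuple (count, mean_x, mean_y, C) with C the co-moment sum((x - mean_x) * (y - mean_y)), and a single sample (x, y) is the leaf (1, x, y, 0.0):

def merge(a, b):
    """Merge two partial statistics tuples (n, mean_x, mean_y, C)."""
    n_a, mx_a, my_a, C_a = a
    n_b, mx_b, my_b, C_b = b
    n = n_a + n_b
    dx = mx_b - mx_a
    dy = my_b - my_a
    mx = mx_a + dx * n_b / n
    my = my_a + dy * n_b / n
    C = C_a + C_b + dx * dy * n_a * n_b / n   # merged co-moment
    return (n, mx, my, C)                     # covariance = C / n (population)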
From the comments: the wikipedia page you reference includes a section on Knuth's online algorithm, which is relatively stable, though still prone to drift. You should be able to do something comparable for covariance; and resetting your computation every K*n samples should limit the drift at minimal cost.
Not sure why no one has mentioned this, but you can use the Welford online algorithm, which relies on the running mean.
The equations should look like:

the online means given by

mean_x[n] = mean_x[n-1] + (x[n] - mean_x[n-1]) / n
mean_y[n] = mean_y[n-1] + (y[n] - mean_y[n-1]) / n

and the running co-moment by

C[n] = C[n-1] + (x[n] - mean_x[n-1]) * (y[n] - mean_y[n])

with the covariance at step n given by C[n] / n.
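A minimal Python sketch of that update (my own illustration; for the sliding window in the question you would also need to remove the oldest sample, which can be done by inverting the update but reintroduces some drift):

class OnlineCov:
    """Welford-style online covariance: add samples one at a time."""
    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.C = 0.0                      # running co-moment sum((x - mean_x)(y - mean_y))

    def add(self, x, y):
        self.n += 1
        dx = x - self.mean_x              # deviation from the *old* mean of x
        self.mean_x += dx / self.n
        self.mean_y += (y - self.mean_y) / self.n
        self.C += dx * (y - self.mean_y)  # old mean of x, new mean of y

    def cov(self):
        return self.C / self.n            # population covariance; use n - 1 for sample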
I have two datasets, let's say checkins and POIs, and I have to join them based on geo-coordinates: say, if a user was seen within a radius of N km of a POI, I need to join them (in other words, I want to collect all users near each POI for further manipulation). But I have an issue with this geo matching...
Initially I see two different possibilities:
1) implement LSH (locality-sensitive hashing) - looks really complicated and performance might suffer as well
2) split the whole map into regions (a 2D matrix) and then calculate which regions are within N km of each checkin or POI - then emit all of them - as a result some deduplication must be applied - so I'm not sure it's a valid algorithm at all
any best practices?
Interesting problem.
I assume you have considered the naive brute-force approach and found it too time-consuming for your purposes. In the brute-force approach you calculate distances between each of n POIs and each of m checkins, leading to time complexity of O(n*m).
The simplest heuristic I can think of that is also applicable in Spark is to reduce the full linear scan of one dataset by grouping its elements into buckets. Something like this:
case class Position(x: Double, y: Double)
val checkins: RDD[Position] = ???
val radius = 10
val checkinBuckets = checkins.groupBy(pos => (pos.x/radius).toInt)
Instead of a full linear scan, one can then search only the corresponding bucket and its two neighbours. If necessary, one can create a second level by grouping the buckets to further speed up the lookup. Also, one should take care of details like correct rounding of pos.x/radius, GPS distance calculation, etc.
Of course, you can always dive into the various approaches to the nearest-neighbour search problem, as proposed by @huitseeker. Also, this paper has a nice intro to NNS.
There is a very expensive computation I must make frequently.
The computation takes a small array of numbers (with about 20 entries) that sums to 1 (i.e. the histogram) and outputs something that I can store pretty easily.
I have 2 things going for me:
I can accept approximate answers
The "answers" change slowly. For example: [.1 .1 .8 0] and [.1
.1 .75 .05] will yield similar results.
Consequently, I want to build a look-up table of answers off-line. Then, when the system is running, I can look-up an approximate answer based on the "shape" of the input histogram.
To be precise, I plan to look-up the precomputed answer that corresponds to the histogram with the minimum Earth-Mover-Distance to the actual input histogram.
I can only afford to store about 80 to 100 precomputed (histogram , computation result) pairs in my look up table.
So, how do I "spread out" my precomputed histograms so that, no matter what the input histogram is, I'll always have a precomputed result that is "close"?
Finding N points in M-space that are a best spread-out set is more-or-less equivalent to hypersphere packing (1,2) and in general answers are not known for M>10. While a fair amount of research has been done to develop faster methods for hypersphere packings or approximations, it is still regarded as a hard problem.
It probably would be better to apply a technique like principal component analysis or factor analysis to as large a set of histograms as you can conveniently generate. The results of either analysis will be a set of M numbers such that linear combinations of histogram data elements weighted by those numbers will predict some objective function. That function could be the “something that you can store pretty easily” numbers, or could be case numbers. Also consider developing and training a neural net or using other predictive modeling techniques to predict the objective function.
Building on @jwpat7's answer, I would apply k-means clustering to a huge set of randomly generated (and hopefully representative) histograms. This would ensure that your space is spanned with whatever number of exemplars (precomputed results) you can support, with roughly equal weighting for each cluster.
The trick, of course, will be generating representative data to cluster in the first place. If you can recompute from time to time, you can recluster based on the actual data in the system so that your clusters might get better over time.
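A rough sketch of that idea; Dirichlet sampling is just one way to generate random histograms that sum to 1, and the pool size and number of exemplars here are illustrative placeholders:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# A large pool of random 20-bin histograms, each summing to 1
pool = rng.dirichlet(alpha=np.ones(20), size=100_000)

# Cluster them into as many exemplars as the lookup table can hold (~80-100)
km = KMeans(n_clusters=90, n_init=10).fit(pool)
exemplars = km.cluster_centers_
exemplars /= exemplars.sum(axis=1, keepdims=True)   # renormalize to sum to 1

# Offline: run the expensive computation once per exemplar and store the results.
# Online: find the exemplar closest to the input histogram and return its stored result.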
I second jwpat7's answer, but my very naive approach was to treat the count in each histogram bin as a y value, treat the x values as just 0..1 in 20 steps, and then obtain parameters a, b, c that describe x vs. y as a cubic function.
To get a "covering" of the histograms I just iterated through "possible" values for each parameter.
e.g. to get 27 histograms to cover the "shape space" of my cubic histogram model I iterated the parameters through -1 .. 1, choosing 3 values linearly spaced.
Now, you could change the histogram model to be quartic if you think your data will often be represented that way, or whatever model you think is most descriptive, as well as generate however many histograms to cover. I used 27 because three partitions per parameter for three parameters is 3*3*3=27.
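For what it's worth, a sketch of how such a covering could be generated; the shift-and-normalize step is my own assumption for turning the cubic curves into valid histograms:

import numpy as np
from itertools import product

x = np.linspace(0, 1, 20)                                 # 20 bins on [0, 1]
covering = []
for a, b, c in product(np.linspace(-1, 1, 3), repeat=3):  # 3*3*3 = 27 shapes
    y = a * x**3 + b * x**2 + c * x                       # cubic model, no constant term
    y = y - y.min()                                       # shift so all bins are non-negative
    if y.sum() == 0:                                      # skip the degenerate flat case
        continue
    covering.append(y / y.sum())                          # normalize to sum to 1
covering = np.array(covering)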
For a more comprehensive covering, like 100, you would have to choose your ranges for each parameter more carefully. 100**(1/3) isn't an integer, so the simple num_covers**(1/num_params) solution wouldn't work, but for 3 parameters 4*5*5 would.
Since the actual values of the parameters could vary greatly and still produce the same shape, it would probably be best to store ratios of them for comparison instead, e.g. for my 3 parameters b/a and b/c.
[Figure: an 81-histogram "covering" using a quartic model, again with parameters chosen from linspace(-1,1,3).]
edit: Since you said your histograms were described by arrays of ~20 elements, I figured fitting the parameters would be very fast.
edit2: On second thought, I think using a constant term in the model is pointless; all that matters is the shape.
What is the best way to test a clustering algorithm? I am using an agglomerative clustering algorithm with a stop criterion. How do I test if the clusters are formed correctly or not?
A good rule of thumb for evaluating how much a graph can be clustered (on a coarse-grained level) has to do with the "eigenvalue gap". Given a weighted graph A, calculate the eigenvalues and sort them (this is the eigenvalue spectrum). When plotted, if there is a large jump in the spectrum at some point, there is a natural corresponding number of blocks into which to partition the graph.
Below is an example (in numpy/Python) showing that, given an almost block-diagonal matrix, there is a large gap in the eigenvalue spectrum at the number of blocks (parameterized by c in the code). Note that a matrix permutation (equivalent to relabeling your graph nodes) still gives the same spectral gap:
from numpy import *
import pylab as plt
# Make a block diagonal matrix
N = 30
c = 5
A = zeros((N*c,N*c))
for m in range(c):
A[m*N:(m+1)*N, m*N:(m+1)*N] = random.random((N,N))
# Add some noise
A += random.random(A.shape) * 0.1
# Make symmetric
A += A.T - diag(A.diagonal())
# Show the original matrix
plt.subplot(131)
plt.imshow(A.copy(), interpolation='nearest')
# Permute the matrix for effect
idx = random.permutation(N*c)
A = A[idx,:][:,idx]
# Compute eigenvalues
L = linalg.eigvalsh(A)
# Show the results
plt.subplot(132)
plt.imshow(A, interpolation='nearest')
plt.subplot(133)
plt.plot(sorted(L,reverse=True))
plt.plot([c-.5,c-.5],[0,max(L)],'r--')
plt.ylim(0,max(L))
plt.xlim(0,20)
plt.show()
It depends on what you want to test against.
When testing your own implementation of a known algorithm, you might want to compare the results with that of a known good implementation.
Hierarchical clustering is hard to test with respect to quality, as it is hierarchical. The common measures such as Rand index etc. are only valid for strict partitionings. You can get a strict partitioning from a hierarchical clustering, but then you need to fix the height to cut at.
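For example, with SciPy and scikit-learn that could look roughly like the following sketch; the linkage method, the cut (4 clusters), and the placeholder data and labels are all illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

X = np.random.rand(200, 5)                       # placeholder data
true_labels = np.random.randint(0, 3, 200)       # placeholder ground-truth partitioning

Z = linkage(X, method='ward')                    # agglomerative (hierarchical) clustering
flat = fcluster(Z, t=4, criterion='maxclust')    # cut the hierarchy into a strict partitioning

print(adjusted_rand_score(true_labels, flat))    # compare against the known partitioning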
Ideally you have some kind of pre-clustered data (supervised learning) and test the results of your clustering algorithm on that. Simply count the number of correct classifications divided by the total number of classifications performed to get an accuracy score.
If you are doing unsupervised learning, then there is really no way to evaluate your algorithm.
It is sometimes useful to construct input data where there is a known, and perhaps obvious, answer by construction. For a clustering algorithm, you might construct data with N clusters such that the maximum distance between any two points in the same cluster is smaller than the minimum distance between any two points in different clusters. Another option would be to generate a number of different data sets plottable as 2-D scatter diagrams with clusters obvious to the eye, then compare the result from your algorithm with this structure, perhaps moving the clusters together to see when the algorithm fails to see them.
You might be able to do better given knowledge of your particular clustering algorithm, but the above might at least have some chance of flushing obvious bugs from cover.
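As a rough illustration of the constructed-data idea above, using scikit-learn's synthetic blob generator (all parameter values are illustrative):

from sklearn.datasets import make_blobs

# Well-separated clusters: small within-cluster spread, large between-cluster distances
X, true_labels = make_blobs(n_samples=300, centers=4,
                            cluster_std=0.3, center_box=(-20, 20),
                            random_state=0)

# Feed X to the clustering algorithm under test and check that the recovered grouping
# matches true_labels; then increase cluster_std to move the clusters together and
# observe where the algorithm starts to fail to separate them.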