BucketRandomProjectionLSH KNN parameters - knn

I am trying to use KNN algorithm from spark 2.2.0. I am wondering how I should set my bucket length. The record count/number of features varies, so I think it is better to set length by some conditions. How should I set the bucket length for better performance? I rescaled all the features in vector into 0 to 1.
Also, is there any way to guarantee KNN algorithm to return minimum number of elemnets? I found out that sometimes number of elements inside the bucket is smaller than queried k, and I might want at least one or two neighbors as result.
Thanks~
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.BucketedRandomProjectionLSH

According to Scaladocs
If input vectors are normalized, 1-10 times of pow(numRecords,
-1/inputDim) would be a reasonable value

Related

Finding subsets of a set of points that are all a maximum distance from each other?

I have a csv file with the following format:
thing1_id, thing2_id, similarity
The similarity is between 50 and 100. I've filtered out all pairs with similarity less than 50, but I do have the full set where the lowest is around 25. There are duplicate comparisons at the moment, i.e. thing1-thing2 is a separate entry from thing2-thing1.
I'm interested in writing a program that will take in a similarity threshold (s) and minimum number of items per set (n), and give me all sets of size n or greater with things that are all at least s% similar to all other elements in that set.
I was thinking a graph might be the best data structure to do this? Where each thing is a node, and the similarity is a weighted edge. I'm not too sure where to go from here without taking way too much memory. This is a set of around 400 things.

lightgbm how to deal with No further splits with positive gain, best gain: -inf

how to deal with [Warning] No further splits with positive gain, best gain: -inf
is there any parameters not suit?
Some explanation from lightGBM's issues:
it means the learning of tree in current iteration should be stop, due to cannot split any more.
I think this is caused by "min_data_in_leaf":1000, you can set it to a smaller value.
This is not a bug, it is a feature.
The output message is to warn user that your parameters may be wrong, or your dataset is not easy to learn.
link: https://github.com/Microsoft/LightGBM/issues/640
So on the contrary, the data is hard to fit.
This means that no improvement can be gained by adding additional leaves to the tree subject to the restrictions of the hyperparameters. It is not necessarily a bad thing, as limiting the depth of the tree can prevent overfitting. However, if the tree is underfitting the data, try tweaking these hyperparameters:
decrease min_data_in_leaf - minimum number of data points in a leaf
decrease min_sum_hessian_in_leaf - Minimum sum of the Hessian (second derivative of the objective function evaluated for each observation) for observations in a leaf. For some regression objectives, this is just the minimum number of records that have to fall into each node. For classification objectives, it represents a sum over a distribution of probabilities. It works like min_child_weight in xgboost.
increase max_bin or max_bin_by_feature when creating dataset
LightGBM training buckets continuous features into discrete bins to improve training speed and reduce memory requirements for training. This binning is done one time during Dataset construction. Increasing the number of bins per feature can increase the number of splits that can be made.
max_bin controls the maximum number of bins that features will bucketed into. It is also possible to set this maximum feature-by-feature, by passing max_bin_by_feature.
set 'verbosity': -1 in params, it works!
Increase max_depth or set it to -1.

KMeans evaluation metric not converging. Is this normal behavior or no?

I'm working on a problem that necessitates running KMeans separately on ~125 different datasets. Therefore, I'm looking to mathematically calculate the 'optimal' K for each respective dataset. However, the evaluation metric continues decreasing with higher K values.
For a sample dataset, there are 50K rows and 8 columns. Using sklearn's calinski-harabaz score, I'm iterating through different K values to find the optimum / minimum score. However, my code reached k=5,600 and the calinski-harabaz score was still decreasing!
Something weird seems to be happening. Does the metric not work well? Could my data be flawed (see my question about normalizing rows after PCA)? Is there another/better way to mathematically converge on the 'optimal' K? Or should I force myself to manually pick a constant K across all datasets?
Any additional perspectives would be helpful. Thanks!
I don't know anything about the calinski-harabaz score but some score metrics will be monotone increasing/decreasing with respect to increasing K. For instance the mean squared error for linear regression will always decrease each time a new feature is added to the model so other scores that add penalties for increasing number of features have been developed.
There is a very good answer here that covers CH scores well. A simple method that generally works well for these monotone scoring metrics is to plot K vs the score and choose the K where the score is no longer improving 'much'. This is very subjective but can still give good results.
SUMMARY
The metric decreases with each increase of K; this strongly suggests that you do not have a natural clustering upon the data set.
DISCUSSION
CH scores depend on the ratio between intra- and inter-cluster densities. For a relatively smooth distribution of points, each increase in K will give you clusters that are slightly more dense, with slightly lower density between them. Try a lattice of points: vary the radius and do the computations by hand; you'll see how that works. At the extreme end, K = n: each point is its own cluster, with infinite density, and 0 density between clusters.
OTHER METRICS
Perhaps the simplest metric is sum-of-squares, which is already part of the clustering computations. Sum the squares of distances from the centroid, divide by n-1 (n=cluster population), and then add/average those over all clusters.
I'm looking for a particular paper that discusses metrics for this very problem; if I can find the reference, I'll update this answer.
N.B. With any metric you choose (as with CH), a failure to find a local minimum suggests that the data really don't have a natural clustering.
WHAT TO DO NEXT?
Render your data in some form you can visualize. If you see a natural clustering, look at the characteristics; how is it that you can see it, but the algebra (metrics) cannot? Formulate a metric that highlights the differences you perceive.
I know, this is an effort similar to the problem you're trying to automate. Welcome to research. :-)
The problem with my question is that the 'best' Calinski-Harabaz score is the maximum, whereas my question assumed the 'best' was the minimum. It is computed by analyzing the ratio of between-cluster dispersion vs. within-cluster dispersion, the former/numerator you want to maximize, the latter/denominator you want to minimize. As it turned out, in this dataset, the 'best' CH score was with 2 clusters (the minimum available for comparison). I actually ran with K=1, and this produced good results as well. As Prune suggested, there appears to be no natural grouping within the dataset.

"Covering" the space of all possible histogram shapes

There is a very expensive computation I must make frequently.
The computation takes a small array of numbers (with about 20 entries) that sums to 1 (i.e. the histogram) and outputs something that I can store pretty easily.
I have 2 things going for me:
I can accept approximate answers
The "answers" change slowly. For example: [.1 .1 .8 0] and [.1
.1 .75 .05] will yield similar results.
Consequently, I want to build a look-up table of answers off-line. Then, when the system is running, I can look-up an approximate answer based on the "shape" of the input histogram.
To be precise, I plan to look-up the precomputed answer that corresponds to the histogram with the minimum Earth-Mover-Distance to the actual input histogram.
I can only afford to store about 80 to 100 precomputed (histogram , computation result) pairs in my look up table.
So, how do I "spread out" my precomputed histograms so that, no matter what the input histogram is, I'll always have a precomputed result that is "close"?
Finding N points in M-space that are a best spread-out set is more-or-less equivalent to hypersphere packing (1,2) and in general answers are not known for M>10. While a fair amount of research has been done to develop faster methods for hypersphere packings or approximations, it is still regarded as a hard problem.
It probably would be better to apply a technique like principal component analysis or factor analysis to as large a set of histograms as you can conveniently generate. The results of either analysis will be a set of M numbers such that linear combinations of histogram data elements weighted by those numbers will predict some objective function. That function could be the “something that you can store pretty easily” numbers, or could be case numbers. Also consider developing and training a neural net or using other predictive modeling techniques to predict the objective function.
Building on #jwpat7's answer, I would apply k-means clustering to a huge set of randomly generated (and hopefully representative) histograms. This would ensure that your space was spanned with whatever number of exemplars (precomputed results) you can support, with roughly equal weighting for each cluster.
The trick, of course, will be generating representative data to cluster in the first place. If you can recompute from time to time, you can recluster based on the actual data in the system so that your clusters might get better over time.
I second jwpat7's answer, but my very naive approach was to consider the count of items in each histogram bin as a y value, to consider the x values as just 0..1 in 20 steps, and then to obtain parameters a,b,c that describe x vs y as a cubic function.
To get a "covering" of the histograms I just iterated through "possible" values for each parameter.
e.g. to get 27 histograms to cover the "shape space" of my cubic histogram model I iterated the parameters through -1 .. 1, choosing 3 values linearly spaced.
Now, you could change the histogram model to be quartic if you think your data will often be represented that way, or whatever model you think is most descriptive, as well as generate however many histograms to cover. I used 27 because three partitions per parameter for three parameters is 3*3*3=27.
For a more comprehensive covering, like 100, you would have to more carefully choose your ranges for each parameter. 100**.3 isn't an integer, so the simple num_covers**(1/num_params) solution wouldn't work, but for 3 parameters 4*5*5 would.
Since the actual values of the parameters could vary greatly and still achieve the same shape it would probably be best to store ratios of them for comparison instead, e.g. for my 3 parmeters b/a and b/c.
Here is an 81 histogram "covering" using a quartic model, again with parameters chosen from linspace(-1,1,3):
edit: Since you said your histograms were described by arrays that were ~20 elements, I figured fitting parameters would be very fast.
edit2 on second thought I think using a constant in the model is pointless, all that matters is the shape.

Indexing for similarity search

I have about 100M numeric vectors (Minhash fingerprints), each vector contains 100 integer numbers between 0 and 65536, and I'm trying to do a fast similarity search against this database of fingerprints using Jaccard similarity, i.e. given a query vector (e.g. [1,0,30, 9, 42, ...]) find the ratio of intersection/union of this query set against the database of 100M sets.
The requirement is to return k "nearest neighbors" of the query vector in <1 sec (not including indexing/File IO time) on a laptop. So obviously some kind of indexing is required, and the question is what would be the most efficient way to approach this.
notes:
I thought of using SimHash but in this case actually need to know the size of intersection of the sets to identify containment rather than pure similarity/resemblance, but Simhash would lose that information.
I've tried using a simple locality sensitive hashing technique as described in ch3 of Jeffrey Ullman's book by dividing each vector into 20 "bands" or snippets of length 5, converting these snippets into strings (e.g. [1, 2, 45, 2, 3] - > "124523") and using these strings as keys in a hash table, where each key contains "candidate neighbors". But the problem is that it creates too many candidates for some of these snippets and changing number of bands doesn't help.
I might be a bit late, but I would suggest IVFADC indexing by Jegou et al.: Product Quantization for Nearest Neighbor Search
It works for L2 Distance/dot product similarity measures and is a bit complex, but it's particularly efficient in terms of both time and memory.
It is also implemented in the FAISS library for similarity search, so you could also take a look at that.
One way to go about this is the following:
(1) Arrange the vectors into a tree (a radix tree).
(2) Query the tree with a fuzzy criteria, in other words, a match is if the difference in values at each node of the tree is within a threshold
(3) From (2) generate a subtree that contains all the matching vectors
(4) Now, repeat process (2) on the sub tree with a smaller threshold
Continue until the subtree has K items. If K has too few items, then take the previous tree and calculate the Jacard distance on each member of the subtree and sort to eliminate the worst matches until you have only K items left.
answering my own question after 6 years, there is a benchmark for approximate nearest neighbor search with many algorithms to solve this problem: https://github.com/erikbern/ann-benchmarks, the current winner is "Hierarchical Navigable Small World graphs": https://github.com/nmslib/hnswlib
You can use off-the-shelf similarity search services such as AWS-ES or Pinecone.io.

Resources