Looking for algorithm: Clustering by 'similarity' - algorithm

I have a set of 'vectors' and I need to sort them based on their 'similarity'.
For example, the vectors {1,0,0} {1,1,0} {0,1,0} {1,0,1} are pretty similar and should end up close to each other, but the vectors {1, 0, 0} {8, 0, 0} {0, 5, 0} are not.
The metric between A and B is max(abs(A[i]-B[i])), but what kind of algorithm can sort things based on relative comparison?
upd:
input: array of N vectors
output: array of N vectors in which vectors adjacent by index (arr[i] and arr[i+1], for example) are 'similar', i.e. the metric between arr[i] and arr[i+1] is as low as possible for every i.
metric - maximum difference of vector components
upd2:
as it seems now, @jogojapan was right: I need to cluster the vectors and then print them in some linear order, group by group

That's a distance induced by the max norm (aka sup norm or l-infinity norm). A distance is not enough to create a linear ordering, if by sorting you mean ordering in a sequence.
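For reference, here is a minimal Python sketch of that metric (often called the Chebyshev distance), applied to the vectors from the question. The function name `chebyshev` is an illustrative choice, not something from the question.

```python
def chebyshev(a, b):
    """Max-norm (l-infinity) distance: largest absolute component difference."""
    return max(abs(x - y) for x, y in zip(a, b))

# The first group from the question is mutually close ...
close = [(1, 0, 0), (1, 1, 0), (0, 1, 0), (1, 0, 1)]
print(max(chebyshev(p, q) for p in close for q in close))  # 1

# ... while the second group is spread out.
print(chebyshev((1, 0, 0), (8, 0, 0)))  # 7
print(chebyshev((8, 0, 0), (0, 5, 0)))  # 8
```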

Sorting is inherently a one-dimensional problem. What you're describing here sounds more like a weighted graph but it's not clear what your goal is. You may also find some concepts from information theory such as Hamming Distance to be useful if you're trying to identify the vector which is "closest" to a known vector.

Well, the obvious approach would be the (IMHO badly named) "hierarchical clustering", which always merges those clusters with the smallest distance. You can plug in your metric there. Most implementations are in O(n^3) and thus not useful for large datasets. Plus, you get a huge dendrogram that is hard to read.
You might want to give OPTICS a try. Look it up on Wikipedia. It might satisfy your needs quite well, since it in fact sorts the points. It will walk from one cluster to another, and can in fact produce a hierarchical (as in "nested") clustering. A good implementation should run in O(n^2) without index structures and in O(n log n) with index acceleration.
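As a rough illustration of "cluster first, then print group by group", the sketch below is neither OPTICS nor a full hierarchical clustering. It only does what cutting a single-linkage dendrogram at a fixed height would do: vectors connected by a chain of distances no larger than `threshold` land in the same group. The names `group_by_threshold` and `linear_order`, and the threshold value, are assumptions made for this example.

```python
def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def group_by_threshold(vectors, threshold, distance=chebyshev):
    """Single-linkage grouping with a union-find; O(n^2) distance evaluations."""
    parent = list(range(len(vectors)))

    def find(i):
        # follow parent links to the root, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if distance(vectors[i], vectors[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, v in enumerate(vectors):
        groups.setdefault(find(i), []).append(v)
    return list(groups.values())

def linear_order(vectors, threshold):
    """Flatten the groups into one sequence, group by group."""
    return [v for group in group_by_threshold(vectors, threshold) for v in group]

vectors = [(1, 0, 0), (8, 0, 0), (1, 1, 0), (0, 5, 0), (0, 1, 0), (1, 0, 1)]
print(linear_order(vectors, threshold=1))
```

With a threshold of 1, the four similar vectors from the question come out next to each other, followed by the two outliers. The order inside a group is simply input order; a better within-group order would need something like OPTICS.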

Any sorting algorithm can give you the results you want.
The question is how you are going to compare your vectors. Do you just want to compare them by magnitude? Or something else?

Related

maxmin clustering algorithm

I read a paper that mentions a max-min clustering algorithm, but I don't quite understand what this algorithm does. Googling "max min clustering algorithm" doesn't yield any helpful results. Does anybody know what this algorithm means? This is an excerpt from the paper:
Max-min clustering proceeds by choosing an observation at random as the first centroid c1, and by setting the set C of centroids to {c1}. During the ith iteration, ci is chosen such that it maximizes the minimum Euclidean distance between ci and observations in C. Max-min clustering is preferable to a density-based clustering algorithm (e.g. k-means) which would tend to select many examples from the dense group of non-seizure data points.
I don't quite understand the bolded part.
link to paper is here
We choose each new centroid to be as far as possible from the existing centroids. Here's some Python code.
def maxminclustering(observations, k):
    # `distance` is assumed to be defined elsewhere (e.g. Euclidean distance)
    observations = set(observations)
    if k < 1 or not observations:
        return set()
    centroids = set([observations.pop()])
    for i in range(min(k - 1, len(observations))):
        # pick the observation whose nearest centroid is farthest away
        newcentroid = max(observations,
                          key=lambda observation:
                              min(distance(observation, centroid)
                                  for centroid in centroids))
        observations.remove(newcentroid)
        centroids.add(newcentroid)
    return centroids
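To make the selection order easy to follow, here is a deterministic variant of the same idea with a small usage example. It is a sketch under assumptions: it starts from the first observation instead of a random one, uses Euclidean distance via `math.dist` (Python 3.8+), and the name `farthest_first` is made up for this example.

```python
import math

def farthest_first(observations, k, distance=math.dist):
    """Pick up to k centroids, each as far as possible from those already chosen."""
    remaining = list(observations)
    if k < 1 or not remaining:
        return []
    centroids = [remaining.pop(0)]  # deterministic start, unlike the random pick above
    while remaining and len(centroids) < k:
        # maximize the minimum distance to the current centroids
        best = max(remaining,
                   key=lambda obs: min(distance(obs, c) for c in centroids))
        remaining.remove(best)
        centroids.append(best)
    return centroids

points = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 5)]
print(farthest_first(points, 3))  # [(0, 0), (10, 1), (5, 5)]
```

Starting from (0, 0), the farthest point is (10, 1); after that, (5, 5) has the largest minimum distance to both chosen centroids, so the near-duplicates (1, 0) and (10, 0) are skipped.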
This sounds a lot like the farthest-points heuristic for seeding k-means, but then not performing any k-means iterations at all.
This is a surprisingly simple, but quite effective strategy. Basically it will find a number of data points that are well spread out, which can make k-means converge fast. Usually, one would discard the first (random) data point.
It only works well for low values of k though (it avoids placing centroids in the center of the data set!), and it is not very favorable to multiple runs - it tends to choose the same initial centroids again.
K-means++ can be seen as a more randomized version of this. Instead of always choosing the farthest object, it chooses far objects with increased likelihood, but may at random also choose a near neighbor. This way, you get more diverse results when running it multiple times.
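A minimal sketch of that k-means++ seeding step, for comparison with the farthest-point rule: each new center is drawn with probability proportional to its squared distance to the nearest center chosen so far. Euclidean distance, the function name `kmeanspp_seeds` and the optional `rng` parameter are assumptions for this example.

```python
import math
import random

def kmeanspp_seeds(points, k, rng=None):
    """Choose up to k initial centers by D^2-weighted random sampling."""
    rng = rng or random.Random()
    points = list(points)
    if k < 1 or not points:
        return []
    centers = [rng.choice(points)]
    while len(centers) < min(k, len(points)):
        # squared distance from each point to its nearest chosen center
        weights = [min(math.dist(p, c) for c in centers) ** 2 for p in points]
        if sum(weights) == 0:
            break  # every remaining point coincides with a chosen center
        # far points are likely, near points are possible, chosen centers have weight 0
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

print(kmeanspp_seeds([(0, 0), (0, 1), (10, 10), (10, 11)], 2, random.Random(0)))
```

Passing a seeded `random.Random` makes a run reproducible; different seeds give the varied starting centers that the deterministic farthest-point rule cannot.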
You can try it out in ELKI, it is named FarthestPointsInitialMeans. If you choose the algorithm SingleAssignmentKMeans, then it will not perform k-means iterations, but only do the initial assignment. That will probably give you this "MaxMin clustering" algorithm.

Indexing for similarity search

I have about 100M numeric vectors (Minhash fingerprints), each vector contains 100 integer numbers between 0 and 65536, and I'm trying to do a fast similarity search against this database of fingerprints using Jaccard similarity, i.e. given a query vector (e.g. [1,0,30, 9, 42, ...]) find the ratio of intersection/union of this query set against the database of 100M sets.
The requirement is to return k "nearest neighbors" of the query vector in <1 sec (not including indexing/File IO time) on a laptop. So obviously some kind of indexing is required, and the question is what would be the most efficient way to approach this.
notes:
I thought of using SimHash, but in this case I actually need to know the size of the intersection of the sets to identify containment rather than pure similarity/resemblance, and SimHash would lose that information.
I've tried a simple locality-sensitive hashing technique as described in ch. 3 of Jeffrey Ullman's book: dividing each vector into 20 "bands" or snippets of length 5, converting these snippets into strings (e.g. [1, 2, 45, 2, 3] -> "124523") and using these strings as keys in a hash table, where each key holds its "candidate neighbors". The problem is that it creates too many candidates for some of these snippets, and changing the number of bands doesn't help.
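One thing worth checking in the banding scheme above: concatenating digits without a separator makes different snippets collide ([1, 2, 45, 2, 3] and [12, 4, 5, 2, 3] both become "124523"), which may account for some of the excess candidates. The sketch below keys each band by a tuple plus its band number instead. It is an illustration only; the function names and the choice to include the band number in the key are assumptions, not part of the original setup.

```python
from collections import defaultdict

def band_keys(fingerprint, rows_per_band):
    """Yield one (band number, tuple of values) key per band."""
    for start in range(0, len(fingerprint), rows_per_band):
        yield (start // rows_per_band,
               tuple(fingerprint[start:start + rows_per_band]))

def build_index(fingerprints, rows_per_band=5):
    index = defaultdict(list)
    for doc_id, fp in enumerate(fingerprints):
        for key in band_keys(fp, rows_per_band):
            index[key].append(doc_id)
    return index

def candidates(index, query, rows_per_band=5):
    """Ids of all fingerprints sharing at least one whole band with the query."""
    found = set()
    for key in band_keys(query, rows_per_band):
        found.update(index.get(key, ()))
    return found

def estimated_jaccard(fp_a, fp_b):
    """MinHash estimate: fraction of positions where the fingerprints agree."""
    return sum(1 for x, y in zip(fp_a, fp_b) if x == y) / len(fp_a)

fingerprints = [
    [1, 2, 45, 2, 3, 7, 7, 7, 7, 7],
    [12, 4, 5, 2, 3, 9, 9, 9, 9, 9],  # would collide with the first as "124523"
    [0] * 10,
]
index = build_index(fingerprints)
query = [1, 2, 45, 2, 3, 0, 0, 0, 0, 0]
print(sorted(candidates(index, query)))             # [0, 2]
print(estimated_jaccard(query, fingerprints[0]))    # 0.5
```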
I might be a bit late, but I would suggest IVFADC indexing by Jegou et al.: Product Quantization for Nearest Neighbor Search
It works for L2 Distance/dot product similarity measures and is a bit complex, but it's particularly efficient in terms of both time and memory.
It is also implemented in the FAISS library for similarity search, so you could also take a look at that.
One way to go about this is the following:
(1) Arrange the vectors into a tree (a radix tree).
(2) Query the tree with a fuzzy criterion; in other words, a node matches if the difference in values at that node of the tree is within a threshold
(3) From (2) generate a subtree that contains all the matching vectors
(4) Now, repeat process (2) on the sub tree with a smaller threshold
Continue until the subtree has K items. If the subtree ends up with fewer than K items, take the previous subtree, calculate the Jaccard distance for each of its members, and sort to eliminate the worst matches until you have only K items left.
Answering my own question after 6 years: there is a benchmark for approximate nearest neighbor search that compares many algorithms for this problem: https://github.com/erikbern/ann-benchmarks. The current winner is "Hierarchical Navigable Small World graphs": https://github.com/nmslib/hnswlib
You can use off-the-shelf similarity search services such as AWS-ES or Pinecone.io.

Optimization problem - vector mapping

A and B are sets of N-dimensional vectors (N=10), |B|>=|A| (|A|=10^2, |B|=10^5). The similarity measure sim(a,b) is the dot product (required). The task is the following: for each vector a in A, find a vector b in B such that the sum of similarities ss over all pairs is maximal.
My first attempt was a greedy algorithm:
(1) find the pair with the highest similarity and remove that pair from A, B
(2) repeat (1) until A is empty
But such a greedy algorithm is suboptimal in this case:
a_1=[1, 0]
a_2=[.5, .4]
b_1=[1, 1]
b_2=[.9, 0]
sim(a_1,b_1)=1
sim(a_1,b_2)=.9
sim(a_2,b_1)=.9
sim(a_2, b_2)=.45
The algorithm returns [a_1, b_1] and [a_2, b_2], ss=1.45, but the optimal solution yields ss=1.8.
Is there an efficient algorithm to solve this problem? Thanks
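The counterexample can be checked with a few lines of Python. This is only a sketch for verifying the numbers: `greedy_total` follows the two steps listed above, and `optimal_total` is a brute-force search that is viable only for tiny inputs. Both function names are made up for this example.

```python
from itertools import permutations

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def greedy_total(A, B):
    """Repeatedly take the most similar remaining pair."""
    A, B = list(A), list(B)
    total = 0.0
    while A:
        a, b = max(((a, b) for a in A for b in B),
                   key=lambda pair: dot(*pair))
        total += dot(a, b)
        A.remove(a)
        B.remove(b)
    return total

def optimal_total(A, B):
    """Brute force: try every way of giving each a in A a distinct b in B."""
    return max(sum(dot(a, b) for a, b in zip(A, chosen))
               for chosen in permutations(B, len(A)))

A = [(1, 0), (0.5, 0.4)]
B = [(1, 1), (0.9, 0)]
print(round(greedy_total(A, B), 2))   # 1.45
print(round(optimal_total(A, B), 2))  # 1.8
```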
This is essentially a matching problem in a weighted bipartite graph. Just take the weight function f to be the dot product a·b.
I don't think the special structure of your weight function will simplify the problem a lot, so you're pretty much down to finding a maximum-weight matching.
You can find some basic algorithms for this problem in this wikipedia article. Although at first glance they don't seem viable for your data (V = 10^5, E = 10^7), I would still research them: some of them might allow you to take advantage of your 'lame' set of vertices, with one part orders of magnitude smaller than the other.
This article also seems relevant, although it doesn't list any algorithms.
Not exactly a solution, but hope it helps.
I second Nikita here, it is an assignment (or matching) problem. I'm not sure this is computationally feasible for your problem, but you could use the Hungarian algorithm, also known as Munkres' assignment algorithm, where the cost of assignment (i,j) is the negative of the dot product of ai and bj. Unless you happen to know how the elements of A and B are formed, I think this is the most efficient known algorithm for your problem.
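A compact pure-Python sketch of the Hungarian algorithm for a rectangular cost matrix (rows = vectors of A, columns = vectors of B, at least as many columns as rows) is shown below, with costs set to negated dot products as suggested above. It is the textbook potentials-based formulation, written here as an illustration rather than tuned code; it takes on the order of |A|^2 * |B| steps, so for |B| = 10^5 a compiled implementation would be needed. If SciPy is available, `scipy.optimize.linear_sum_assignment` solves the same rectangular assignment problem.

```python
def hungarian(cost):
    """Minimum-cost assignment of each row to a distinct column.

    cost is a list of n rows with m >= n columns.
    Returns a list where entry i is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-based)
    v = [0.0] * (m + 1)   # column potentials (1-based)
    p = [0] * (m + 1)     # p[j] = row currently matched to column j (0 = none)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:  # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [None] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

A = [(1, 0), (0.5, 0.4)]
B = [(1, 1), (0.9, 0)]
cost = [[-dot(a, b) for b in B] for a in A]  # negate: maximizing similarity
match = hungarian(cost)
print(match)                                                   # [1, 0]
print(round(sum(dot(A[i], B[j]) for i, j in enumerate(match)), 2))  # 1.8
```

On the example from the question it pairs a_1 with b_2 and a_2 with b_1, giving the optimal total of 1.8.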

Fast way to compute the minimal distance of two sets of k-dimensional vectors

I have two sets of k-dimensional vectors, where k is around 500 and the number of vectors is usually smaller. I want to compute the (arbitrarily defined) minimal distance between the two sets.
A naive approach would be this:
(loop for a in set1
      minimizing (loop for b in set2
                       minimizing (distance a b)))
However, this requires O(n² * distance) computations. Is there a faster way of doing this?
I don't think you can do better than O(n^2) when the distance is arbitrary (you have to examine each of the possible distances!). For a given distance function we might be able to exploit the properties of the function, but there won't be any general algorithm that works with any distance function in better than O(n^2) time (i.e. in o(n^2); note the little-o).
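For an arbitrary distance function the baseline is therefore the plain double loop. A Python rendering of it might look like this; the function name `min_set_distance` is made up, and `math.dist` (Euclidean, Python 3.8+) is only a default so that the example runs.

```python
import math

def min_set_distance(set1, set2, distance=math.dist):
    """Smallest distance between any a in set1 and any b in set2.

    Evaluates `distance` once for every (a, b) pair.
    """
    return min(distance(a, b) for a in set1 for b in set2)

print(min_set_distance([(0, 0), (5, 5)], [(1, 0), (9, 9)]))  # 1.0
```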
If your data is dynamic and you have to keep obtaining the closest pair of points at different times, for arbitrary distance function the following papers by Eppstein will probably help (which have special update operations in order to make finding the closest pair of points quick):
http://www.ics.uci.edu/~eppstein/projects/pairs/Papers/Epp-SODA-98.pdf. [O(nlog^2(n)) update time]
http://academic.research.microsoft.com/Paper/1847461.aspx
You will be able to adapt the above one-set algorithms to a two-set algorithm (for instance, by defining the distance between points of the same set to be infinity).
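The "infinite distance within a set" trick can be sketched as a small wrapper around any base distance. Everything here is illustrative: points are tagged with a set label, `cross_set_only` is a made-up name, and `closest_pair` is a brute-force stand-in for a real one-set closest-pair structure such as the ones in the papers above.

```python
import math
from itertools import combinations

def cross_set_only(base_distance):
    """Wrap a distance so that points from the same set are infinitely far apart."""
    def distance(p, q):
        (label_p, vec_p), (label_q, vec_q) = p, q
        if label_p == label_q:
            return math.inf
        return base_distance(vec_p, vec_q)
    return distance

def closest_pair(points, distance):
    # brute-force placeholder for a one-set closest-pair algorithm
    return min(combinations(points, 2), key=lambda pair: distance(*pair))

tagged = [("A", (0, 0)), ("A", (0, 1)), ("B", (5, 5)), ("B", (0, 3))]
print(closest_pair(tagged, cross_set_only(math.dist)))
# (('A', (0, 1)), ('B', (0, 3)))
```

The two points of set A are only 1 apart, but the wrapper hides that pair, so the result is the closest pair across the two sets.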
For Euclidean-type (L^p) distances, there are known O(n log n) time algorithms, which work with a given set of points (i.e. you don't need any special update algorithms):
http://www.cse.iitd.ernet.in/~ssen/cs852/scribe/scribe2/lec.pdf
http://en.wikipedia.org/wiki/Closest_pair_of_points_problem
Of course, the L^p is for one set, but you might be able to adapt it for two sets.
If you give your distance function, it might be easier for us to help you.
Hope it helps. Good luck!
If the components of your vectors are scalars I would guess that for your case of a moderate k=500 the O(n²) approach is probably as fast as you can get. You can simplify your calculation by minimizing distance². Also, the distance(A_i, B_i) = distance(B_i, A_i), so make sure you only compare them once (you only have 500!/(500-2)! pairs, not 500²).
If the components are m-dimensional vectors A and B instead, you could store the components of vector A in an R-tree or a kd-tree and then find the closest pair by iterating over all components of vector B and finding each one's closest partner from A; this would be about O(n log n) on average. Don't forget that big-O is for n->infinity, so the trees might come with some pretty expensive constant term (i.e. this approach might only make sense for large k or if vector A is always the same).
Put the two sets of coordinates into a Spatial Index, e.g. a KD-tree.
You then compute the intersection of these two indices.

Sort array in ascending order while minimizing "cost"

I'm taking comp 2210 (Data Structures) next semester and I've been doing the homework for the summer semester that is posted online. Until now, I've had no problems doing the assignments. Take a look at assignment 4 below, and see if you can give me a hint as to how to approach it. Please don't provide a complete algorithm, just an approach. Thanks!
A “costed sort” is an algorithm in which a sequence of values must be arranged in ascending order. The sort is
carried out by interchanging the position of two values one at a time until the sequence is in the proper order. Each
interchange incurs a cost, which is calculated as the sum of the two values involved in the interchange. The total
cost of the sort is the sum of the cost of the interchanges.
For example, suppose the starting sequence were {3, 2, 1}. One possible
series of interchanges is
Interchange 1: {3, 1, 2} interchange cost = 3
Interchange 2: {1, 3, 2} interchange cost = 4
Interchange 3: {1, 2, 3} interchange cost = 5,
giving a total cost of 12.
You are to write a program that determines the minimal cost to arrange a specific sequence of numbers.
Edit: The professor does not allow brute forcing.
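To make the cost definition concrete, here is a small helper that only evaluates a given series of interchanges; it does not search for the cheapest one. The name `interchange_cost` and the representation of an interchange as a pair of indices are assumptions for this sketch.

```python
def interchange_cost(sequence, swaps):
    """Apply index-pair swaps in order; return (resulting list, total cost)."""
    values = list(sequence)
    total = 0
    for i, j in swaps:
        total += values[i] + values[j]  # cost = sum of the two values swapped
        values[i], values[j] = values[j], values[i]
    return values, total

# Swapping the first and last element sorts {3, 2, 1} in one interchange.
print(interchange_cost([3, 2, 1], [(0, 2)]))  # ([1, 2, 3], 4)
```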
If you want to surprise your professor, you could use Simulated Annealing. Then again, if you manage that, you can probably skip a few courses :). Note that this algorithm will only give an approximate answer.
Otherwise: try a Backtracking algorithm, or Branch and Bound. These will both find the optimal answer.
What do you mean "brute forcing?" Do you mean "try all possible combinations and select the cheapest?" Just checking.
I think "branch and bound" is what you're looking for; check any source on algorithms. It is "like" brute force, except that as you try a sequence of moves, as soon as its cost so far exceeds the cost of the best complete sequence found so far, you can abandon it. This is one flavor of the "backtracking" mentioned above.
My preferred language for doing this would be Prolog but I'm weird.
Simulated Annealing is a PROBABILISTIC algorithm: if the solution space has local minima, you may get trapped in one and end up with what you think is the right answer but isn't. There are ways around that, and there is plenty of literature about it, but I don't agree that it's the tool you want.
You could try the related genetic algorithms too if that's the way you want to go.
Have you learned trees? You could create a tree with all possible changes leading to the desired result. The trick, of course, is to avoid creating the whole tree -- particularly when a part of it is obviously not the best solution, right?
I think the appropriate approach is to think hard about what defining properties a minimal "cost" sort has. Then figure out the cost by simulating this ideal sort. The key element here is you don't have to implement a general minimal cost sorting algorithm.
For example, let's say the defining property of a minimal cost sort is that every exchange puts at least one of the exchanged elements in its sorted position (I don't know if this is true). Every exchange-based sort would love to have this property, but it's not easy (possible?) in the general case. However, you can easily create a program that takes an unsorted array, takes the sorted version (which itself can be generated by a suboptimal algorithm), and then uses this information to decide the minimum cost of getting from the unsorted array to the sorted one.
Description
I think the cheapest way to do this is to swap the cheapest misplaced item with the item that belongs in its spot. I believe this reduces cost by moving the most expensive things just once. If there are n elements that are out of place, then there will be at most n-1 swaps to put them in place, at a cost of (n-1) * cost of the least item + cost of all the other out-of-place items.
If the globally cheapest element is not misplaced and the spread between this cheapest one and the cheapest misplaced one is great enough, it can be cheaper to first swap the cheapest one, which is already in its correct place, with the cheapest misplaced one. The cost then is (n+1) * cheapest + cost of all out of place + cost of cheapest out of place, since the cheapest element has to be swapped in and back out again.
Example
For [4,1,2,3], this algorithm exchanges (1,2) to produce:
[4,2,1,3]
and then swaps (3,1) to produce:
[4,2,3,1]
and then swaps (4,1) to produce:
[1,2,3,4]
Notice that each misplaced item in [2,3,4] is moved only once and is swapped with the lowest cost item.
Code
Ooops: "Please don't provide a complete algorithm, just an approach." Removed my code.
In an effort to only get you going on it this may not make complete sense.
Determine all possible moves, and the cost for each move and store those somehow, perform the least expensive move, then determine the moves that can be performed from this variation, storing those with the rest of your stored moves, perform the least expensive etc until the array is sorted.
I love solving things like this.
This problem is also known as the Silly Sort problem in some ACM contests. Take a look at this solution using Divide & Conquer.
Try different sorting algorithms on the same input data and print the minimum.

Resources