Given that I have two lists that each contain a separate subset of a common superset, is
there an algorithm to give me a similarity measurement?
Example:
A = { John, Mary, Kate, Peter } and B = { Peter, James, Mary, Kate }
How similar are these two lists? Note that I do not know all elements of the common superset.
Update:
I was unclear and I have probably used the word 'set' in a sloppy fashion. My apologies.
Clarification: Order is of importance.
If identical elements occupy the same position in the list, we have the highest similarity for that element.
The similarity decreases the farther apart the identical elements are.
The similarity is even lower if the element only exists in one of the lists.
I could even add the extra dimension that lower indices are of greater value, so a[1] == b[1] is worth more than a[9] == b[9], but that is mainly because I am curious.
The Jaccard Index (aka Tanimoto coefficient) is used precisely for the use case recited in the OP's question.
The Tanimoto coeff, tau, is equal to Nc divided by Na + Nb - Nc, or
tau = Nc / (Na + Nb - Nc)
Na, number of items in the first set
Nb, number of items in the second set
Nc, size of the intersection of the two sets, i.e. the number of unique items common to both a and b
Here's Tanimoto coded as a Python function:
def tanimoto(x, y):
    # items of x that also occur in y (the intersection)
    w = [ns for ns in x if ns in y]
    return len(w) / float(len(x) + len(y) - len(w))
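As a worked example of the formula, here is a minimal set-based sketch applied to the two lists from the question. The function name `jaccard` and the convention for two empty inputs are my own assumptions, not part of the definition above.

```python
def jaccard(a, b):
    """Jaccard (Tanimoto) similarity of two collections, ignoring order and duplicates."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(a | b)

A = ["John", "Mary", "Kate", "Peter"]
B = ["Peter", "James", "Mary", "Kate"]
print(jaccard(A, B))  # 3 shared names out of 5 distinct ones: 0.6
```

Note that this measure ignores position entirely, so it does not capture the ordering requirement from the question's update.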
I would explore two strategies:
Treat the lists as sets and apply set ops (intersection, difference)
Treat the lists as strings of symbols and apply the Levenshtein algorithm
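The second strategy can be sketched as follows, treating each list element as one symbol. This is the textbook dynamic-programming form of Levenshtein distance with unit costs; the function name is an assumption of this sketch.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1 each)."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute, or match at no cost
        prev = cur
    return prev[-1]

A = ["John", "Mary", "Kate", "Peter"]
B = ["Peter", "James", "Mary", "Kate"]
print(levenshtein(A, B))  # 3
```

Unlike the set operations, this is sensitive to order: the shared run "Mary, Kate" is matched, while "Peter" at opposite ends of the two lists is not.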
If you truly have sets (i.e., an element is simply either present or absent, with no count attached) and only two of them, just counting the number of shared elements and dividing by the total number of elements is probably about as good as it gets.
If you have (or can get) counts and/or more than two of them, you can do a bit better than that with something like cosine similarity or TF-IDF (term frequency * inverse document frequency).
The latter attempts to give lower weighting to words that appear in all (or nearly all) the "documents" -- i.e., sets of words.
What is your definition of "similarity measurement"? If all you want is how many items the two sets have in common, you could find the cardinalities of A and B, add them together, and subtract the cardinality of the union of A and B.
If order matters, you can use Levenshtein distance or another kind of edit distance.
Can we normalize the Levenshtein edit distance by dividing it by the length of the two strings?
I am asking because, when we compare two strings of unequal length, the difference between their lengths is counted as well.
For example:
ed('has a', 'has a ball') = 5 and ed('has a', 'has a ball the is round') = 18.
If we increase the length of one string, the edit distance increases even though the strings are similar.
Therefore, I cannot fix a threshold for what a good edit distance value should be.
Yes, normalizing the edit distance is one way to put the differences between strings on a single scale from "identical" to "nothing in common".
A few things to consider:
Whether or not the normalized distance is a better measure of similarity between strings depends on the application. If the question is "how likely is this word to be a misspelling of that word?", normalization is a way to go. If it's "how much has this document changed since the last version?", the raw edit distance may be a better option.
If you want the result to be in the range [0, 1], you need to divide the distance by the maximum possible distance between two strings of given lengths. That is, length(str1)+length(str2) for the LCS distance and max(length(str1), length(str2)) for the Levenshtein distance.
The normalized distance is not a metric, as it violates the triangle inequality.
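A minimal sketch of the Levenshtein case, dividing by the length of the longer string so that the result falls in [0, 1]. The function names and the choice to return 0.0 for two empty strings are assumptions of this sketch.

```python
def levenshtein(s1, s2):
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1]

def normalized_levenshtein(s1, s2):
    """Distance divided by its maximum possible value, max(len(s1), len(s2))."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 0.0  # assumption: two empty strings are identical
    return levenshtein(s1, s2) / longest

print(normalized_levenshtein("has a", "has a ball"))  # 0.5
print(normalized_levenshtein("has a", "has a ball the is round"))
```

With this scaling 0.0 means identical and 1.0 means nothing in common, so a single threshold can be applied to string pairs of different lengths.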
I used the following successfully:
const size_t len = std::max(s1.length(), s2.length());
// normalize by length, high score wins (assumes at least one string is non-empty)
const float fDist = float(len - levenshteinDistance(s1, s2)) / float(len);
Then chose the highest score. 1.0 means an exact match.
I had used a normalized edit distance or similarity (NES) which I think is very useful, defined by Daniel Lopresti and Jiangyin Zhou, in Equation (6) of their work: http://www.cse.lehigh.edu/~lopresti/Publications/1996/sdair96.pdf.
The NES in python is:
import math

def normalized_edit_similarity(m, d):
    # d : edit distance between the two strings
    # m : length of the shorter string
    if d >= m:
        # assumption: the formula is undefined at d == m, so treat this as
        # "no similarity" (the limit of the expression as d approaches m)
        return 0.0
    return 1.0 / math.exp(d / (m - d))

print(normalized_edit_similarity(3, 0))
print(normalized_edit_similarity(3, 1))
print(normalized_edit_similarity(4, 1))
print(normalized_edit_similarity(5, 1))
print(normalized_edit_similarity(5, 2))
1.0
0.6065306597126334
0.7165313105737893
0.7788007830714049
0.513417119032592
More examples can be found in Table 2 in the above paper.
The variable m in the above function can be replaced with the length of the longer string, depending on your application.
So I have 2 lists of objects, with a position for each one. I would like to match every object from the first list with an object of the second list.
Once the object of the second list is selected for a match up, we remove it from the list (thus it can not be matched with another one). And most importantly, the total sum of distances between the matched up objects should be the least possible.
For example:
list1 { A, B, C } list2 { X, Y, Z }
So if I match up A->X (dist: 3meters) B->Z (dist: 2meters) C->Y (dist: 4meters)
Total sum = 3 + 2 + 4 = 9meters
We could have another match up with A->Y (4meters) B->X (1meter) C->Z (3meters)
Total sum = 4 + 1 + 3 = 8meters <======= Better solution
Thank you for your help.
Extra: Lists could have different length.
This problem is known as the Assignment Problem (a weighted matching in bipartite graphs).
An algorithm which solves this is the Hungarian algorithm. At the bottom of the Wikipedia article there is also a list of implementations.
If your data has special properties, e.g. your two sets are 2D points and the weight of an edge is the Euclidean distance, then there are better algorithms for this.
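To make the problem statement concrete, here is a brute-force sketch that tries every possible matching. It is not the Hungarian algorithm and is only usable for tiny inputs. The distances A->Z, B->Y and C->X are not given in the question, so the values 5, 6 and 7 below are made up for illustration; the function name is also hypothetical.

```python
from itertools import permutations

def best_assignment(cost):
    """Brute-force minimum-cost matching.

    cost[i][j] is the distance from object i of list 1 to object j of list 2.
    Requires len(cost) <= len(cost[0]); every object of list 1 gets a distinct partner.
    """
    rows, cols = len(cost), len(cost[0])
    best_total, best_cols = None, None
    for chosen in permutations(range(cols), rows):   # every injective mapping
        total = sum(cost[i][j] for i, j in enumerate(chosen))
        if best_total is None or total < best_total:
            best_total, best_cols = total, chosen
    return best_total, best_cols

# rows: A, B, C; columns: X, Y, Z (A->Z, B->Y, C->X are assumed values)
cost = [[3, 4, 5],
        [1, 6, 2],
        [7, 4, 3]]
total, cols = best_assignment(cost)
print(total, [("ABC"[i], "XYZ"[j]) for i, j in enumerate(cols)])
```

Because it enumerates partial permutations of the longer list, the sketch also covers lists of different lengths, as long as the shorter list is used for the rows. For anything beyond a handful of objects, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the more sensible choice.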
Recently I needed to do weighted random selection of elements from a list, both with and without replacement. While there are well known and good algorithms for unweighted selection, and some for weighted selection without replacement (such as modifications of the reservoir algorithm), I couldn't find any good algorithms for weighted selection with replacement. I also wanted to avoid the reservoir method, as I was selecting a significant fraction of the list, which is small enough to hold in memory.
Does anyone have any suggestions on the best approach in this situation? I have my own solutions, but I'm hoping to find something more efficient, simpler, or both.
One of the fastest ways to draw many samples with replacement from an unchanging list is the alias method. The core intuition is that we can create a set of equal-sized bins for the weighted list that can be indexed very efficiently through bit operations, to avoid a binary search. It will turn out that, done correctly, we need to store at most two items from the original list per bin, and thus can represent the split with a single percentage.
Let us take the example of five equally weighted choices, (a:1, b:1, c:1, d:1, e:1).
To create the alias lookup:
1. Normalize the weights such that they sum to 1.0: (a:0.2 b:0.2 c:0.2 d:0.2 e:0.2). This is the probability of choosing each item.
2. Find the smallest power of 2 greater than or equal to the number of variables, and create this number of partitions, |p|. Each partition represents a probability mass of 1/|p|. In this case, we create 8 partitions, each able to contain 0.125.
3. Take the variable with the least remaining weight, and place as much of its mass as possible in an empty partition. In this example, we see that a fills the first partition. (p1{a|null,1.0},p2,p3,p4,p5,p6,p7,p8) with (a:0.075, b:0.2 c:0.2 d:0.2 e:0.2)
4. If the partition is not filled, take the variable with the most weight, and fill the partition with that variable.
5. Repeat steps 3 and 4 until none of the weight from the original list remains to be assigned to a partition.
For example, if we run another iteration of 3 and 4, we see
(p1{a|null,1.0},p2{a|b,0.6},p3,p4,p5,p6,p7,p8) with (a:0, b:0.15 c:0.2 d:0.2 e:0.2) left to be assigned
At runtime:
Get a U(0,1) random number, say binary 0.001100000
Shift it left by lg2(|p|) bits to find the partition index. Here we shift by 3, yielding 001.1, i.e. position 1, and thus partition 2.
If the partition is split, use the fractional part of the shifted random number to decide the split. In this case, the value is 0.5, and 0.5 < 0.6, so return a.
Here is some code and another explanation, but unfortunately it doesn't use the bitshifting technique, nor have I actually verified it.
A simple approach that hasn't been mentioned here is one proposed in Efraimidis and Spirakis. In python you could select m items from n >= m weighted items with strictly positive weights stored in weights, returning the selected indices, with:
import heapq
import math
import random
def WeightedSelectionWithoutReplacement(weights, m):
    elt = [(math.log(random.random()) / weights[i], i) for i in range(len(weights))]
    return [x[1] for x in heapq.nlargest(m, elt)]
This is very similar in structure to the first approach proposed by Nick Johnson. Unfortunately, that approach is biased in selecting the elements (see the comments on the method). Efraimidis and Spirakis proved that their approach is equivalent to random sampling without replacement in the linked paper.
Here's what I came up with for weighted selection without replacement:
def WeightedSelectionWithoutReplacement(l, n):
    """Selects without replacement n random elements from a list of (weight, item) tuples."""
    l = sorted((random.random() * x[0], x[1]) for x in l)
    return l[-n:]
This is O(m log m) on the number of items in the list to be selected from. I'm fairly certain this will weight items correctly, though I haven't verified it in any formal sense.
Here's what I came up with for weighted selection with replacement:
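import bisect
import random

def WeightedSelectionWithReplacement(l, n):
    """Selects with replacement n random elements from a list of (weight, item) tuples."""
    cum_weights = []  # running totals, searched with bisect
    items = []        # items[i] owns the interval ending at cum_weights[i]
    total_weight = 0.0
    for weight, item in l:
        total_weight += weight
        cum_weights.append(total_weight)
        items.append(item)
    return [items[bisect.bisect(cum_weights, random.random() * total_weight)]
            for _ in range(n)]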
This is O(m + n log m), where m is the number of items in the input list, and n is the number of items to be selected.
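For comparison, Python 3.6 and later ship `random.choices`, which performs weighted selection with replacement using the same cumulative-weights-plus-bisection idea. A minimal sketch keeping the (weight, item) tuple layout used above; the wrapper name is hypothetical.

```python
import random

def weighted_selection_with_replacement(pairs, n):
    """pairs: list of (weight, item) tuples; returns n items drawn with replacement."""
    weights = [w for w, _ in pairs]
    items = [item for _, item in pairs]
    return random.choices(items, weights=weights, k=n)

picks = weighted_selection_with_replacement([(1, "a"), (3, "b"), (6, "c")], 10)
print(picks)
```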
I'd recommend you start by looking at section 3.4.2 of Donald Knuth's Seminumerical Algorithms.
If your arrays are large, there are more efficient algorithms in chapter 3 of Principles of Random Variate Generation by John Dagpunar. If your arrays are not terribly large or you're not concerned with squeezing out as much efficiency as possible, the simpler algorithms in Knuth are probably fine.
It is possible to do Weighted Random Selection with replacement in O(1) time, after first creating an additional O(N)-sized data structure in O(N) time. The algorithm is based on the Alias Method developed by Walker and Vose, which is well described here.
The essential idea is that each bin in a histogram would be chosen with probability 1/N by a uniform RNG. So we will walk through it, and for any underpopulated bin which would receive excess hits, assign the excess to an overpopulated bin. For each bin, we store the percentage of hits which belong to it, and the partner bin for the excess. This version tracks small and large bins in place, removing the need for an additional stack. It uses the index of the partner (stored in bucket[1]) as an indicator that they have already been processed.
Here is a minimal python implementation, based on the C implementation here
import random

def prep(weights):
    data_sz = len(weights)
    factor = data_sz/float(sum(weights))
    # each bucket is [scaled weight, partner index]; initially its own partner
    data = [[w*factor, i] for i,w in enumerate(weights)]
    big=0
    while big<data_sz and data[big][0]<=1.0: big+=1
    for small,bucket in enumerate(data):
        # a changed partner index marks a bucket that was already processed
        if bucket[1] != small: continue
        excess = 1.0 - bucket[0]
        while excess > 0:
            if big==data_sz: break
            bucket[1] = big
            bucket = data[big]
            bucket[0] -= excess
            excess = 1.0 - bucket[0]
            if (excess >= 0):
                big+=1
                while big<data_sz and data[big][0]<=1: big+=1
    return data

def sample(data):
    r=random.random()*len(data)
    idx = int(r)
    return data[idx][1] if r-idx > data[idx][0] else idx
Example usage:
TRIALS=1000
weights = [20,1.5,9.8,10,15,10,15.5,10,8,.2]
samples = [0]*len(weights)
data = prep(weights)
for _ in range(int(sum(weights)*TRIALS)):
samples[sample(data)]+=1
result = [float(s)/TRIALS for s in samples]
err = [a-b for a,b in zip(result,weights)]
print(result)
print([round(e,5) for e in err])
print(sum([e*e for e in err]))
The following is a description of random weighted selection of an element of a
set (or multiset, if repeats are allowed), both with and without replacement in O(n) space
and O(log n) time.
It consists of implementing a binary search tree, sorted by the elements to be
selected, where each node of the tree contains:
the element itself (element),
the un-normalized weight of the element (elementweight),
the sum of all the un-normalized weights of the left-child node and all of
its children (leftbranchweight), and
the sum of all the un-normalized weights of the right-child node and all of
its children (rightbranchweight).
Then we randomly select an element from the BST by descending down the tree. A
rough description of the algorithm follows. The algorithm is given a node of
the tree. Then the values of leftbranchweight, rightbranchweight,
and elementweight of node are summed, and the weights are divided by this
sum, resulting in the values leftbranchprobability,
rightbranchprobability, and elementprobability, respectively. Then a
random number between 0 and 1 (randomnumber) is obtained.
if randomnumber is less than elementprobability,
    return the element (for selection without replacement, first remove it
    from the BST as normal, updating leftbranchweight and rightbranchweight
    of all the necessary nodes).
else if randomnumber is less than (elementprobability + leftbranchprobability)
    recurse on leftchild (run the algorithm using leftchild as node)
else
    recurse on rightchild
When we finally find, using these weights, which element is to be returned, we either simply return it (with replacement) or we remove it and update relevant weights in the tree (without replacement).
DISCLAIMER: The algorithm is rough, and a treatise on the proper implementation
of a BST is not attempted here; rather, it is hoped that this answer will help
those who really need fast weighted selection without replacement (like I do).
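A rough sketch of the idea in Python, under some simplifying assumptions that differ from the description above: each node stores the total weight of its whole subtree instead of separate left and right sums, a single random number is drawn and rescaled during the descent instead of drawing a new one at every node, and "removal" is modelled by setting the chosen node's weight to zero rather than deleting it from the tree. All names are hypothetical.

```python
import random

class Node:
    def __init__(self, element, weight, left=None, right=None):
        self.element, self.weight = element, weight
        self.left, self.right = left, right
        self.update()

    def update(self):
        """Recompute the total weight of this node's subtree."""
        self.total = self.weight + total_of(self.left) + total_of(self.right)

def total_of(node):
    return node.total if node is not None else 0.0

def build(pairs):
    """Build a balanced tree from a sorted list of (element, weight) pairs."""
    if not pairs:
        return None
    mid = len(pairs) // 2
    element, weight = pairs[mid]
    return Node(element, weight, build(pairs[:mid]), build(pairs[mid + 1:]))

def select(root, remove=False):
    """Pick an element with probability proportional to its weight, in O(depth)."""
    r = random.random() * root.total
    path, node = [], root
    while True:
        path.append(node)
        left_total = total_of(node.left)
        if r < left_total:
            node = node.left                       # target lies in the left subtree
        elif r < left_total + node.weight or node.right is None:
            break                                  # target is this node
        else:
            r -= left_total + node.weight          # skip left subtree and this node
            node = node.right
    if remove:
        node.weight = 0.0                          # without replacement: never pick again
        for visited in reversed(path):             # refresh totals from the bottom up
            visited.update()
    return node.element

tree = build(sorted([("a", 1.0), ("b", 2.0), ("c", 3.0)]))
print(select(tree))                                # with replacement
print([select(tree, remove=True) for _ in range(3)])  # without replacement
```

Zeroing the weight keeps the tree shape fixed, so no rebalancing is needed; a real implementation that also inserts and deletes elements would need a proper balanced BST.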
This is an old question for which numpy now offers an easy solution, so I thought I would mention it. Since version 1.7, numpy.random.choice allows the sampling to be done with or without replacement and with given weights.
Suppose you want to sample 3 elements without replacement from the list ['white','blue','black','yellow','green'] with a prob. distribution [0.1, 0.2, 0.4, 0.1, 0.2]. Using numpy.random module it is as easy as this:
import numpy.random as rnd
sampling_size = 3
domain = ['white','blue','black','yellow','green']
probs = [.1, .2, .4, .1, .2]
sample = rnd.choice(domain, size=sampling_size, replace=False, p=probs)
# in short: rnd.choice(domain, sampling_size, False, probs)
print(sample)
# Possible output: ['white' 'black' 'blue']
Setting the replace flag to True, you have a sampling with replacement.
More info here:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html#numpy.random.choice
We needed to randomly select K validators out of N candidates once per epoch, in proportion to their stakes. This gave us the following problem:
Imagine probabilities of each candidate:
0.1
0.1
0.8
After 1'000'000 selections of 2 out of 3 without replacement, each candidate's share of the selections became:
0.254315
0.256755
0.488930
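Note that the original probabilities are not achievable when selecting 2 of 3 without replacement: a candidate can be picked at most once per selection, so no candidate's share of the picks can exceed 1/2, and 0.8 is out of reach.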
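The observed shares can be reproduced exactly if we assume the validators are drawn one at a time, each draw proportional to the stakes of the candidates still remaining (the selection procedure is not spelled out above, so this is an assumption). A small sketch that enumerates every ordered outcome; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def selection_shares(probs, k):
    """Exact share of all picks per candidate when k distinct candidates are
    drawn sequentially, each draw proportional to the remaining weights."""
    n = len(probs)
    inclusion = [Fraction(0)] * n
    for order in permutations(range(n), k):
        p = Fraction(1)
        remaining = sum(probs)
        for i in order:
            p *= probs[i] / remaining   # chance of drawing i next
            remaining -= probs[i]
        for i in order:
            inclusion[i] += p           # i is included in this outcome
    return [x / k for x in inclusion]   # normalize so the shares sum to 1

probs = [Fraction(1, 10), Fraction(1, 10), Fraction(8, 10)]
shares = selection_shares(probs, 2)
print([float(s) for s in shares])
```

The exact shares are 23/90, 23/90 and 22/45, roughly 0.2556, 0.2556 and 0.4889, which is close to the simulated figures.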
But we want the initial probabilities to also be the reward distribution; otherwise small candidate pools become more profitable. So we realized that random selection with replacement would help: keep drawing until K distinct validators out of N have been selected (which takes at least K draws), and store how many times each validator was drawn as its weight for reward distribution:
std::vector<int> validators;
std::vector<int> weights(n);
int totalWeights = 0;
for (int j = 0; validators.size() < m; j++) {
    int value = rand() % likehoodsSum;
    for (int i = 0; i < n; i++) {
        if (value < likehoods[i]) {
            if (weights[i] == 0) {
                validators.push_back(i);
            }
            weights[i]++;
            totalWeights++;
            break;
        }
        value -= likehoods[i];
    }
}
It gives an almost original distribution of rewards on millions of samples:
0.101230
0.099113
0.799657
I have a symmetric matrix like the one shown in the image attached below.
I've made up the notation A.B which represents the value at grid point (A, B). Furthermore, writing A.B.C gives me the minimum grid point value like so: MIN((A,B), (A,C), (B,C)).
As another example A.B.D gives me MIN((A,B), (A,D), (B,D)).
My goal is to find the minimum values for ALL combinations of letters (not repeating) for one row at a time e.g for this example I need to find min values with respect to row A which are given by the calculations:
A.B = 6
A.C = 8
A.D = 4
A.B.C = MIN(6,8,6) = 6
A.B.D = MIN(6, 4, 4) = 4
A.C.D = MIN(8, 4, 2) = 2
A.B.C.D = MIN(6, 8, 4, 6, 4, 2) = 2
I realize that certain calculations can be reused which becomes increasingly important as the matrix size increases, but the problem is finding the most efficient way to implement this reuse.
Can anyone point me in the right direction to finding an efficient algorithm/data structure I can use for this problem?
You'll want to think about the lattice of subsets of the letters, ordered by inclusion. Essentially, you have a value f(S) given for every subset S of size 2 (that is, every off-diagonal element of the matrix - the diagonal elements don't seem to occur in your problem), and the problem is to find, for each subset T of size greater than two, the minimum f(S) over all S of size 2 contained in T. (And then you're interested only in sets T that contain a certain element "A" - but we'll disregard that for the moment.)
First of all, note that if you have n letters, this amounts to asking Omega(2^n) questions, roughly one for each subset. (Excluding the zero- and one-element subsets and those that don't include "A" saves you n + 1 sets and a factor of two, respectively, which is allowed for big Omega.) So if you want to store all these answers for even moderately large n, you'll need a lot of memory. If n is large in your applications, it might be best to store some collection of pre-computed data and do some computation whenever you need a particular data point; I haven't thought about what would work best, but for example computing data only for a binary tree contained in the lattice would not necessarily help you any more than precomputing nothing at all.
With these things out of the way, let's assume you actually want all the answers computed and stored in memory. You'll want to compute these "layer by layer", that is, starting with the three-element subsets (since the two-element subsets are already given by your matrix), then four-element, then five-element, etc. This way, for a given subset T, when we're computing f(T) we will already have computed f(S) for all S strictly contained in T. There are several ways that you can make use of this, but I think the easiest might be to use two such subsets: let t1 and t2 be two different elements of T that you may select however you like; let S be the subset of T that you get when you remove t1 and t2. Write S1 for S plus t1 and write S2 for S plus t2. Now every pair of letters contained in T is either fully contained in S1, or it is fully contained in S2, or it is {t1, t2}. Look up f(S1) and f(S2) in your previously computed values, then look up f({t1, t2}) directly in the matrix, and store f(T) = the minimum of these 3 numbers.
If you never select "A" for t1 or t2, then indeed you can compute everything you're interested in while not computing f for any sets T that don't contain "A". (This is possible because the steps outlined above are only interesting whenever T contains at least three elements.) Good! This leaves just one question - how to store the computed values f(T). What I would do is use a 2^(n-1)-sized array; represent each subset-of-your-alphabet-that-includes-"A" by the (n-1) bit number where the ith bit is 1 whenever the (i+1)th letter is in that set (so 0010110, which has bits 1, 2, and 4 set, represents the subset {"A", "C", "D", "F"} out of the alphabet "A" .. "H" - note I'm counting bits starting at 0 from the right, and letters starting at "A" = 0). This way, you can actually iterate through the sets in numerical order and don't need to think about how to iterate through all k-element subsets of an n-element set. (You do need to include a special case for when the set under consideration has 0 or 1 element, in which case you'll want to do nothing, or 2 elements, in which case you just copy the value from the matrix.)
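The bitmask scheme can be sketched in a few lines of Python. This is one possible reading of the description, always taking the two lowest set bits as t1 and t2; the function name is hypothetical, and the matrix values are those implied by the question's example (the diagonal is never read, so zeros are used as placeholders).

```python
def min_over_subsets(matrix):
    """f[mask] = minimum pairwise value over the subset made of letter 0 ("A")
    plus every letter i+1 whose bit i is set in mask."""
    n = len(matrix)
    f = [None] * (1 << (n - 1))           # f[0] is the set {"A"} alone: no pairs
    for mask in range(1, 1 << (n - 1)):   # numerical order visits subsets first
        low = mask & -mask                # lowest set bit
        rest = mask ^ low
        t1 = low.bit_length()             # bit i corresponds to letter i + 1
        if rest == 0:
            f[mask] = matrix[0][t1]       # two-element set {"A", t1}: copy from matrix
        else:
            low2 = rest & -rest           # second-lowest set bit
            t2 = low2.bit_length()
            f[mask] = min(f[mask ^ low2],   # S1 = T without t2
                          f[mask ^ low],    # S2 = T without t1
                          matrix[t1][t2])   # the one pair in neither S1 nor S2
    return f

# letters A, B, C, D; values taken from the example in the question
matrix = [[0, 6, 8, 4],
          [6, 0, 6, 4],
          [8, 6, 0, 2],
          [4, 4, 2, 0]]
f = min_over_subsets(matrix)
print(f[0b011], f[0b101], f[0b110], f[0b111])  # A.B.C, A.B.D, A.C.D, A.B.C.D
```

Each entry costs O(1) once the smaller subsets are known, so the whole table takes time and memory proportional to 2^(n-1).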
Well, it looks simple to me, but perhaps I misunderstand the problem. I would do it like this:
let P be a pattern string in your notation X1.X2. ... .Xn, where Xi is a column in your matrix
first compute the array CS = [ (X1, X2), (X1, X3), ... (X1, Xn) ], which contains all combinations of X1 with every other element in the pattern; CS has n-1 elements, and you can easily build it in O(n)
now you must compute min (CS), i.e. finding the minimum value of the matrix elements corresponding to the combinations in CS; again you can easily find the minimum value in O(n)
done.
Note: since your matrix is symmetric, given P you just need to compute CS by combining the first element of P with all other elements: (X1, Xi) is equal to (Xi, X1)
If your matrix is very large, and you want to do some optimization, you may consider prefixes of P: let me explain with an example
when you have solved the problem for P = X1.X2.X3, store the result in an associative map, where X1.X2.X3 is the key
later on, when you solve a problem P' = X1.X2.X3.X7.X9.X10.X11 you search for the longest prefix of P' in your map: you can do this by starting with P' and removing one component (Xi) at a time from the end until you find a match in your map or you end up with an empty string
if you find a prefix of P' in your map then you already know the solution for that problem, so you just have to find the solution for the problem resulting from combining the first element of the prefix with the suffix, and then compare the two results: in our example the prefix is X1.X2.X3, and so you just have to solve the problem for X1.X7.X9.X10.X11, and then compare the two values and choose the min (don't forget to update your map with the new pattern P')
if you don't find any prefix, then you must solve the entire problem for P' (and again don't forget to update the map with the result, so that you can reuse it in the future)
This technique is essentially a form of memoization.
If I have a set of values (which I'll call x), and a number of subsets of x:
What is the best way to work out all possible combinations of subsets whose union is equal to x, but none of which intersect with each other?
An example might be:
if x is the set of the numbers 1 to 100, and I have four subsets:
a = 0-49
b = 50-100
c = 50-75
d = 76-100
then the possible combinations would be:
a + b
a + c + d
What you describe is called the Exact cover problem. The general solution is Knuth's Algorithm X, with the Dancing Links algorithm being a concrete implementation.
Given a well-order on the elements of x (make one up if necessary, this is always possible for finite or countable sets):
Let "sets chosen so far" be empty. Consider the smallest element of x. Find all sets which contain that element and which do not intersect with any of the sets chosen so far. For each such set in turn recurse, adding the chosen set to "sets chosen so far", and looking at the smallest element of x not in any chosen set. If you reach a point where there is no element of x left, then you've found a solution. If you reach a point where there is no unchosen set containing the element you're looking for, and which does not intersect with any of the sets that you already have selected, then you've failed to find a solution, so backtrack.
This uses stack proportional to the number of non-intersecting subsets, so watch out for that. It also uses a lot of time - you can be far more efficient if, as in your example, the subsets are all contiguous ranges.
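The backtracking search described above might look like this in Python. It is a sketch, not an optimized Algorithm X; the function name and the dict-of-frozensets input format are my own choices. Because the question's ranges start at 0 while x is described as 1 to 100, the example assumes x is 0 to 100 inclusive.

```python
def exact_covers(universe, subsets):
    """Yield every list of subset names whose members partition `universe`.

    subsets: dict mapping a name to a frozenset of elements.
    """
    def search(remaining, chosen):
        if not remaining:
            yield list(chosen)            # nothing left to cover: a solution
            return
        smallest = min(remaining)         # always branch on the smallest uncovered element
        for name, members in subsets.items():
            # candidate must cover `smallest` and touch only uncovered elements,
            # which also guarantees it is disjoint from every set chosen so far
            if smallest in members and members <= remaining:
                chosen.append(name)
                yield from search(remaining - members, chosen)
                chosen.pop()              # backtrack

    yield from search(frozenset(universe), [])

subsets = {
    "a": frozenset(range(0, 50)),
    "b": frozenset(range(50, 101)),
    "c": frozenset(range(50, 76)),
    "d": frozenset(range(76, 101)),
}
for cover in exact_covers(range(0, 101), subsets):
    print(" + ".join(cover))
```

Branching only on the smallest uncovered element means each valid combination is produced exactly once, regardless of the order in which the subsets are listed.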
Here's a bad way (recursive, does a lot of redundant work). But at least it's actual code and is probably halfway to the "efficient" solution.
def unique_sets(sets, target):
    if not sets and not target:
        yield []
    for i, s in enumerate(sets):
        # s is usable if it overlaps the target and lies entirely inside it
        intersect = s.intersection(target) and not s.difference(target)
        sets_without_s = sets[:i] + sets[i+1:]
        if intersect:
            for us in unique_sets(sets_without_s, target.difference(s)):
                yield us + [s]
        else:
            for us in unique_sets(sets_without_s, target):
                yield us

class named_set(set):
    def __init__(self, items, name):
        set.__init__(self, items)
        self.name = name

    def __repr__(self):
        return self.name

a = named_set(range(0, 50), name='a')
b = named_set(range(50, 100), name='b')
c = named_set(range(50, 75), name='c')
d = named_set(range(75, 100), name='d')

for s in unique_sets([a,b,c,d], set(range(0, 100))):
    print(s)
A way (may not be the best way) is:
1. Create a set of all the pairs of subsets which overlap.
2. For every combination of the original subsets, say "false" if the combination contains one or more of the pairs listed in Step 1, else say "true" if the union of the subsets equals x (e.g. if the total number of elements in the subsets equals the number of elements in x).
The actual algorithm seems largely dependent on the choice of subsets, product operation, and equate operation. For addition (+), it seems like you could find a summation to suit your needs (the sum of 1 to 100 is similar to your a + b example). If you can do this, your algorithm is obviously O(1).
If you have a tougher product or equate operator (let's say taking a product of two terms means summing the strings and finding the SHA-1 hash), you may be stuck doing nested loops, which would be O(n^x) where x is the number of terms/variables.
Depending on the subsets you have to work with, it might be advantageous to use a more naive algorithm. One where you don't have to compare the entire subset, but only upper and lower bounds.
If you are talking random subsets, not necessarily a range, then Nick Johnson's suggestion will probably be the best choice.