Pairwise Jaccard similarity using the MinHash algorithm - performance

I am working with 200k sentences and I want to compute their pairwise Jaccard similarity using the MinHash algorithm, but it becomes really slow because of the two for loops. Could someone suggest a better implementation?
Below is my current code:
from datasketch.minhash import MinHash

def eg1(data1, data2):
    m1 = MinHash()
    m2 = MinHash()
    for d in data1:
        m1.update(d.encode('utf8'))
    for d in data2:
        m2.update(d.encode('utf8'))
    return m1.jaccard(m2)
jac_sim = []
for i_doc in range(len(shingles) - 1):
    for j_doc in range(i_doc + 1, len(shingles)):
        jaccard_similarity = eg1(shingles[i_doc], shingles[j_doc])
        jac_sim.append(jaccard_similarity)

The problem is that the MinHash signature is recalculated many times for the same input. By calculating each MinHash signature only once you should be able to save a lot of time:
signatures = []
for i_doc in range(len(shingles)):
    m = MinHash()
    for d in shingles[i_doc]:
        m.update(d.encode('utf8'))
    signatures.append(m)

jac_sim = []
for i_doc in range(len(shingles) - 1):
    for j_doc in range(i_doc + 1, len(shingles)):
        jaccard_similarity = signatures[i_doc].jaccard(signatures[j_doc])
        jac_sim.append(jaccard_similarity)
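Note that with 200k documents the pairwise loop still performs roughly 2*10^10 comparisons. If you only need the pairs above some similarity threshold rather than the full matrix, datasketch also provides a MinHashLSH index that avoids the quadratic loop. A minimal sketch building on the signatures list above (the 0.5 threshold and the "doc_i" keys are just placeholders):

from datasketch import MinHashLSH

# Index the precomputed MinHash signatures (num_perm must match MinHash()).
lsh = MinHashLSH(threshold=0.5, num_perm=128)
for i_doc, sig in enumerate(signatures):
    lsh.insert("doc_{}".format(i_doc), sig)

# Query each signature for likely-similar documents and estimate Jaccard
# only for those candidate pairs.
for i_doc, sig in enumerate(signatures):
    for key in lsh.query(sig):
        j_doc = int(key.split("_")[1])
        if j_doc > i_doc:
            print(i_doc, j_doc, sig.jaccard(signatures[j_doc]))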

Related

Finding duplicate 2D arrays in Matlab 3D array

worldStates is a Matlab MxNxL 3D array (tensor) containing L states of an MxN grid of binary values.
ps is a length L list of probabilities associated with the different states.
The function [worldStates, ps] = StateMerge(worldStates, ps) should remove duplicate world states and sum the probabilities of the merged states to the single state that remains. Duplicate states are states with the exact same configuration of binary values.
Here is the current implementation of this function:
function [worldStates, ps] = StateMerge(worldStates, ps)
    M = containers.Map;
    for i = 1:length(ps)
        s = worldStates(:,:,i);
        s = mat2str(s);
        if isKey(M, s)
            M(s) = M(s) + ps(i);
        else
            M(s) = ps(i);
        end
    end

    stringStates = keys(M);
    n = length(stringStates);
    sz = size(worldStates);
    worldStates = zeros([sz(1:2), n]);
    ps = zeros(1, 1, n);
    for i = 1:n
        worldStates(:,:,i) = eval(stringStates{i});
        ps(i) = M(stringStates{i});
    end
end
It uses a Map to be able to remove duplicates in O(L) time, using the states as keys and the probabilities as values. Since Matlab maps do not allow general data structures as keys, the states are converted into string representations to be used as keys and later converted back to arrays using the eval function.
It turns out this code is way too slow for my needs, as I will want to process many states (magnitude ~10^6) many times (~10^3). The problem lies in converting the matrix to a string, which takes a substantial amount of time and scales poorly with state size. A runnable example with small 25x25 states is given in the edit below.
How could I create keys in a more efficient manner? Is there another solution, aside from using a map, that would yield better results?
EDIT: Runnable code as requested. This example makes merges very unlikely:
worldStates = double(rand(25,25, 1000) > 0.5);
weights = rand(1,1, 1000);
ps = weights./sum(weights);
[worldStates, ps] = StateMerge(worldStates, ps);
In this example there will be lots of merges:
worldStates = double(rand(25,25) > 0.5) .* ones(1,1,1000);
worldStates(1:2,1:2,:) = rand(2,2,1000) > 0.5;
weights = rand(1,1, 1000);
ps = weights./sum(weights);
[worldStates, ps] = StateMerge(worldStates, ps);
Use unique to extract the unique (merged) states and accumarray to sum the probabilities of the merged states. Note that this solution, like yours, doesn't preserve the order of the original states. As suggested by @Wolfie in the comments, you can use unique with the 'stable' option to preserve the order of the states:
function [worldStates, ps] = StateMerge(worldStates, ps)
    [M, N, L] = size(worldStates);
    worldStates1 = reshape(worldStates, M*N, L).';
    [~, uc, ui] = unique(worldStates1, 'rows');
    ps = accumarray(ui, ps(:));
    worldStates = worldStates(:, :, uc);
end

Find most unique words, penalizing words in common

Suppose I have n classes like:
A: this,is,a,test,of,the,salmon,system
B: i,like,to,test,the,flounder,system
C: to,test,a,salmon,is,like,to,test,the,iodine,system
I want to get the most unique words for each class, so something with a ranking that gives me
A: salmon
B: flounder
C: iodine, salmon
(as their first elements; it can be a ranking of all words)
How do I do this? There will be hundreds of input classes each with tens of thousands of tokens.
I'm guessing this is essentially the sort of thing any search engine back-end does, but I'd like a fairly simple standalone thing.
Using a language like Python, you can write this efficiently in about 8 lines. For hundreds of groups, each with tens of thousands of tokens, the running time should be at most a few minutes (although I haven't tried this on actual input).
1. Create a hash-based dictionary mapping each word to the number of its occurrences.
2. Iterate over all groups, and over all words in a group, and update this dictionary.
3. For each group,
   a. If you need a total ranking, sort with the value in the dictionary as the criterion.
   b. If you need the top k, use an order-statistics type of algorithm, again using the value in the dictionary as the criterion.
Steps 1 + 2 should have expected linear complexity in the total number of words.
Step 3 is n log(n) per group for a total ranking, and linear in the total number of words otherwise.
Here is the Python code for the top k. Assume all_groups is a list of lists of strings, and that k = 10.
from collections import Counter
import heapq
import operator

k = 10  # as assumed above

c = Counter()
for g in all_groups:
    c.update(g)

for g in all_groups:
    print(heapq.nsmallest(k, [(w, c[w]) for w in g], key=operator.itemgetter(1)))
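For example, all_groups could be built from the three classes in the question (with lists this short, nsmallest(10, ...) simply returns the full per-class ranking, rarest words first):

all_groups = [
    "this,is,a,test,of,the,salmon,system".split(","),
    "i,like,to,test,the,flounder,system".split(","),
    "to,test,a,salmon,is,like,to,test,the,iodine,system".split(","),
]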
From what I understand of your question, this solution finds the least-used words per class, compared against all the other classes.
var a = "this,is,a,test,of,the,salmon,system".split(","),
b = "i,like,to,test,the,flounder,system".split(","),
c = "to,test,a,salmon,is,like,to,test,the,iodine,system".split(","),
map = {},
min,
key,
parse = function(stringArr) {
var length = stringArr.length,
i,count;
for (i = 0; i< length; i++) {
if (count = map[stringArr[i]]) {
map[stringArr[i]] = count + 1;
}
else {
map[stringArr[i]] = 1;
}
}
},
get = function(stringArr) {
min = Infinity;
stringArr.forEach((item)=>{
if (map[item] < min) {
min = map[item];
key = item
}
});
console.log(key);
};
parse(a);
parse(b);
parse(c);
get(a);
get(b);
get(c);
Ignore the classes, go through all the words and make a frequency table.
Then, for each class select the word with the lowest frequency.
Example in Python (slightly unpythonic solution to maintain readability for non-Python users):
a = "this,is,a,test,of,the,salmon,system".split(",")
b = "i,like,to,test,the,flounder,system".split(",")
c = "to,test,a,salmon,is,like,to,test,the,iodine,system".split(",")
freq = {}
for word in a + b + c:
    freq[word] = (freq[word] if word in freq else 0) + 1
print("a: ", min(a, key=lambda w: freq[w]))
print("b: ", min(b, key=lambda w: freq[w]))
print("c: ", min(c, key=lambda w: freq[w]))

Dirichlet process in PyMC 3

I would like to implement the Dirichlet process example referenced in
Implementing Dirichlet processes for Bayesian semi-parametric models (source: here) in PyMC 3.
In the example the stick-breaking probabilities are computed using the pymc.deterministic
decorator:
v = pymc.Beta('v', alpha=1, beta=alpha, size=N_dp)

@pymc.deterministic
def p(v=v):
    """ Calculate Dirichlet probabilities """
    # Probabilities from betas
    value = [u * np.prod(1 - v[:i]) for i, u in enumerate(v)]
    # Enforce sum to unity constraint
    value[-1] = 1 - sum(value[:-1])
    return value

z = pymc.Categorical('z', p, size=len(set(counties)))
How would you implement this in PyMC 3, which uses Theano for the gradient computation?
EDIT:
I tried the following solution using the theano.scan method:
with pm.Model() as mod:
    conc = Uniform('concentration', lower=0.5, upper=10)
    v = Beta('v', alpha=1, beta=conc, shape=n_dp)
    p, updates = theano.scan(fn=lambda stick, idx: stick * t.prod(1 - v[:idx]),
                             outputs_info=None,
                             sequences=[v, t.arange(n_dp)])
    t.set_subtensor(p[-1], 1 - t.sum(p[:-1]))
    category = Categorical('category', p, shape=n_algs)
    sd = Uniform('precs', lower=0, upper=20, shape=n_dp)
    means = Normal('means', mu=0, sd=100, shape=n_dp)
    points = Normal('obs',
                    means[category],
                    sd=sd[category],
                    observed=data)
    step1 = pm.Slice([conc, v, sd, means])
    step3 = pm.ElemwiseCategoricalStep(var=category, values=range(n_dp))
    trace = pm.sample(2000, step=[step1, step3], progressbar=True)
Sadly, this is really slow and does not recover the original parameters of the synthetic data.
Is there a better solution, and is this even correct?
Not sure I have a good answer, but perhaps this could be sped up by instead using a Theano blackbox op, which allows you to write a distribution (or deterministic) in Python code. E.g.: https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/disaster_model_arbitrary_deterministic.py
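As an aside (this is not part of the original answer): the stick-breaking weights can also be written without theano.scan by using a cumulative product over the Theano tensor, which typically compiles and samples much faster. A minimal sketch, assuming PyMC3 with the Theano backend, a truncation level n_dp, and placeholder data; all names here are illustrative:

import numpy as np
import pymc3 as pm
import theano.tensor as tt

def stick_breaking(v):
    # w_k = v_k * prod_{j<k} (1 - v_j), vectorized with a cumulative product
    # instead of theano.scan.
    remaining = tt.concatenate([[1], tt.extra_ops.cumprod(1 - v)[:-1]])
    return v * remaining

n_dp = 20                    # truncation level (assumption for this sketch)
data = np.random.randn(100)  # placeholder for the synthetic data

with pm.Model() as model:
    conc = pm.Uniform('concentration', lower=0.5, upper=10)
    v = pm.Beta('v', alpha=1, beta=conc, shape=n_dp)
    w = pm.Deterministic('w', stick_breaking(v))  # sums to ~1 for large n_dp

    means = pm.Normal('means', mu=0, sd=100, shape=n_dp)
    sd = pm.Uniform('sd', lower=0, upper=20, shape=n_dp)

    # Per-observation cluster assignments, as in the question's model.
    category = pm.Categorical('category', p=w, shape=len(data))
    obs = pm.Normal('obs', mu=means[category], sd=sd[category], observed=data)

    trace = pm.sample(2000)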

Speeding up a nested for loop

I've been working on speeding up the following function, but with no results:
function beta = beta_c(k,c,gamma)
    beta = zeros(size(k));
    E = @(x) (1.453*x.^4)./((1 + x.^2).^(17/6));
    for ii = 1:size(k,1)
        for jj = 1:size(k,2)
            E_int = integral(E,k(ii,jj),10000);
            beta(ii,jj) = c*gamma/(k(ii,jj)*sqrt(E_int));
        end
    end
end
Up to now, I solved it this way:
function beta = beta_calc(k,c,gamma)
    k_1d = reshape(k,[1,numel(k)]);
    E_1d = @(k) 1.453.*k.^4./((1 + k.^2).^(17/6));
    E_int = zeros(1,numel(k_1d));
    parfor ii = 1:numel(k_1d)
        E_int(ii) = quad(E_1d,k_1d(ii),10000);
    end
    beta_1d = c*gamma./(k_1d.*sqrt(E_int));
    beta = reshape(beta_1d,[size(k,1),size(k,2)]);
end
It doesn't seem to really improve performance. What do you think about this?
Would you mind shedding some light on it?
Thanks in advance.
EDIT
I am going to introduce some theoretical background to my question.
Generally, beta is to be calculated as beta = c*gamma./(k.*sqrt(E_int)), where E_int is the integral of E(x) = 1.453*x.^4./((1 + x.^2).^(17/6)) from k to infinity (the upper limit is approximated by 10000 in the code above).
Therefore, in the reduced case of a unidimensional k array, E_int may be calculated as
E = 1.453.*k.^4./((1 + k.^2).^(17/6));
E_int = 1.5 - cumtrapz(k,E);
or, alternatively as
E_int(1) = 1.5;
for jj = 2:numel(k)
    E = @(k) 1.453.*k.^4./((1 + k.^2).^(17/6));
    E_int(jj) = E_int(jj - 1) - integral(E,k(jj-1),k(jj));
end
Nonetheless, k is currently a matrix k(size1,size2).
Here's another approach: parallelize, because it's easy using spmd or parfor. Instead of integral, consider quad; see this link for examples...
I like this question.
The problem: the function integral takes only scalars as integration limits. Hence, it is difficult to vectorize the computation of E_int.
A clue: there seems to be a lot of redundancy in integrating the same function over and over from k(ii,jj) to infinity...
Proposed solution: how about sorting the values of k from smallest to largest and integrating each short segment only once, E_sort_int(si) = integral( E, sortedK(si), sortedK(si+1) );, with sortedK( numel(k) + 1 ) = 10000;? Then the full value of E_int for sortedK(si) is the sum of E_sort_int(si:end), i.e. a reverse cumulative sum of E_sort_int (you only need to "undo" the sorting and reshape it back to the size of k); see the sketch below.
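The question is MATLAB, but here is a quick sketch of that sorting idea in Python/NumPy (the function name and the use of scipy's quad are my own choices, not from the original code):

import numpy as np
from scipy.integrate import quad

def beta_sorted(k, c, gamma, upper=10000.0):
    # Same E(x) as in the question.
    E = lambda x: 1.453 * x**4 / (1 + x**2)**(17 / 6)

    flat = k.ravel()
    order = np.argsort(flat)              # sort the lower limits once
    edges = np.append(flat[order], upper)

    # Integrate each short segment [edges[i], edges[i+1]] exactly once.
    segments = np.array([quad(E, a, b)[0] for a, b in zip(edges[:-1], edges[1:])])

    # Integral from the i-th sorted value up to `upper` = reverse cumulative sum.
    tails = np.cumsum(segments[::-1])[::-1]

    # Undo the sorting and reshape back to the shape of k.
    E_int = np.empty_like(flat)
    E_int[order] = tails
    return (c * gamma / (flat * np.sqrt(E_int))).reshape(k.shape)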

Implementation of locality-sensitive hashing with min-hash

I have read a lot of tutorials, documents, and pieces of code implementing LSH (locality-sensitive hashing) with min-hash.
LSH tries to find the Jaccard coefficient of two sets by hashing random subsets and aggregating over those. I have looked at implementations on code.google.com but was not able to understand their method either. I understand the paper Google news personalization: scalable online collaborative filtering, but I fail to understand any of the implementations out there.
Can someone please explain to me, in simple words, how to implement LSH with MinHash?
You want to implement the min-hash algorithm, not LSH per se: min-hashing is an LSH technique. LSH in general does not approximate the Jaccard coefficient; the particular method of min-hashing does.
An introduction is given in Mining of Massive Datasets, Chapter 3 by Anand Rajaraman and Jeff Ullman.
The min-hash representation of a set is an efficient means of estimating the Jaccard similarity, given as the relative number of shared hashes between the two min hash sets:
import random

def minhash():
    d1 = set(random.randint(0, 2000) for _ in range(1000))
    d2 = set(random.randint(0, 2000) for _ in range(1000))
    jacc_sim = len(d1.intersection(d2)) / len(d1.union(d2))
    print("jaccard similarity: {}".format(jacc_sim))

    N_HASHES = 200
    hash_funcs = []
    for i in range(N_HASHES):
        hash_funcs.append(universal_hashing())

    m1 = [min([h(e) for e in d1]) for h in hash_funcs]
    m2 = [min([h(e) for e in d2]) for h in hash_funcs]
    minhash_sim = sum(int(m1[i] == m2[i]) for i in range(N_HASHES)) / N_HASHES
    print("min-hash similarity: {}".format(minhash_sim))

def universal_hashing():
    def rand_prime():
        while True:
            # odd candidates only, so the odd-divisor test below is a valid primality check
            p = random.randrange(2 ** 32 + 1, 2 ** 34, 2)
            if all(p % n != 0 for n in range(3, int((p ** 0.5) + 1), 2)):
                return p
    m = 2 ** 32 - 1
    p = rand_prime()
    a = random.randint(0, p)
    if a % 2 == 0:
        a += 1
    b = random.randint(0, p)
    def h(x):
        return ((a * x + b) % p) % m
    return h

if __name__ == "__main__":
    minhash()
