Grouping similar sets algorithm

I have a search engine. The search engine generates results when a keyword is searched for. What I need is to find all other keywords which generate similar results.
For example, keyword k1 gives result set R1 = {1, 2, 3, 4, 5, ..., 40}, which contains up to 40 document ids, and I need to get a list K1 of all other keywords which generate results similar to what k1 generates.
The similarity S(R1, R2) between two result sets R1 and R2 is computed as follows:
2 * (number of elements common to both _R1_ and _R2_) / ((total number of elements in _R1_) + (total number of elements in _R2_)). Example: R1 = {1,2,3} and R2 = {2,3,4,5} gives S(R1, R2) = (2 * |{2,3}|) / (|{1,2,3}| + |{2,3,4,5}|) = (2*2)/(3+4) = 4/7 ≈ 0.57.
There are more than 100,000 keywords and thus more than 100,000 result sets. So far I have only been able to solve this problem the hard way, in O(N^2), where each result set is compared to every other set. This takes a lot of time.
Is there someone with a better idea?
Some similar posts which do not solve the problem completely:
How to store sets, to find similar patterns fast?
efficient algorithm to compare similarity between sets of numbers?

One question: are the results in sorted order?
Something which came to mind: combine both sets, sort the result, and find the duplicates. A single pairwise comparison can then be reduced to O(n log n).
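To make that concrete, here is a minimal sketch (assuming each result set is kept as a sorted list of document ids) of computing S(R1, R2) with a single merge-style pass, so one pairwise comparison costs O(|R1| + |R2|). Note it does not reduce the number of pairs, only the cost of each comparison:

def similarity(r1, r2):
    # r1, r2: sorted lists of document ids
    i = j = common = 0
    while i < len(r1) and j < len(r2):
        if r1[i] == r2[j]:
            common += 1
            i += 1
            j += 1
        elif r1[i] < r2[j]:
            i += 1
        else:
            j += 1
    return 2.0 * common / (len(r1) + len(r2))

print(similarity([1, 2, 3], [2, 3, 4, 5]))   # 0.5714...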

To make the problem simple, suppose that every keyword has 10 results and k1 is the keyword to be compared against. Remove 9 results from the set of each keyword. Now compare the last remaining result with k1's; the keywords with the same last result are the ones you want. If a keyword has 1 result in common with k1, there is only a 1% probability that it will remain. A keyword with 5 results in common with k1 will have a 25% probability of remaining. Maybe you think 1% is still too high; then you can repeat the process above n times, and a keyword with 1 result in common will remain with probability (1%)^n.
The time is O(N).

Is your similarity criterion fixed, or can we apply a bit of variety to achieve a faster search?
Alternative:
An alternative that came to my mind:
Given your result set R1, you could go through the documents and create a histogram over the other keywords that those documents match. Then, if a given alternative keyword gets, say, at least |R1|/2 hits, you list it as "similar".
The big difference is that you do not consider documents that are not in R1 at all.
Exact?
If you need a solution exact to your requirements, I believe it would suffice to compute the R2 sets only for those keywords that satisfy the above "alternative" criterion. I think (mathematical proof needed!) that if the "alternative" criterion is not satisfied, there is no chance that yours will be.
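A minimal sketch of that histogram idea, assuming the result sets are available in a map from keyword to document ids (keyword_results, min_fraction and similar_keywords below are hypothetical names, not from the question); in practice the inverted index would be built once, not per query:

from collections import Counter, defaultdict

def similar_keywords(k1, keyword_results, min_fraction=0.5):
    # inverted index: document id -> keywords whose result set contains it
    # (build this once for all queries in a real system)
    index = defaultdict(set)
    for k, results in keyword_results.items():
        for doc in results:
            index[doc].add(k)

    r1 = keyword_results[k1]
    hits = Counter()                      # histogram over the other keywords
    for doc in r1:
        for k in index[doc]:
            if k != k1:
                hits[k] += 1

    # keep keywords that share at least min_fraction of R1's documents
    return [k for k, h in hits.items() if h >= min_fraction * len(r1)]

keyword_results = {"k1": {1, 2, 3}, "k2": {2, 3, 4, 5}, "k3": {7, 8}}
print(similar_keywords("k1", keyword_results))   # ['k2']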

Related

Algorithm to compare two arrays of elements and find the most similar pairs efficiently

I have two arrays of strings, of lengths m and n respectively, where the strings inside all have length x, and I want to find the best matching pairs, i.e. those that share the largest number of common letters:
In a simple case, just consider these two strings
Sm = [AAAA, BBBB]
Sn = [ABBA, AAAA, AAAA, CCCC]
Expected results (2 pairs matched, 2 strings left alone):
Pair 1: AAAA -> AAAA because of score 4
Pair 2: BBBB -> ABBA because of score 2
Strings in Sn that are left alone:
AAAA because the same string in Sm has been matched already
CCCC because unable to match any
Score matrix (score of each string in Sm against each string in Sn):

        ABBA  AAAA  AAAA  CCCC
AAAA      2     4     4     0
BBBB      2     0     0     0
My current method (Slow):
Get the string length x, which is the max score (the case where all letters are identical) - in this case it is 4
Brute force compare m*n times to generate the score matrix above - in this case it is 2*4 = 8 comparisons
Loop from x to 1: (In this case it is looping from 4 to 1)
Walk through the score matrix and pop the string pairs with score x
Mark remaining unpaired strings or strings with 0 score as alone
Question:
My current method is slow, at O(mn), when producing the score matrix (x will not be large, so I treat it as constant here).
Is there any algorithm that can perform better than O(mn) complexity?
Sorry, I don't yet have enough rep to simply provide a comment, but in a project I wrote a long time ago I leveraged the Levenshtein distance algorithm. Specifically, see this project for some helpful insight.
As far as I can tell you are doing the most efficient thing. To be completely thorough you need to compare every string in Sm with every string in Sn, so at best the algorithm will be O(mn). Anything less would not be comparing every element to every element.
One optimization could be to remove all duplicates first, but for the most part that would incur a performance hit that causes more harm than good in almost all circumstances.
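For reference, a rough sketch of the brute-force method described in the question, assuming the score counts positions where the two strings agree (the example scores of 4 and 2 are consistent with that reading); greedy_pairs and positional_score are hypothetical names:

def positional_score(a, b):
    # number of positions where the two equal-length strings agree
    return sum(1 for x, y in zip(a, b) if x == y)

def greedy_pairs(sm, sn):
    scores = [[positional_score(a, b) for b in sn] for a in sm]   # O(m*n)
    pairs, used_m, used_n = [], set(), set()
    x = len(sm[0])
    for target in range(x, 0, -1):                    # from best score down
        for i, row in enumerate(scores):
            for j, s in enumerate(row):
                if s == target and i not in used_m and j not in used_n:
                    pairs.append((sm[i], sn[j], s))
                    used_m.add(i)
                    used_n.add(j)
    return pairs

print(greedy_pairs(["AAAA", "BBBB"], ["ABBA", "AAAA", "AAAA", "CCCC"]))
# [('AAAA', 'AAAA', 4), ('BBBB', 'ABBA', 2)]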

Searching in vector of pairs

I have a vector of pairs (datatype double), where each pair is (a, b) with a less than b. For a number x, I want to find the number of pairs in the vector where a <= x <= b.
Consider the vector size to be about 10^6.
My Approach
Sort the vector of pairs and perform a lower_bound operation for x over "a" in each pair, then iterate from the start up to the lower-bound position and check which values of "b" satisfy x <= b.
Time Complexity
O(N log N), where N is the vector size.
Issue
I have to perform this over a large number of queries, where this approach becomes inefficient. So is there any better solution to decrease the time complexity?
Sorry for my poor English and question formatting.
In addition to the previous answer, here's a suggestion for how to prepare the ranges to optimize the subsequent lookups. The idea boils down to precomputing the result for all significantly different input values, while being smart about which values don't differ significantly.
To illustrate what I mean, let's consider this sequence of ranges:
1, 3
1, 8
2, 4
2, 6
The prepared output structure then looks like this:
1, 2 -> 2
2, 3 -> 4
3, 4 -> 3
4, 6 -> 2
6, 8 -> 1
For any number in the range 1, 2, there are two matching ranges in the initial sequence. For any number in the range 2, 3, there are four matches, etc. Note that there are five ranges here now, because some of the input ranges partially overlapped. Since for every range here the end value is also the start value of the next range, the end value can be optimized out. The result then looks like a simple map:
1 -> 2
2 -> 4
3 -> 3
4 -> 2
6 -> 1
8 -> 0
Note here that the last range doesn't have one following it, so the explicit zero becomes necessary. For values before the first key, zero is implied. In order to find the result for a value, just find the greatest key that is less than or equal to that value. This is a simple O(log n) lookup.
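A small sketch of that preparation step; it reproduces the 1 -> 2, 2 -> 4, ... map from the example above. Note that with the events placed exactly at the end values, a query equal to an end point b is counted as outside that range, so for the question's inclusive a <= x <= b you would register the -1 event just past b:

from bisect import bisect_right
from collections import defaultdict

def prepare(ranges):
    # sweep events: +1 where a range starts, -1 where it ends
    delta = defaultdict(int)
    for a, b in ranges:
        delta[a] += 1
        delta[b] -= 1
    keys = sorted(delta)
    counts, running = [], 0
    for k in keys:
        running += delta[k]
        counts.append(running)       # keys[i] covers [keys[i], keys[i+1])
    return keys, counts

def query(keys, counts, x):
    i = bisect_right(keys, x) - 1    # greatest key <= x
    return counts[i] if i >= 0 else 0

keys, counts = prepare([(1, 3), (1, 8), (2, 4), (2, 6)])
print(keys, counts)                  # [1, 2, 3, 4, 6, 8] [2, 4, 3, 2, 1, 0]
print(query(keys, counts, 2.5))      # 4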
Firstly, if you just did a simple scan over the pairs, you would have O(n) complexity! The O(n log n) comes from sorting, and for a one-off operation this is just overhead. The plain scan might even be the best way to do it if you don't reuse the results; even if you perform just a few queries, it might still be better than sorting. Make sure you allow yourself to switch out the algorithm.
Anyhow, let's consider that you need to make many queries. Then, one relatively obvious step to improve things is to not iterate step by step after sorting. Instead, you can do a binary search for the lower bound: simply partition the sequence into halves. The lower bound can be found in either half, which you can determine by looking at the middle element between the partitions. Recurse until you find the first element that cannot possibly contain the value you search for, because its start value is already greater.
Concerning the other direction, things are not that easy. Just because you sorted the ranges by the start value doesn't imply that the end values are sorted, too. Also, ranges that match and ranges that don't can be mixed in the sequence, so here you will have to perform a linear scan.
Lastly, some notes:
You could parallelize this algorithm using multithreading.
Depending on the number of searches M in your outer loop, you could also swap the outer loop with the inner one. That means that for every pair of the input vector, you check whether each of the M search values falls within the range. This might be better, in particular when the M searches fit into the CPU cache.
This is a very typical problem for segment trees, binary indexed trees, and interval trees.
You have two operations to carry out on an array arr:
1. Range update: Add(a, b): for(int i = a; i <= b; ++i) arr[i]++
2. Point query : Query(x): return arr[x]
Alternatively, you could formulate your problem slightly more cleverly:
1. Point Update: Add(a, b): arr[a]++; arr[b+1]--;
2. Range Query: Query(x): return sum(arr[0], arr[1] ..... arr[x]);
In each of the cases above, you have one O(n) operation and one O(1) operation.
For the second case, the query is essentially a prefix sum calculation. Binary Indexed Trees are especially efficient at this task.
Tutorial for Binary Indexed Trees
IMPORTANT IDEA: ARRAY COMPRESSION
You did mention that the vector size is about 10^6, so there is a chance that you may not be able to create an array that big. If you are able to create a set consisting of all the a's, b's and x's beforehand, then you can translate them into numbers from 1 to the size of that set.
SUPER CLEVER IDEA: MO's ALGORITHM
This is only allowed if you can solve the problem offline. That means you can take all the query points x as input, solve them in any order you like, store the solutions, and then print the solutions in the correct order.
Please mention if this is your situation, and only then will I elaborate further on this. But Binary Indexed Trees are going to be more efficient than Mo's algorithm.
EDIT:
Because your interval values are of type double, you must convert them to integers before you use my solution. Let me give an example:
Intervals = (1.1 to 1.9), (1.4 to 2.1)
Query Points = 1.5, 2.0
Here, the points of interest are not all possible doubles, but just the numbers above = {1.1, 1.4, 1.5, 1.9, 2.0, 2.1}
If we map them into positive integers:
1.1 --> 1
1.4 --> 2
1.5 --> 3
1.9 --> 4
2.0 --> 5
2.1 --> 6
Then you could use segment trees/binary indexed trees.
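A small sketch of that pipeline using the second (point update, range query) formulation, assuming all interval end points and query points are known up front so they can be compressed; bit_update and bit_prefix_sum are a plain binary indexed tree over the compressed indices:

def bit_update(bit, i, delta):
    while i < len(bit):
        bit[i] += delta
        i += i & (-i)

def bit_prefix_sum(bit, i):
    s = 0
    while i > 0:
        s += bit[i]
        i -= i & (-i)
    return s

intervals = [(1.1, 1.9), (1.4, 2.1)]
queries = [1.5, 2.0]

# coordinate compression: every interesting double -> 1 .. size of set
points = sorted({v for a, b in intervals for v in (a, b)} | set(queries))
rank = {v: i + 1 for i, v in enumerate(points)}

bit = [0] * (len(points) + 2)
for a, b in intervals:
    bit_update(bit, rank[a], 1)          # arr[a]++
    bit_update(bit, rank[b] + 1, -1)     # arr[b+1]--

for x in queries:
    print(x, bit_prefix_sum(bit, rank[x]))   # number of intervals containing x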
For each pair (a, b) you can decompose it into a +1 event at a and a -1 event just after b, so that a running sum gives the number of ranges valid for a particular value. Then it becomes a simple O(log n) lookup to see how many ranges encompass the search value.

Efficiently search for pairs of numbers in various rows

Imagine you have N distinct people and records of where these people are, exactly M such records to be precise.
For example
1,50,299
1,2,3,4,5,50,287
1,50,299
So you can see that 'person 1' is at the same place as 'person 50' three times. Here M = 3, obviously, since there are only 3 lines. My question is: given M of these lines and a threshold value (i.e. person A and B have been at the same place more than threshold times), what do you suggest as the most efficient way of returning these co-occurrences?
So far I've built an N by N table and looped through each row, incrementing table(i, j) every time person i co-occurs with person j in a row. Obviously this is an awful approach and takes O(n^2) to O(n^3) depending on how you implement it. Any tips would be appreciated!
There is no need to create the table. Just create a hash/dictionary/whatever your language calls it. Then in pseudocode:
from collections import Counter
from itertools import combinations

# `sets` is the list of records (each a set of people); `threshold` is given
count = Counter()
answer = []
for S in sets:
    for i, j in combinations(sorted(S), 2):   # every unordered pair in the record
        count[(i, j)] += 1
        if count[(i, j)] == threshold:
            answer.append((i, j))
If you have M sets, each of size K, the running time will be O(M*K^2).
If you want you can actually keep the list of intersecting sets in a data structure parallel to count without changing the big-O.
Furthermore, the same algorithm can readily be implemented in a distributed way using map-reduce. For the count you just have to emit a key of (i, j) and a value of 1; in the reduce you count them. Actually generating the list of sets is similar.
The known concept for your case is Market Basket Analysis. In this context there are different algorithms; for example, the Apriori algorithm can be used in your case, specialized to item sets of size 2.
Moreover, LSH and MinHash are also used in these cases to find association rules with specific supports and conditions (which in your case is the threshold value).
You could use probability to speed it up, e.g. only check each pair with 1/50 probability. That will give you a 50x speed-up. Then double-check any pairs whose sampled count gets close enough to 1/50th of M.
To double-check any pairs, you can either go through the whole list again, or you can double-check more efficiently if you do some clever kind of reverse indexing as you go. E.g. encode each person's row indices into 64-bit integers; you could use binary search / merge-sort-type techniques to decide which 64-bit integers to compare, and use bit operations to compare 64-bit integers for matches. Other things to look up could be reverse indexing and binary indexed range trees / Fenwick trees.

Where to start: what set of N letters makes the most words?

I'm having trouble coming up with a non-brute force approach to solve this problem I've been wondering: what set of N letters can be used to make the most words from a given dictionary? Letters can be used any number of times.
For example, for N=3, we can have EST to give words like TEST and SEE, etc...
Searching online, I found some answers (such as listed above for EST), but no description of the approach.
My question is: what well-known problems are similar to this, or what principles should I use to tackle this problem?
NOTE: I know it's not necessarily true that if EST is the best for N=3, then ESTx is the best for N=4. That is to say, you can't just append a letter to the previous solution.
In case you're wondering, this question came to mind because I was wondering what set of 4 ingredients could make the most cocktails, and I started searching for that. Then I realized my question was specific, and so I figured this letter question is the same type of problem, and started searching for it as well.
For each word in the dictionary, sort its letters and remove duplicates. Let this be the skeleton of the word. For each skeleton, count how many words contain it. Let this be its frequency. Ignore all skeletons whose size is greater than N.
Let a subskeleton be the result of any possible removal of 1 or more letters from the skeleton, i.e. EST has subskeletons E, S, T, ES, ET, ST. For each skeleton of size N, add up the count of this skeleton and of all its subskeletons. Select the skeleton with the maximal sum.
You need O(2**N * D) operations, where D is the size of the dictionary.
Correction: we need to take into account all skeletons of size up to N (not only those of actual words), and the number of operations will be O(2**N * C(L, N)), where L is the number of letters (26 in English).
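A minimal sketch of this counting scheme, restricted (as before the correction) to candidate skeletons that actually occur among the words; the tiny word list is only for illustration:

from collections import Counter
from itertools import combinations

N = 3
words = ["test", "see", "set", "tee", "beet"]         # hypothetical tiny dictionary

skeleton = lambda w: "".join(sorted(set(w)))          # sorted unique letters
freq = Counter(skeleton(w) for w in words if len(set(w)) <= N)

def coverage(candidate):
    # sum the counts of the skeleton itself and of all its subskeletons
    total = 0
    for k in range(1, len(candidate) + 1):
        for sub in combinations(candidate, k):
            total += freq["".join(sub)]
    return total

best = max((s for s in freq if len(s) == N), key=coverage)
print(best, coverage(best))                           # est 4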
So I coded up a solution to this problem that uses a hash table to get things done. I had to deal with a few problems along the way too!
Let N be the size of the group of letters you are looking for that can make the most words. Let L be the length of the dictionary.
Convert each word in the dictionary into a set of letters: 'test' -> {'e','s','t'}
For each number 1 to N inclusive, create a cut list that contains the words you can make with exactly that many letters.
Make a hash table for each number 1 to N inclusive, then go through the corresponding cut list and use the set as a key, and increment by 1 for each member of the cut list.
This was the part that gave me trouble! Create a set out of your cut list (unique_cut_list) for N. This is essentially all the populated key-value pairs for the hash table for N.
For each set in unique_cut_list, generate all subsets, and check the corresponding hash table (the size of the subset) to see if there is a value. If there is, add that value to the hash table for N with the key of the original set.
Finally, go through the hash table and find the max value. The corresponding key is the group of letters you're after.
You go through the dictionary 1+2N times for steps 1-5; step 6 goes through a version of the dictionary and checks (2^N)-1 subsets each time (ignoring the null set). That gives O(2NL + L*2^N), which should approach O(L*2^N). Not bad, since N will not be too big in most applications!

Scrabble solving with maximum score

I was asked a question
You are given a list of characters, a score associated with each character, and a dictionary of valid words (say a normal English dictionary). You have to form a word out of the character list such that the score is maximum and the word is valid.
I could think of a solution involving a trie made out of the dictionary and backtracking with the available characters, but could not formulate it properly. Does anyone know the correct approach, or can you come up with one?
First iterate over your letters and count how many times you have each character of the English alphabet. Store this in a static array of size 26, where the first cell corresponds to a, the second to b, and so on. Name this original array cnt. Now iterate over all words, and for each word form a similar array of size 26. For each of the cells in this array, check whether you have at least as many occurrences in cnt. If that is the case you can write the word, otherwise you can't. If you can write the word, you compute its score and maximize the score in a helper variable.
This approach has linear complexity, which is also the best asymptotic complexity you can possibly achieve (after all, the input you're given is of linear size).
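A minimal sketch of that counting approach, with a hypothetical letter rack, word list and per-letter scores; Counter plays the role of the size-26 array:

from collections import Counter

def best_word(letters, words, score):
    cnt = Counter(letters)                    # how many of each tile we hold
    best, best_score = None, -1
    for w in words:
        need = Counter(w)
        if all(cnt[c] >= k for c, k in need.items()):   # can we spell w?
            s = sum(score[c] for c in w)
            if s > best_score:
                best, best_score = w, s
    return best

score = {"a": 1, "b": 3, "e": 1, "l": 1, "s": 1, "t": 1, "u": 1}
print(best_word("aetbslu", ["tea", "seat", "able", "tables"], score))   # tables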
Inspired by Programmer Person's answer (initially I thought that approach was O(n!), so I discarded it). It needs O(number of words) setup and then O(2^(chars in query)) for each query. This is exponential, but in Scrabble you only have 7 letter tiles at a time, so you need to check only 128 possibilities!
The first observation is that the order of characters in the query or the word doesn't matter, so you want to process your list into a set of bags of chars. A way to do that is to 'sort' the word, so "bac" and "cab" both become "abc".
Now you take your query and iterate over all possible answers: all variants of keep/discard for each letter. It's easier to see in binary form: 1111 to keep all letters, 1110 to discard the last letter, and so on.
Then check whether each possibility exists in your dictionary (a hash map for simplicity), and return the one with the maximum score.
import nltk
from string import ascii_lowercase
from itertools import product

scores = {c: s for s, c in enumerate(ascii_lowercase)}
sanitize = lambda w: "".join(c for c in w.lower() if c in scores)
anagram = lambda w: "".join(sorted(w))
anagrams = {anagram(sanitize(w)): w for w in nltk.corpus.words.words()}

while True:
    query = input("What do you have?")
    if not query: break
    # make it look like our preprocessed word list
    query = anagram(sanitize(query))
    results = {}
    # all variants for our query
    for mask in product((True, False), repeat=len(query)):
        # get the variant given the mask
        masked = "".join(c for i, c in enumerate(query) if mask[i])
        # check if it's valid
        if masked in anagrams:
            # score it, also getting the word back would be nice
            results[sum(scores[c] for c in masked)] = anagrams[masked]
    # guard against queries that yield no valid word
    if results:
        print(*max(results.items()))
Build a lookup trie of just the sorted-anagram of each word of the dictionary. This is a one time cost.
By sorted anagram I mean: if the word is eat, you represent it as aet; if the word is tea, you also represent it as aet; bubble is represented as bbbelu, etc.
Since this is Scrabble, assuming you have 8 tiles (say you want to use one from the board), you will need to check at most 2^8 possibilities.
For any subset of the tiles from the set of 8, you sort the tiles and look them up in the anagram trie.
There are at most 2^8 such subsets, and this could potentially be optimized (in case of repeating tiles) by doing a more clever subset generation.
If this is a more general problem, where 2^{number of tiles} could be much higher than the total number of anagram-words in the dictionary, it might be better to use frequency counts as in Ivaylo's answer, and the lookups potentially can be optimized using multi-dimensional range queries. (In this case 26 dimensions!)
Sorry, this might not help you as-is (I presume you are trying to do some exercise and have constraints), but I hope this will help the future readers who don't have those constraints.
If the number of dictionary entries is relatively small (up to a few million) you can use brute force: preprocess the data by creating a 32-bit mask for each word. Set one bit if the letter a/b/c/.../z is used. For the six most common English characters, e t a o i n, set another bit if the letter is used twice.
Create a similar bitmap for the letters that you have. Then scan the dictionary for words where all bits that are needed for the word are set in the bitmap of the available letters. You have reduced the problem to words where you have all needed characters once, and the six most common characters twice if they are needed twice. You'll still have to check whether a word can actually be formed, in case you have a word like "bubble" and the first test only tells you that you have the letters b, u, l, e but not necessarily 3 b's.
By also sorting the list of words by point values before doing the check, the first hit is the best one. This has another advantage: You can count the points that you have, and don't bother checking words with more points. For example, bubble has 12 points. If you have only 11 points, then there is no need to check this word at all (have a small table with the indexes of the first word with any given number of points).
To improve anagram handling: in the table, only store distinct bitmasks for each number of points (so we would have entries for bubble and blue because they have different point values, but not separate entries for team and mate). Then store all the possible words, possibly more than one, for each bit mask and check them all. This should reduce the number of bit masks to check.
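A small sketch of the bitmask prefilter (26 basic bits plus extra bits for doubled e, t, a, o, i, n), followed by the exact multiset check mentioned above; mask and can_form are hypothetical names:

from collections import Counter
from string import ascii_lowercase

COMMON = "etaoin"          # these letters get an extra bit when used twice

def mask(word):
    m, cnt = 0, Counter(word)
    for c in set(word):
        m |= 1 << ascii_lowercase.index(c)
        if c in COMMON and cnt[c] >= 2:
            m |= 1 << (26 + COMMON.index(c))
    return m

def can_form(word, letters):
    if mask(word) & ~mask(letters):           # some required bit is missing
        return False
    # cheap mask test passed; confirm with exact counts (handles 3 b's etc.)
    need, have = Counter(word), Counter(letters)
    return all(have[c] >= k for c, k in need.items())

print(can_form("bubble", "bbbule"))   # True
print(can_form("bubble", "buble"))    # False: the mask passes, the exact count doesn't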
Here is a brute-force approach in Python, using an English dictionary containing 58,109 words. This approach is actually quite fast, timing at about 0.3 seconds on each run.
from random import shuffle
from string import ascii_lowercase
import time

def getValue(word):
    return sum(map(lambda x: key[x], word))

if __name__ == '__main__':
    v = range(26)
    shuffle(v)
    key = dict(zip(list(ascii_lowercase), v))
    with open("/Users/james_gaddis/PycharmProjects/Unpack Sentance/hard/words.txt", 'r') as f:
        wordDict = f.read().splitlines()
    f.close()
    valued = map(lambda x: (getValue(x), x), wordDict)
    print max(valued)
Here is the dictionary I used, with one hyphenated entry removed for convenience.
Can we assume that the dictionary is fixed, the scores are fixed, and only the letters available will change (as in Scrabble)? Otherwise, I think there is nothing better than looking up each word of the dictionary as previously suggested.
So let's assume that we are in this setting. Pick an order < that respects the costs of letters, for instance Q > Z > J > X > K > ... > A > E > I > ... > U.
Replace your dictionary D with a dictionary D' made of the anagrams of the words of D, with letters ordered by the previous order (so the word buzz is mapped to zzbu, for instance), also removing duplicates and words of length > 8 if you have at most 8 letters in your game.
Then construct a trie with the words of D', where the children of a node are ordered by the value of their letters (so the first child of the root would be Q, the second Z, ..., the last one U). On each node of the trie, also store the maximal value of a word going through that node.
Given a set of available characters, you can explore the trie in a depth-first manner, going from left to right, and keeping in memory the current best value found. Only explore branches whose node value is larger than your current best value. This way, you will explore only a few branches after the first ones (for instance, if you have a Z in your game, exploring any branch that starts with a one-point letter such as A is discarded, because it would score at most 8x1, which is less than the value of Z). I bet that you will explore only a very few branches each time.
