Finding all pairs of sequences that differ at exactly one position - algorithm

I need a data structure representing a set of sequences (all of the same, known, length) with the following non-standard operation:
Find two sequences in the set that differ at exactly one index. (Or establish that no such pair exists.)
If N is the length of the sequences and M the number of sequences, there is an obvious O(N*M*M) algorithm. I wonder if there is a standard way of solving this more efficiently. I'm willing to apply pre-processing if needed.
Bonus points if instead of returning a pair, the algorithm returns all sequences that differ at the same point.
Alternatively, I am also interested in a solution where I can efficiently check whether a particular sequence differs at exactly one index from any sequence in the set. If it helps, we can assume that no two sequences in the set have that property.
Edit: you can assume N to be reasonably small. By this, I mean improvements such as O(log(N)*M*M) are not immediately useful for my use case.

For each sequence and each position i in that sequence, calculate a hash of the sequence without position i and add it to a hash table. If there is already an entry in the table, you have found a potential pair that differs only in one position. Using rolling hashes from both start and end and combining them, you can calculate each hash in constant time. The total running time is expected O(N*M).
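A minimal Python sketch of this idea, assuming the sequences are strings or tuples of hashable elements. The names `one_off_groups`, `MOD` and `BASE` are made up for illustration. Prefix and suffix polynomial hashes give the "sequence without position i" key in constant time per position, and candidates in the same bucket are verified exactly to rule out hash collisions.

```python
from collections import defaultdict

MOD = (1 << 61) - 1   # large prime modulus (an assumed choice)
BASE = 1_000_003      # arbitrary polynomial base (an assumed choice)

def one_off_groups(seqs):
    """Return (position, sequences) groups that agree everywhere except at position."""
    buckets = defaultdict(list)
    for s in seqs:
        n = len(s)
        # prefix[i] hashes s[:i]; suffix[i] hashes s[i:]
        prefix = [0] * (n + 1)
        for i in range(n):
            prefix[i + 1] = (prefix[i] * BASE + hash(s[i])) % MOD
        suffix = [0] * (n + 1)
        for i in range(n - 1, -1, -1):
            suffix[i] = (suffix[i + 1] * BASE + hash(s[i])) % MOD
        # Key for "s without position i", built in O(1) from the two tables.
        for i in range(n):
            buckets[(i, prefix[i], suffix[i + 1])].append(s)
    groups = []
    for (i, _, _), members in buckets.items():
        if len(members) > 1:
            # Exact check: hash collisions must not produce false pairs.
            exact = defaultdict(set)
            for s in members:
                exact[(s[:i], s[i + 1:])].add(s)
            groups.extend((i, sorted(g)) for g in exact.values() if len(g) > 1)
    return groups
```

Each group lists every sequence that differs from the others in the group only at the reported position, which also covers the "bonus" part of the question.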

Select j sets of k indexes each randomly (make sure none of the sets overlap).
For each set XOR the elements.
You now have j fingerprints for each document.
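Compare sequences based on these fingerprints. If two sequences differ at exactly one position, at least j-1 of their fingerprints must match, because the differing index falls in at most one of the sets. The converse might not be true, so you might have to check candidates location by location.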
More clarification on comparison part: Sort all fingerprints from all documents (or use hash table). In that way you don't have to compare every pair, but only the pairs that do have a matching fingerprint.

A simple recursive approach:
Find all sets of sequences that have the same first half through sort or hash.
For each of these sets, repeat the whole process now only looking at the second half.
Find all sets of sequences that have the same second half through sort or hash.
For each of these sets, repeat the whole process now only looking at the first half.
When you've reached length 1, all those that don't match are what you're looking for.
Pseudo-code:
    findPairs(set, 1, N)

    findPairs(set, start, end)
        mid = (start + end)/2
        sort set according to start and mid indices
        if end - start == 1
            last = ''
            for each seq: set
                if last != '' and seq != last
                    DONE - PAIR FOUND
                last = seq
        else
            newSet = {}
            last = ''
            for each seq: set
                if last != '' and seq and last don't match from start to mid indices
                    if newSet.length > 1
                        findPairs(newSet, mid, end)
                    newSet = {}
                newSet += seq
                last = seq
            if newSet.length > 1
                findPairs(newSet, mid, end)
            # the symmetric case (same second half, recurse on the first half) is analogous
It should be easy enough to modify the code to be able to find all pairs.
Complexity? I may be wrong but:
The max depth is log M. (I believe) the worst case would be if all sequences are identical. In this case the work done will be O(N*M*log M*log M), which is better than O(N*M*M).

Related

Efficiently search for pairs of numbers in various rows

Imagine you have N distinct people and a record of where these people are, with exactly M such records.
For example
1,50,299
1,2,3,4,5,50,287
1,50,299
So you can see that 'person 1' is at the same place with 'person 50' three times. Here M = 3 obviously since there's only 3 lines. My question is given M of these lines, and a threshold value (i.e person A and B have been at the same place more than threshold times), what do you suggest the most efficient way of returning these co-occurrences?
So far I've built an N by N table, and looped through each row, incrementing table(A,B) every time person A co-occurs with person B in a row. Obviously this is an awful approach and takes O(n^2) to O(n^3) depending on how you implement it. Any tips would be appreciated!
There is no need to create the table. Just create a hash/dictionary/whatever your language calls it. Then in pseudocode:
    answer = []
    count = {}  # maps each pair to its count, defaulting to 0
    for S in sets:
        for (i, j) in pairs from S:
            count[(i,j)]++
            if threshold == count[(i,j)]:
                answer.append((i,j))
If you have M sets of size K the running time will be O(M*K^2).
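The pseudocode above could be written in Python roughly as follows. This is a sketch: the function name `frequent_pairs` is made up, and each row is assumed to be an iterable of person ids. A pair is reported the moment its count reaches the threshold, so it appears in the answer exactly once.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(rows, threshold):
    """Return pairs of ids that co-occur in at least `threshold` rows."""
    count = Counter()
    answer = []
    for row in rows:
        # sorted(set(...)) gives each unordered pair one canonical form
        for pair in combinations(sorted(set(row)), 2):
            count[pair] += 1
            if count[pair] == threshold:
                answer.append(pair)
    return answer

rows = [[1, 50, 299], [1, 2, 3, 4, 5, 50, 287], [1, 50, 299]]
print(frequent_pairs(rows, 3))  # [(1, 50)]
```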
If you want you can actually keep the list of intersecting sets in a data structure parallel to count without changing the big-O.
Furthermore the same algorithm can be readily implemented in a distributed way using a map-reduce. For the count you just have to emit a key of (i, j) and a value of 1. In the reduce you count them. Actually generating the list of sets is similar.
The known concept for your case is market basket analysis, for which several algorithms exist. For example, the Apriori algorithm can be applied to your case, restricted to itemsets of size 2.
Moreover, association rules with a given support (which in your case is the threshold value) can also be found using LSH and min-hash.
You could use probability to speed it up, e.g. only check each pair with 1/50 probability. That will give you a 50x speed up. Then double check any pairs that make it close enough to 1/50th of M.
To double check any pairs, you can either go through the whole list again, or you could double check more efficiently if you do some clever kind of reverse indexing as you go. E.g. if you encode each person's row indices into 64 bit integers, you could use binary search / merge sort type techniques to see which 64 bit integers to compare, and use bit operations to compare 64 bit integers for matches. Other things to look up could be reverse indexing, binary indexed range trees / Fenwick trees.

Weighted unordered string edit distance

I need an efficient way of calculating the minimum edit distance between two unordered collections of symbols. Like in the Levenshtein distance, which only works for sequences, I require insertions, deletions, and substitutions with different per-symbol costs. I'm also interested in recovering the edit script.
Since what I'm trying to accomplish is very similar to calculating string edit distance, I figured it might be called unordered string edit distance or maybe just set edit distance. However, Google doesn't turn up anything with those search terms, so I'm interested to learn if the problem is known by another name?
To clarify, the problem would be solved by
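    from itertools import permutations

    def unordered_edit_distance(target, source):
        # edit_distance is an ordinary (weighted) sequence edit distance
        return min(edit_distance(target, source_perm)
                   for source_perm in permutations(source))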
So for instance, the unordered_edit_distance('abc', 'cba') would be 0, whereas edit_distance('abc', 'cba') is 2. Unfortunately, the number of permutations grows large very quickly and is not practical even for moderately sized inputs.
EDIT Make it clearer that operations are associated with different costs.
Sort them (not necessary), then remove items which are the same (and in equal numbers!) in both sets.
Then if the sets are equal in size, you need that number of substitutions; if one is greater, then you also need some insertions or deletions. Either way, the number of operations you need equals the size of the larger set remaining after the first phase.
Although your observation is more or less correct, you are actually making a simple problem more complex.
Since source can be any permutation of the original source, you first need to check the difference at the character level.
Build two maps, one for target and one for source, each counting the occurrences of every individual character in the string:
for example:
a: 2
c: 1
d: 100
Now compare the two maps: if you are missing any character, you need to insert it, and if you have an extra character, you delete it. That's it.
Let's ignore substitutions for a moment.
Now it becomes a fairly trivial problem of determining the elements only in the first set (which would count as deletions) and those only in the second set (which would count as insertions). This can easily be done by either:
Sorting the sets and iterating through both at the same time, or
Inserting each element from the first set into a hash table, then removing each element from the second set from the hash table, with each element not found being an insertion and each element remaining in the hash table after we're done being a deletion
Now, to include substitutions, all that remains is finding the optimal pairing of inserted elements to deleted elements. This is actually the stable marriage problem:
The stable marriage problem (SMP) is the problem of finding a stable matching between two sets of elements given a set of preferences for each element. A matching is a mapping from the elements of one set to the elements of the other set. A matching is stable whenever it is not the case that both:
Some given element A of the first matched set prefers some given element B of the second matched set over the element to which A is already matched, and
B also prefers A over the element to which B is already matched
Which can be solved with the Gale-Shapley algorithm:
The Gale–Shapley algorithm involves a number of "rounds" (or "iterations"). In the first round, first a) each unengaged man proposes to the woman he prefers most, and then b) each woman replies "maybe" to her suitor she most prefers and "no" to all other suitors. She is then provisionally "engaged" to the suitor she most prefers so far, and that suitor is likewise provisionally engaged to her. In each subsequent round, first a) each unengaged man proposes to the most-preferred woman to whom he has not yet proposed (regardless of whether the woman is already engaged), and then b) each woman replies "maybe" to her suitor she most prefers (whether her existing provisional partner or someone else) and rejects the rest (again, perhaps including her current provisional partner). The provisional nature of engagements preserves the right of an already-engaged woman to "trade up" (and, in the process, to "jilt" her until-then partner).
We just need to get the cost correct. To pair an insertion and deletion, making it a substitution, we'll lose both the cost of the insertion and the deletion, and gain the cost of the substitution, so the net cost of the pairing would be substitutionCost - insertionCost - deletionCost.
Now the above algorithm guarantees that all insertions and deletions get paired. We don't necessarily want this, but there's an easy fix: just create a bunch of "stay-as-is" elements (on both the insertion and deletion side). Any insertion or deletion paired with a "stay-as-is" element would have a cost of 0 and would remain an insertion or deletion, and nothing would happen for two "stay-as-is" elements ending up paired. Note that a stable matching is not necessarily a matching of minimum total cost; if the global minimum is required, the pairing is an instance of the assignment problem, which can be solved with, for example, the Hungarian algorithm.
Key observation: you are only concerned with how many 'a's, 'b's, ..., 'z's or other alphabet characters are in your strings, since you can reorder all the characters in each string.
So, the problem boils down to the following: having s['a'] characters 'a', s['b'] characters 'b', ..., s['z'] characters 'z', transform them into t['a'] characters 'a', t['b'] characters 'b', ..., t['z'] characters 'z'. If your alphabet is short, s[] and t[] can be arrays; generally, they are mappings from the alphabet to integers, like map <char, int> in C++, dict in Python, etc.
Now, for each character c, you know s[c] and t[c]. If s[c] > t[c], you must remove s[c] - t[c] characters c from the first unordered string (s). If s[c] < t[c], you must add t[c] - s[c] characters c to it to reach the second unordered string (t).
Take X, the sum of s[c] - t[c] for all c such that s[c] > t[c], and you will get the number of characters you have to remove from s in total. Take Y, the sum of t[c] - s[c] for all c such that s[c] < t[c], and you will get the number of characters you have to add to s in total.
Now, let Z = min (X, Y). We can have Z substitutions, and what's left is X - Z deletions and Y - Z insertions. Thus the total number of operations is Z + (X - Z) + (Y - Z), or X + Y - min (X, Y), which equals max (X, Y).
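The counting argument can be sketched in Python with `collections.Counter`. This assumes unit costs for every operation, so it does not handle the per-symbol weights the question asks about, and the function name `unordered_edit_ops` is made up for illustration.

```python
from collections import Counter

def unordered_edit_ops(source, target):
    """Count the operations turning the multiset `source` into `target` (unit costs)."""
    s, t = Counter(source), Counter(target)
    x = sum((s - t).values())  # X: surplus symbols that must leave source
    y = sum((t - s).values())  # Y: symbols missing from source
    z = min(x, y)              # Z: removals and additions paired up as substitutions
    return {"substitutions": z, "deletions": x - z,
            "insertions": y - z, "total": max(x, y)}

print(unordered_edit_ops("abc", "cba")["total"])  # 0
```

Counter subtraction keeps only positive counts, which is exactly the "for all c such that s[c] > t[c]" restriction in the text.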

string transposition algorithm

Suppose two strings are given:
String s1= "MARTHA"
String s2= "MARHTA"
Here the positions of T and H have been exchanged. I would like to write code that counts how many changes are necessary to transform one String into the other.
There are several edit distance algorithms; the given Wikipedia link has links to a few.
Assuming that the distance counts only swaps, here is an idea based on permutations, that runs in linear time.
The first step of the algorithm is ensuring that the two strings are really equivalent in their character contents. This can be done in linear time using a hash table (or a fixed array that covers all the alphabet). If they are not, then s2 can't be considered a permutation of s1, and the "swap count" is irrelevant.
The second step counts the minimum number of swaps required to transform s2 to s1. This can be done by inspecting the permutation p that corresponds to the transformation from s1 to s2. For example, if s1="abcde" and s2="badce", then p=(2,1,4,3,5), meaning that position 1 contains element #2, position 2 contains element #1, etc. This permutation can be broken up into permutation cycles in linear time. The cycles in the example are (2,1), (4,3) and (5). The minimum swap count is the total count of the swaps required per cycle. A cycle of length k requires k-1 swaps in order to "fix it". Therefore, the number of swaps is N-C, where N is the string length and C is the number of cycles. In our example, the result is 2 (swap 1,2 and then 3,4).
Now, there are two problems here, and I think I'm too tired to solve them right now :)
1) My solution assumes that no character is repeated, which is not always the case. Some adjustment is needed to calculate the swap count correctly.
2) My formula #MinSwaps=N-C needs a proof... I didn't find it in the web.
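A small Python sketch of the cycle-counting idea, under the same restriction as caveat 1: it assumes no character is repeated and raises an error otherwise. The function name `min_swaps` is made up, and "swap" here means exchanging any two positions, not only adjacent ones.

```python
def min_swaps(s1, s2):
    """Minimum number of position swaps turning s2 into s1, or None if impossible."""
    if len(s1) != len(s2) or sorted(s1) != sorted(s2):
        return None  # s2 is not a permutation of s1
    if len(set(s1)) != len(s1):
        raise ValueError("this sketch assumes no repeated characters")
    position_in_s1 = {ch: i for i, ch in enumerate(s1)}
    # perm[i] is the index in s1 of the character sitting at index i of s2
    perm = [position_in_s1[ch] for ch in s2]
    seen = [False] * len(perm)
    cycles = 0
    for start in range(len(perm)):
        if not seen[start]:
            cycles += 1
            j = start
            while not seen[j]:  # walk the whole cycle once
                seen[j] = True
                j = perm[j]
    return len(perm) - cycles  # N - C

print(min_swaps("abcde", "badce"))  # 2
```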
Your problem is not so easy, since before counting the swaps you need to ensure that every swap reduces the "distance" (in terms of equality) between these two strings. You are then looking not just for a count but for the smallest count (or at least I suppose so), since otherwise there are infinitely many ways to swap one string into the other.
You should first check which characters are already in place, then for every character that is not, look for a pair that can be swapped so that the distance between the strings is reduced. Then iterate until you finish the process.
If you don't want to actually perform the swaps but just count them, use a bit array in which you have 1 for every well-placed character and 0 otherwise. You will finish when every bit is 1.

Fast random selection algorithm

Given an array of true/false values, what is the most efficient algorithm to select an index with a true value at random?
A simple sketch of an algorithm is:
    a <- the array
    c <- 0
    for x in a:
        if x is true: c++
    e <- random integer in [0, c-1]
    j <- -1
    repeat e+1 times:
        j++
        while a[j] is false: j++
    return j
Can anyone come up with a faster algorithm? Maybe there is a way to only walk through the list once even if the number of true elements is not known at first?
Use the "pick a random element from an infinite list" algorithm.
Keep an index of your current pick, and also a count of how many true values you've seen.
When you see a true value, increment the count and then replace your pick with the current index with a probability of P=(1/count). (So you always pick the first one you find... then you might switch to the second one, with probability 1/2, then you might switch to the third one with probability 1/3 etc.)
This requires only one scan over the list and constant storage. (It does require you to work out a larger number of random numbers, however.) In particular, it doesn't ever require you to either buffer the list or go back to the start - so it can work on an unbounded input stream.
See this answer for a sample LINQ implementation of the simple "pick a random element" algorithm; it would just need minor tweaks.
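In Python, the single-pass selection described above might look like this. It is a sketch rather than the linked LINQ implementation; the function name `random_true_index` and the optional `rng` parameter are assumptions made for illustration.

```python
import random

def random_true_index(flags, rng=random):
    """Return a uniformly random index whose value is true, or None if there is none."""
    pick = None
    count = 0
    for i, flag in enumerate(flags):
        if flag:
            count += 1
            # Replace the current pick with probability 1/count.
            if rng.randrange(count) == 0:
                pick = i
    return pick
```

Because `flags` is only iterated once, the same function also works on a generator or any other stream of values.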
Build a list with indexes that point to true values and select one of those at random. Requires O(n) for list traversal and one try for the random number.

Finding dictionary words

I have a lot of compound strings that are a combination of two or three English words.
e.g. "Spicejet" is a combination of the words "spice" and "jet"
I need to separate these individual English words from such compound strings. My dictionary is going to consist of around 100000 words.
What would be the most efficient way to separate the individual English words from such compound strings?
I'm not sure how much time or frequency you have to do this (is it a one-time operation? daily? weekly?) but you're obviously going to want a quick, weighted dictionary lookup.
You'll also want to have a conflict resolution mechanism, perhaps a side-queue to manually resolve conflicts on tuples that have multiple possible meanings.
I would look into Tries. Using one you can efficiently find (and weight) your prefixes, which are precisely what you will be looking for.
You'll have to build the Tries yourself from a good dictionary source, and weight the nodes on full words to provide yourself a good quality mechanism for reference.
Just brainstorming here, but if you know your dataset consists primarily of duplets or triplets, you could probably get away with multiple Trie lookups, for example looking up 'Spic' and then 'ejet', finding that both results have a low score, and then falling back to 'Spice' and 'Jet', where both Tries would yield a good combined result between the two.
Also I would consider utilizing frequency analysis on the most common prefixes up to an arbitrary or dynamic limit, e.g. filtering 'the' or 'un' or 'in' and weighting those accordingly.
Sounds like a fun problem, good luck!
If the aim is to find "the largest possible break up for the input" as you replied, then the algorithm could be fairly straightforward if you use some graph theory. You take the compound word and make a graph with a vertex before and after every letter. You'll have a vertex for each index in the string and one past the end. Next you find all legal words in your dictionary that are substrings of the compound word. Then, for each legal substring, add an edge with weight 1 to the graph connecting the vertex before the first letter in the substring with the vertex after the last letter in the substring. Finally, use a shortest path algorithm to find the path with fewest edges between the first and the last vertex.
The pseudo code is something like this:
    parseWords(compoundWord)
        # Make the graph
        graph = makeGraph()
        N = compoundWord.length
        for index = 0 to N
            graph.addVertex(index)
        # Add the edges for each word
        for index = 0 to N - 1
            for length = 1 to min(N - index, MAX_WORD_LENGTH)
                potentialWord = compoundWord.substr(index, length)
                if dictionary.isElement(potentialWord)
                    graph.addEdge(index, index + length, 1)
        # Now find a list of edges which define the shortest path
        edges = graph.shortestPath(0, N)
        # Change these edges back into words.
        result = makeList()
        for e in edges
            result.add(compoundWord.substr(e.start, e.stop - e.start))
        return result
I, obviously, haven't tested this pseudo-code, and there may be some off-by-one indexing errors, and there isn't any bug-checking, but the basic idea is there. I did something similar to this in school and it worked pretty well. The edge creation loops are O(M * N), where N is the length of the compound word, and M is the maximum word length in your dictionary or N (whichever is smaller). The shortest path algorithm's runtime will depend on which algorithm you pick. Dijkstra's comes most readily to mind. I think its runtime is O(N^2 * log(N)), since the max edges possible is N^2.
You can use any shortest path algorithm. There are several shortest path algorithms which have their various strengths and weaknesses, but I'm guessing that for your case the difference will not be too significant. If, instead of trying to find the fewest possible words to break up the compound, you wanted to find the most possible, then you give the edges negative weights and try to find the shortest path with an algorithm that allows negative weights.
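A runnable Python sketch of the fewest-words variant. Since every edge has weight 1, it uses a plain breadth-first search instead of Dijkstra, and it explores edges on the fly rather than building the graph first. The function name `fewest_words` and its parameters are made up for illustration, and `dictionary` is assumed to be a set of lowercase words.

```python
from collections import deque

def fewest_words(compound, dictionary, max_word_length=None):
    """Split `compound` into the fewest dictionary words, or return None."""
    n = len(compound)
    if max_word_length is None:
        max_word_length = max((len(w) for w in dictionary), default=0)
    prev = {0: None}   # vertex -> vertex it was first reached from
    queue = deque([0])
    while queue:
        start = queue.popleft()
        if start == n:
            break
        for length in range(1, min(n - start, max_word_length) + 1):
            end = start + length
            # An edge start -> end exists when the substring is a word.
            if end not in prev and compound[start:end] in dictionary:
                prev[end] = start
                queue.append(end)
    if n not in prev:
        return None
    # Walk the predecessor links back from the last vertex.
    words = []
    end = n
    while prev[end] is not None:
        words.append(compound[prev[end]:end])
        end = prev[end]
    return words[::-1]

print(fewest_words("spicejet", {"spice", "jet", "spic", "ice"}))  # ['spice', 'jet']
```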
And how will you decide how to divide things? Look around the web and you'll find examples of URLs that turned out to have other meanings.
Assuming you didn't have the capitals to go on, what would you do with these (Ones that come to mind at present, I know there are more.):
PenIsland
KidsExchange
TherapistFinder
The last one is particularly problematic because the troublesome part is two words run together but is not a compound word, the meaning completely changes when you break it.
So, given a word, is it a compound word, composed of two other English words? You could have some sort of lookup table for all such compound words, but if you just examine the candidates and try to match against English words, you will get false positives.
Edit: it looks as if I am going to have to provide some examples. Words I was thinking of include:
accustomednesses != accustomed + nesses
adulthoods != adult + hoods
agreeabilities != agree + abilities
willingest != will + ingest
windlasses != wind + lasses
withstanding != with + standing
yourselves != yours + elves
zoomorphic != zoom + orphic
ambassadorships != ambassador + ships
allotropes != allot + ropes
Here is some python code to try out to make the point. Get yourself a dictionary on disk and have a go:
    def opendict(dictionary=r"g:\words\words(3).txt"):
        # Load the word list (one word per line) into a set for fast lookups.
        with open(dictionary, "r") as f:
            return set(line.strip() for line in f)

    if __name__ == '__main__':
        s = opendict()
        for word in sorted(s):
            if len(word) >= 10:
                # Try split points that keep at least 4 letters on the left
                # and at least 5 on the right.
                for i in range(4, len(word)-4):
                    left, right = word[:i], word[i:]
                    if (left in s) and (right in s):
                        if right not in ('nesses', ):
                            print(word, left, right)
It sounds to me like you want to store your dictionary in a Trie or a DAWG data structure.
A Trie already stores words as compound words. So "spicejet" would be stored as "spice*jet*" where the * denotes the end of a word. All you'd have to do is look up the compound word in the dictionary and keep track of how many end-of-word terminators you hit. From there you would then have to try each substring (in this example, we don't yet know if "jet" is a word, so we'd have to look that up).
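As a rough illustration, a dictionary-of-dictionaries Trie in Python could report every end-of-word marker hit while walking the compound string. The names `TrieNode`, `build_trie` and `word_prefixes` are made up for this sketch.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> TrieNode
        self.is_word = False  # plays the role of the * end-of-word marker

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def word_prefixes(root, text):
    """Return every prefix of `text` that is a complete word in the trie."""
    node = root
    found = []
    for i, ch in enumerate(text):
        node = node.children.get(ch)
        if node is None:
            break
        if node.is_word:
            found.append(text[:i + 1])
    return found

root = build_trie(["spic", "spice", "spicejet", "jet"])
print(word_prefixes(root, "spicejet"))  # ['spic', 'spice', 'spicejet']
```

For each prefix found, the remainder of the string would then be looked up from the root again, as the answer describes for "jet".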
It occurs to me that there are a relatively small number of substrings (minimum length 2) from any reasonable compound word. For example for "spicejet" I get:
'sp', 'pi', 'ic', 'ce', 'ej', 'je', 'et',
'spi', 'pic', 'ice', 'cej', 'eje', 'jet',
'spic', 'pice', 'icej', 'ceje', 'ejet',
'spice', 'picej', 'iceje', 'cejet',
'spicej', 'piceje', 'icejet',
'spiceje', 'picejet'
... 27 substrings.
So, find a function to generate all those (slide a window of length 2, 3, 4 ... len(yourstring) - 1 across your string) and then simply check each of those in a set or hash table.
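One possible way to generate and check those substrings in Python. This is a sketch: the function name `proper_substrings` and the tiny example dictionary are made up.

```python
def proper_substrings(s, min_len=2):
    """All substrings of s with length from min_len up to len(s) - 1."""
    return [s[i:i + k]
            for k in range(min_len, len(s))
            for i in range(len(s) - k + 1)]

candidates = proper_substrings("spicejet")
print(len(candidates))  # 27

dictionary = {"spice", "jet", "ice", "pic"}
print([c for c in candidates if c in dictionary])  # ['pic', 'ice', 'jet', 'spice']
```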
A similar question was asked recently: Word-separating algorithm. If you wanted to limit the number of splits, you would keep track of the number of splits in each of the tuples (so instead of a pair, a triple).
Word existence could be done with a trie, or more simply with a set (i.e. a hash table). Given a suitable function, you could do:
    # Python; is_word(s) is assumed to be a dictionary-membership test
    def splitword(word):
        # word is a string indexed from 0..n-1
        n = len(word)
        for i in range(1, n + 1):
            head = word[:i]  # first i characters
            tail = word[i:]  # everything else
            if is_word(head):
                if i == n:
                    return [head]  # the whole input is a word; return it as a 1-element list
                rest = splitword(tail)
                if rest:  # check whether we successfully split the tail into words
                    return [head] + rest
        return []  # No successful split found, and 'word' is not a word.
Basically, just try the different break points to see if we can make words. The recursion means it will backtrack until a successful split is found.
Of course, this may not find the splits you want. You could modify this to return all possible splits (instead of merely the first found), then do some kind of weighted sum, perhaps, to prefer common words over uncommon words.
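A sketch of the "return all possible splits" modification in Python. The function name `all_splits` is made up, `is_word` is passed in as a parameter, and the weighting of common versus uncommon words is left out.

```python
def all_splits(word, is_word):
    """Return every way to split `word` into a list of dictionary words."""
    if not word:
        return [[]]  # one way to split the empty string: no words at all
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if is_word(head):
            for rest in all_splits(word[i:], is_word):
                results.append([head] + rest)
    return results

words = {"wee", "knight", "week", "night"}
print(all_splits("weeknight", words.__contains__))
# [['wee', 'knight'], ['week', 'night']]
```

The running time can grow exponentially on inputs with many valid splits, so memoising on the remaining suffix may be worth adding for long strings.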
This can be a very difficult problem and there is no simple general solution (there may be heuristics that work for small subsets).
We face exactly this problem in chemistry where names are composed by concatenation of morphemes. An example is:
ethylmethylketone
where the morphemes are:
ethyl methyl and ketone
We tackle this through automata and maximum entropy and the code is available on Sourceforge
http://www.sf.net/projects/oscar3-chem
but be warned that it will take some work.
We sometimes encounter ambiguity and are still finding a good way of reporting it.
To distinguish between penIsland and penisLand would require domain-specific heuristics. The likely interpretation will depend on the corpus being used - no linguistic problem is independent from the domain or domains being analysed.
As another example the string
weeknight
can be parsed as
wee knight
or
week night
Both are "right" in that they obey the form "adj-noun" or "noun-noun". Both make "sense" and which is chosen will depend on the domain of usage. In a fantasy game the first is more probable and in commerce the latter. If you have problems of this sort then it will be useful to have a corpus of agreed usage which has been annotated by experts (technically a "Gold Standard" in Natural Language Processing).
I would use the following algorithm.
1) Start with the sorted list of words to split, and a sorted list of declined words (dictionary).
2) Create a result list of objects, each of which stores the remaining word and the list of matched words.
3) Fill the result list with the words to split as remaining words.
4) Walk through the result list and the dictionary concurrently, always advancing the lesser of the two, in a manner similar to the merge algorithm. In this way you can compare all the possible matching pairs in one pass.
5) Any time you find a match, i.e. a word to split that starts with a dictionary word, replace the entry in the result list with the matching dictionary word and the remaining part. You have to take into account possible multiples.
6) Any time the remaining part is empty, you have found a final result.
7) Any time you don't find a match on the "left side", in other words, every time you increment the result pointer because of no match, delete the corresponding result item. This word has no matches and can't be split.
8) Once you get to the bottom of the lists, you will have a list of partial results. Repeat the loop until this is empty, i.e. go to point 4.
