Given a dictionary, find all possible letter orderings - algorithm

I was recently asked the following interview question:
You have a dictionary page written in an alien language. Assume that
the language is similar to English and is read/written from left to
right. Also, the words are arranged in lexicographic order. For
example the page could be: ADG, ADH, BCD, BCF, FM, FN
You have to give all lexicographic orderings possible of the character
set present in the page.
My approach is as follows:
A has higher precedence than B and G has higher precedence than H.
Therefore we have the information about ordering for some characters:
A->B, B->F, G->H, D->F, M->N
The possible orderings can be ABDFGHMNC, ACBDFGHMN, ...
My approach was to use an array as position holder and generate all permutations to identify all valid orderings. The worst case time complexity for this is N! where N is the size of character set.
Can we do better than the brute force approach?
Thanks in advance.

Donald Knuth has written the paper A Structured Program to Generate all Topological Sorting Arrangements. This paper was originally published in 1974. The following quote from the paper brought me to a better understanding of the problem (in the text the relation i < j stands for "i precedes j"):
A natural way to solve this problem is to let x_1 be an
element having no predecessors, then to erase all relations of the
form x_1 < j and to let x_2 be an element ≠
x_1 with no predecessors in the system as it now exists,
then to erase all relations of the form x_2 < j, etc. It is
not difficult to verify that this method will always succeed unless
there is an oriented cycle in the input. Moreover, in a sense it is
the only way to proceed, since x_1 must be an element
without predecessors, and x_2 must be without predecessors
when all relations x_1 < j are deleted, etc. This
observation leads naturally to an algorithm that finds all
solutions to the topological sorting problem; it is a typical example
of a "backtrack" procedure, where at every stage we consider a
subproblem of the form "Find all ways to complete a given partial
permutation x_1 x_2 ... x_k to a
topological sort x_1 x_2 ... x_n." The
general method is to branch on all possible choices of
x_{k+1}. A central problem in backtrack applications is
to find a suitable way to arrange the data so that it is easy to
sequence through the possible choices of x_{k+1}; in this
case we need an efficient way to discover the set of all elements ∉
{x_1,...,x_k} which have no predecessors other
than x_1,...,x_k, and to maintain this knowledge
efficiently as we move from one subproblem to another.
The paper includes pseudocode for an efficient algorithm. The time complexity for each output is O(m+n), where m is the number of input relations and n is the number of letters. I have written a C++ program that implements the algorithm described in the paper, keeping its variable and function names, and takes the letters and relations from your question as input. I hope that nobody objects to a C++ program in this answer despite the language-agnostic tag.
#include <algorithm> // std::fill_n
#include <iostream>
#include <deque>
#include <vector>
#include <iterator>
#include <map>
// Define Input
static const char input[] =
{ 'A', 'D', 'G', 'H', 'B', 'C', 'F', 'M', 'N' };
static const char crel[][2] =
{{'A', 'B'}, {'B', 'F'}, {'G', 'H'}, {'D', 'F'}, {'M', 'N'}};
static const int n = sizeof(input) / sizeof(char);
static const int m = sizeof(crel) / sizeof(*crel);
std::map<char, int> count;
std::map<char, int> top;
std::map<int, char> suc;
std::map<int, int> next;
std::deque<char> D;
std::vector<char> buffer;
void alltopsorts(int k)
{
if (D.empty())
return;
char base = D.back();
do
{
char q = D.back();
D.pop_back();
buffer[k] = q;
if (k == (n - 1))
{
for (std::vector<char>::const_iterator cit = buffer.begin();
cit != buffer.end(); ++cit)
std::cout << (*cit);
std::cout << std::endl;
}
// erase relations beginning with q:
int p = top[q];
while (p >= 0)
{
char j = suc[p];
count[j]--;
if (!count[j])
D.push_back(j);
p = next[p];
}
alltopsorts(k + 1);
// retrieve relations beginning with q:
p = top[q];
while (p >= 0)
{
char j = suc[p];
if (!count[j])
D.pop_back();
count[j]++;
p = next[p];
}
D.push_front(q);
}
while (D.back() != base);
}
int main()
{
// Prepare
std::fill_n(std::back_inserter(buffer), n, 0);
for (int i = 0; i < n; i++) {
count[input[i]] = 0;
top[input[i]] = -1;
}
for (int i = 0; i < m; i++) {
suc[i] = crel[i][1]; next[i] = top[crel[i][0]];
top[crel[i][0]] = i; count[crel[i][1]]++;
}
for (std::map<char, int>::const_iterator cit = count.begin();
cit != count.end(); ++cit)
if (!(*cit).second)
D.push_back((*cit).first);
alltopsorts(0);
}

There is no algorithm that can do better than O(N!) if there are N! answers. But I think there is a better way to understand the problem:
You can build a directed graph in this way: if the dictionary shows that A must come before B, add an edge from A to B. After building the graph, you just need to find all of its topological orderings. This is still O(N!) in the worst case, but it is easier to code and better than your approach, because it never generates an invalid ordering.
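To make the graph idea concrete, here is a minimal Python sketch of enumerating every topological ordering by backtracking. The function name `all_topological_orders` and its argument shapes are my own choices for illustration, not something from the question. It assumes the relations contain no cycle; with a cycle it simply yields nothing.

```python
from collections import defaultdict


def all_topological_orders(letters, relations):
    """Yield every ordering of `letters` consistent with `relations`.

    `relations` is an iterable of (before, after) pairs.
    """
    successors = defaultdict(list)
    indegree = {ch: 0 for ch in letters}
    for before, after in relations:
        successors[before].append(after)
        indegree[after] += 1

    order = []
    used = set()

    def backtrack():
        if len(order) == len(indegree):
            yield "".join(order)
            return
        for ch in indegree:
            # Only letters with no remaining predecessors may come next.
            if ch not in used and indegree[ch] == 0:
                used.add(ch)
                order.append(ch)
                for nxt in successors[ch]:
                    indegree[nxt] -= 1   # "erase" relations starting at ch
                yield from backtrack()
                for nxt in successors[ch]:
                    indegree[nxt] += 1   # restore them when backtracking
                order.pop()
                used.remove(ch)

    yield from backtrack()


if __name__ == "__main__":
    rels = [("A", "B"), ("B", "F"), ("G", "H"), ("D", "F"), ("M", "N")]
    for ordering in all_topological_orders("ADGHBCFMN", rels):
        print(ordering)
```

Each partial ordering is only extended with letters that are still allowed, so no invalid permutation is ever generated; the work is proportional to the number of valid orderings times a polynomial factor.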

I would solve it like this:
Look at the first letter of every word: (A -> B -> F)
Look at the second letter, but only compare words that share the same first letter: (D), (C), (M -> N)
Look at the third letter, but only compare words that share the same first and second letters: (G -> H), (D -> F)
And so on, as long as anything remains (look at the Nth letter, grouping by the preceding letters).
The parentheses contain all the ordering information you can get from the word list. Ignore parentheses with only one letter, because they do not express any ordering. Then take everything in the parentheses and sort topologically.
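A small Python sketch of the extraction step, under one simplification that is mine and not part of the answer above: instead of grouping by prefix, it compares each word only with its direct successor and records the first differing position. For a sorted list this yields the pairs that the grouping produces between neighbouring words. The function name `precedence_relations` is hypothetical.

```python
def precedence_relations(words):
    """Return the set of (before, after) letter pairs implied by a sorted word list."""
    relations = set()
    for first, second in zip(words, words[1:]):
        for a, b in zip(first, second):
            if a != b:
                relations.add((a, b))
                break  # only the first differing position carries information
    return relations


if __name__ == "__main__":
    page = ["ADG", "ADH", "BCD", "BCF", "FM", "FN"]
    print(sorted(precedence_relations(page)))
```

For the page from the question this gives A->B, B->F, D->F, G->H and M->N, which can then be fed into a topological sort.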

ok, i admit straight away that i don't have an estimate of time complexity for the average case, but maybe the following two observations will help.
first, this is an obvious candidate for a constraint library. if you were doing this in practice (like, it was some task at work) then you would get a constraint solver, give it the various pair-wise orderings you have, and then ask for a list of all results.
second, that is typically implemented as a search. if you have N characters consider a tree whose root node has N children (selection of the first character); next node has N-1 children (selection of second character); etc. clearly this is N! worst case for full exploration.
even with a "dumb" search, you can see that you can often prune searches by checking your order at any point against the pairs that you have.
but since you know that a total ordering exists, even though you (may) only have partial information, you can make the search more efficient. for example, you know that the first character must not appear to the "right" of < for any pair (if we assume that each character is given a numerical value, with the first character being lowest). similarly, moving down the tree, for the appropriately reduced data.
in short, you can enumerate possible solutions by exploring a tree, using the incomplete ordering information to constrain possible choices at each node.
hope that helps some.

Related

Proving that there are no overlapping sub-problems?

I just got the following interview question:
Given a list of float numbers, insert “+”, “-”, “*” or “/” between each consecutive pair of numbers to find the maximum value you can get. For simplicity, assume that all operators are of equal precedence order and evaluation happens from left to right.
Example:
(1, 12, 3) -> 1 + 12 * 3 = 39
If we built a recursive solution, we would find that we would get an O(4^N) solution. I tried to find overlapping sub-problems (to increase the efficiency of this algorithm) and wasn't able to find any overlapping problems. The interviewer then told me that there wasn't any overlapping subsolutions.
How can we detect when there are overlapping subproblems and when there aren't? I spent a lot of time trying to "force" subproblems to appear, and eventually the interviewer told me that there weren't any.
My current solution looks as follows:
def maximumNumber(array, current_value=None):
    if current_value is None:
        current_value = array[0]
        array = array[1:]
    if len(array) == 0:
        return current_value
    return max(
        maximumNumber(array[1:], current_value * array[0]),
        maximumNumber(array[1:], current_value - array[0]),
        maximumNumber(array[1:], current_value / array[0]),
        maximumNumber(array[1:], current_value + array[0])
    )
Looking for "overlapping subproblems" sounds like you're trying to do bottom up dynamic programming. Don't bother with that in an interview. Write the obvious recursive solution. Then memoize. That's the top down approach. It is a lot easier to get working.
You may get challenged on that. Here was my response the last time that I was asked about that.
There are two approaches to dynamic programming, top down and bottom up. The bottom up approach usually uses less memory but is harder to write. Therefore I do the top down recursive/memoize and only go for the bottom up approach if I need the last ounce of performance.
It is a perfectly true answer, and I got hired.
Now you may notice that tutorials about dynamic programming spend more time on bottom up. They often even skip the top down approach. They do that because bottom up is harder. You have to think differently. It does provide more efficient algorithms because you can throw away parts of that data structure that you know you won't use again.
Coming up with a working solution in an interview is hard enough already. Don't make it harder on yourself than you need to.
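As an illustration of "write the recursion, then memoize", here is a rough Python sketch for the operator problem. It is my own example, not code from the interview: the names `maximum_number` and `best` are made up, the input is assumed to be non-empty, and division by zero is simply skipped. The cache key is the pair (position, running value), so it only saves work when different operator choices reach the same running value at the same position.

```python
from functools import lru_cache


def maximum_number(numbers):
    """Best left-to-right result of inserting +, -, *, / between the numbers."""
    numbers = tuple(numbers)

    @lru_cache(maxsize=None)
    def best(index, current):
        # `current` is the value accumulated from numbers[0..index-1].
        if index == len(numbers):
            return current
        x = numbers[index]
        candidates = [
            best(index + 1, current + x),
            best(index + 1, current - x),
            best(index + 1, current * x),
        ]
        if x != 0:  # skip division by zero
            candidates.append(best(index + 1, current / x))
        return max(candidates)

    return best(1, numbers[0])


if __name__ == "__main__":
    print(maximum_number([1, 12, 3]))
```

With inputs containing many zeros or ones the cache gets hit often; with "random" floats almost every state is distinct and the memoized version does the same work as the plain recursion.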
EDIT Here is the DP solution that the interviewer thought didn't exist.
def find_best(floats):
    current_answers = {floats[0]: ()}
    floats = floats[1:]
    for f in floats:
        next_answers = {}
        # items() works on Python 3 (the original used Python 2's iteritems())
        for v, path in current_answers.items():
            next_answers[v + f] = (path, '+')
            next_answers[v * f] = (path, '*')
            next_answers[v - f] = (path, '-')
            if 0 != f:
                next_answers[v / f] = (path, '/')
        current_answers = next_answers
    best_val = max(current_answers.keys())
    return (best_val, current_answers[best_val])
Generally the overlapping sub problem approach is something where the problem is broken down into smaller sub problems, the solutions to which when combined solve the big problem. When these sub problems exhibit an optimal sub structure DP is a good way to solve it.
The decision about what you do with a new number that you encounter has little do with the numbers you have already processed. Other than accounting for signs of course.
So I would say this is an overlapping-subproblem solution but not a dynamic programming problem. You could use divide and conquer or even more straightforward recursive methods.
Initially let's forget about negative floats.
process each new float according to the following rules
If the new float is less than 1, insert a / before it
If the new float is more than 1 insert a * before it
If it is 1 then insert a +.
If you see a zero just don't divide or multiply
This would solve it for all positive floats.
Now let's handle the case of negative numbers thrown into the mix.
Scan the input once to figure out how many negative numbers you have.
Isolate all the negative numbers in a list, convert all the numbers whose absolute value is less than 1 to the multiplicative inverse. Then sort them by magnitude. If you have an even number of elements we are all good. If you have an odd number of elements store the head of this list in a special var , say k, and associate a processed flag with it and set the flag to False.
Proceed as before with some updated rules
If you see a negative number less than 0 but more than -1, insert a / divide before it
If you see a negative number less than -1, insert a * before it
If you see the special var and the processed flag is False, insert a - before it. Set processed to True.
There is one more optimization you can perform, which is removing pairs of negative ones as candidates for blanket subtraction from our initial negative numbers list, but this is just an edge case and I'm pretty sure your interviewer won't care.
Now the sum is only a function of the number you are adding and not the sum you are adding to :)
Computing max/min results for each operation from previous step. Not sure about overall correctness.
Time complexity O(n), space complexity O(n)
const max_value = (nums) => {
  const ops = [(a, b) => a+b, (a, b) => a-b, (a, b) => a*b, (a, b) => a/b]
  // dp[i] holds one [max, min] pair per operator applied in front of nums[i]
  const dp = Array.from({length: nums.length}, _ => [])
  dp[0] = [[nums[0], nums[0]]]
  for (let i = 1; i < nums.length; i++) {
    for (let j = 0; j < ops.length; j++) {
      // If the current number is zero, skip division for this step only
      if (nums[i] === 0 && j === 3) continue
      let mx = -Infinity
      let mn = Infinity
      for (const [prevMax, prevMin] of dp[i-1]) {
        const opMax = ops[j](prevMax, nums[i])
        const opMin = ops[j](prevMin, nums[i])
        mx = Math.max(opMax, opMin, mx)
        mn = Math.min(opMax, opMin, mn)
      }
      dp[i].push([mx,mn])
    }
  }
  return Math.max(...dp[nums.length-1].map(v => Math.max(...v)))
}
// Tests
console.log(max_value([1, 12, 3]))
console.log(max_value([1, 0, 3]))
console.log(max_value([17,-34,2,-1,3,-4,5,6,7,1,2,3,-5,-7]))
console.log(max_value([59, 60, -0.000001]))
console.log(max_value([0, 1, -0.0001, -1.00000001]))

Convert string a to b using a dictionary of words

You have a dictionary of words and two strings a and b.
How can one convert a to b by changing only one character at a time and making sure that all the intermediate words are in the dictionary?
Example:
dictionary: {"cat", "bat", "hat", "bad", "had"}
a = "bat"
b = "had"
solution:
"bat" -> "bad" -> "had"
EDIT: The solutions given below propose building a graph from the dictionary words such that every word will have an edge to all other words differing by just one character.
This may be somewhat difficult if the dictionary is too big (let us say we are not talking about english language words only).
Also, even if this is acceptable, what is the best algorithm to create such a graph? Finding the edges from one word to all other words would be O(n), where n is the dictionary size, so total graph construction would be O(n^2)? Is there any better algorithm?
This is not a homework problem but an interview question.
You can think of this as a graph search problem. Each word is a node in the graph, and there is an edge between two words if they differ by exactly one letter. Running a BFS over this graph will then find the shortest path between your start word and the destination word (if it's possible to turn one word into the other) and will report that there is no way to do this otherwise.
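Here is a hedged Python sketch of that BFS. To avoid comparing every pair of words, it groups words under "wildcard" keys such as `b*t`, so neighbours are found through shared keys instead of an O(n^2) scan. That bucketing trick is my addition, not part of the answer above, and it assumes the character `*` never occurs in a word. The function name `word_ladder` is hypothetical.

```python
from collections import defaultdict, deque


def word_ladder(start, goal, dictionary):
    """Return a shortest list of words from start to goal, or None if impossible."""
    words = {w for w in dictionary if len(w) == len(start)} | {start}

    # Words that differ in exactly one position share one wildcard key.
    buckets = defaultdict(list)
    for w in words:
        for i in range(len(w)):
            buckets[w[:i] + "*" + w[i + 1:]].append(w)

    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        word = queue.popleft()
        if word == goal:
            path = []
            while word is not None:
                path.append(word)
                word = parent[word]
            return path[::-1]
        for i in range(len(word)):
            for nxt in buckets[word[:i] + "*" + word[i + 1:]]:
                if nxt not in parent:
                    parent[nxt] = word
                    queue.append(nxt)
    return None


if __name__ == "__main__":
    print(word_ladder("bat", "had", {"cat", "bat", "hat", "bad", "had"}))
```

For the example dictionary there are two shortest routes ("bat" -> "bad" -> "had" and "bat" -> "hat" -> "had"); which one is returned depends on iteration order. Building the buckets costs roughly n times the word length in dictionary operations, which is usually far below n^2 word comparisons.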
Simply do a BFS over the graph whose nodes are the words and there is an edge between two nodes iff the words on the nodes differ by one letter. In this way, you could provide a solution by starting BFS from the start word given. If you reach the destination node, then it's possible, otherwise not.
You could also provide the steps taken and note that you would be providing the least number of steps to derive the required as a bonus.
P.S.: It's a coincidence that this question was asked to me too in an interview and I coded this solution!
How can one convert a to b by changing only one character at a time
and making sure that all the intermediate words are in the dictionary?
This is straight O(nm)
where n is number of words in the dictionary
and m is number of characters in the input word
The algorithm is simple: if a word from the dictionary differs from the input in exactly one character, consider it a solution:
FOR EACH WORD W IN DICTIONARY DO
IF SIZE(W) = SIZE(INPUT) THEN
MIS = 0
FOR i: 1..SIZE(INPUT) IF W[i] != INPUT[i] THEN MIS = MIS + 1
IF MIS = 1 THEN SOLUTION.ADD(W)
END-IF
END-FOR
Pre-build and re-use a travel map.
For example, build a scity[][] with valid word distance, that can be re-used.
Just a quick-exercise for job hunting, might be simplified.
#include <stdio.h> /* printf */
#define SLEN 10
char* dict[SLEN]={
"bat",
"hat",
"bad",
"had",
"mad",
"tad",
"het",
"hep",
"hady",
"bap"};
int minD=0xfffff;
int edst(char *a, char *b)
{
char *ip=a,*op=b;
int d=0;
while((*ip)&&(*op))
if(*ip++!=*op++)
{
if(d) return 0;
d++;
}
if((*op)||(*ip)) d++;
return d;
}
/* Named slen to avoid clashing with the standard library's strlen. */
int slen(char *a)
{
    char *ip=a;
    int i=0;
    while(*ip++)
        i++;
    return i;
}
int valid(char *dict[], int a, int b)
{
    if((a==b)||(slen(dict[a])!=slen(dict[b]))||(edst(dict[a],dict[b])!=1)) return 0;
    return 1;
}
void sroute(int scity[SLEN][SLEN], char* dict[], int a[], int end, int pos)
{
int i,j,d=0;
if(a[pos]==end)
{
for(i=pos;i<(SLEN-1);i++)
{
printf("%s ",dict[a[i]]);
d+=scity[a[i]][a[i+1]];
}
printf(" %s=%d\n",dict[a[SLEN-1]],d);
if(d<minD) minD=d;
return;
}
for(i=pos-2;i>=0;i--)
{
int b[SLEN];
for(j=0;j<SLEN;j++) b[j]=a[j];
b[pos-1]=a[i];
b[i]=a[pos-1];
if(scity[b[pos-1]][b[pos]]==1)
sroute(scity,dict,b,end,pos-1);
}
if(scity[a[pos-1]][a[pos]]==1) sroute(scity,dict,a,end,pos-1);
}
void initS(int scity[SLEN][SLEN], char* dict[], int a, int b)
{
int i,j;
int c[SLEN];
for(i=0;i<SLEN;i++)
for(j=0;j<SLEN;j++)
scity[i][j]=valid(dict,i,j);
for(i=0;i<SLEN;i++) c[i]=i;
c[SLEN-1]=b;
c[b]=SLEN-1;
sroute(scity, dict, c, a, SLEN-1);
printf("min=%d\n",minD);
}

Is it possible to rearrange an array in place in O(N)?

If I have a size N array of objects, and I have an array of unique numbers in the range 1...N, is there any algorithm to rearrange the object array in-place in the order specified by the list of numbers, and yet do this in O(N) time?
Context: I am doing a quick-sort-ish algorithm on objects that are fairly large in size, so it would be faster to do the swaps on indices than on the objects themselves, and only move the objects in one final pass. I'd just like to know if I could do this last pass without allocating memory for a separate array.
Edit: I am not asking how to do a sort in O(N) time, but rather how to do the post-sort rearranging in O(N) time with O(1) space. Sorry for not making this clear.
I think this should do:
static <T> void arrange(T[] data, int[] p) {
    boolean[] done = new boolean[p.length];
    for (int i = 0; i < p.length; i++) {
        if (!done[i]) {
            T t = data[i];
            for (int j = i;;) {
                done[j] = true;
                if (p[j] != i) {
                    data[j] = data[p[j]];
                    j = p[j];
                } else {
                    data[j] = t;
                    break;
                }
            }
        }
    }
}
Note: This is Java. If you do this in a language without garbage collection, be sure to delete done.
If you care about space, you can use a BitSet for done. I assume you can afford an additional bit per element because you seem willing to work with a permutation array, which is several times that size.
This algorithm copies instances of T n + k times, where k is the number of cycles in the permutation. You can reduce this to the optimal number of copies by skipping those i where p[i] = i.
The approach is to follow the "permutation cycles" of the permutation, rather than indexing the array left-to-right. But since you do have to begin somewhere, everytime a new permutation cycle is needed, the search for unpermuted elements is left-to-right:
// Pseudo-code
N : integer, N > 0 // N is the number of elements
swaps : integer [0..N]
data[N] : array of object
permute[N] : array of integer [-1..N] denoting permutation (used element is -1)
next_scan_start : integer;
swaps = 0;
next_scan_start = 0;
while (swaps < N)
{
    // Search for the next index that is not yet permuted.
    for (idx_cycle_search = next_scan_start;
         idx_cycle_search < N;
         ++idx_cycle_search)
        if (permute[idx_cycle_search] >= 0)
            break;
    next_scan_start = idx_cycle_search + 1;
    // This is a provable invariant. In short, number of non-negative
    // elements in permute[] equals (N - swaps)
    assert( idx_cycle_search < N );
    // Completely permute one permutation cycle, 'following the
    // permutation cycle's trail'. This is O(N) over all cycles.
    while (permute[idx_cycle_search] >= 0)
    {
        next_idx = permute[idx_cycle_search];
        // Skip the swap on the last element of the cycle: it already
        // holds the right object, and swapping again would undo the work.
        if (permute[next_idx] >= 0)
            swap( data[idx_cycle_search], data[next_idx] )
        swaps ++;
        permute[idx_cycle_search] = -1;
        idx_cycle_search = next_idx;
        // Also '= -next_idx - 1' could be used rather than '-1'
        // and would allow reversal of these changes to permute[] array
    }
}
Do you mean that you have an array of objects O[1..N] and then you have an array P[1..N] that contains a permutation of numbers 1..N and in the end you want to get an array O1 of objects such that O1[k] = O[P[k]] for all k=1..N ?
As an example, if your objects are letters A,B,C...,Y,Z and your array P is [26,25,24,..,2,1] is your desired output Z,Y,...C,B,A ?
If yes, I believe you can do it in linear time using only O(1) additional memory. Reversing elements of an array is a special case of this scenario. In general, I think you would need to consider decomposition of your permutation P into cycles and then use it to move around the elements of your original array O[].
If that's what you are looking for, I can elaborate more.
EDIT: Others already presented excellent solutions while I was sleeping, so no need to repeat it here. ^_^
EDIT: My O(1) additional space is indeed not entirely correct. I was thinking only about "data" elements, but in fact you also need to store one bit per permutation element, so if we are precise, we need O(n) extra bits for that. But most of the time using a sign bit (as suggested by J.F. Sebastian) is fine, so in practice we may not need anything more than we already have.
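A minimal Python sketch of the sign-bit idea mentioned above, assuming a 0-based permutation stored in a mutable list and the convention new data[k] == old data[perm[k]]. The function name `apply_permutation_in_place` and the encoding `-p - 1` for "visited" are my own choices; the permutation array is restored before returning.

```python
def apply_permutation_in_place(data, perm):
    """Rearrange data so that new data[k] == old data[perm[k]].

    Visited entries of perm are temporarily stored as -p - 1, so no
    separate 'done' array is needed; perm is restored at the end.
    """
    n = len(perm)
    for start in range(n):
        if perm[start] < 0:
            continue  # already handled as part of an earlier cycle
        first = data[start]
        j = start
        while True:
            p = perm[j]
            perm[j] = -p - 1  # mark as visited, keeping p recoverable
            if p == start:    # cycle closes: place the saved object
                data[j] = first
                break
            data[j] = data[p]
            j = p
    for i in range(n):
        perm[i] = -perm[i] - 1  # undo the marking


if __name__ == "__main__":
    objects = ["d", "e", "a", "c", "b"]
    order = [2, 4, 3, 0, 1]
    apply_permutation_in_place(objects, order)
    print(objects, order)
```

Every element is copied once (plus one temporary per cycle), so the pass is O(N) and the only scratch space is a couple of local variables, at the price of temporarily modifying the index array.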
If you didn't mind allocating memory for an extra hash of indexes, you could keep a mapping of original location to current location to get a time complexity of near O(n). Here's an example in Ruby, since it's readable and pseudocode-ish. (This could be shorter or more idiomatically Ruby-ish, but I've written it out for clarity.)
#!/usr/bin/ruby
objects = ['d', 'e', 'a', 'c', 'b']
order = [2, 4, 3, 0, 1]
cur_locations = {}
order.each_with_index do |orig_location, ordinality|
# Find the current location of the item.
cur_location = orig_location
while not cur_locations[cur_location].nil? do
cur_location = cur_locations[cur_location]
end
# Swap the items and keep track of whatever we swapped forward.
objects[ordinality], objects[cur_location] = objects[cur_location], objects[ordinality]
cur_locations[ordinality] = orig_location
end
puts objects.join(' ')
That obviously does involve some extra memory for the hash, but since it's just for indexes and not your "fairly large" objects, hopefully that's acceptable. Since hash lookups are O(1), even though there is a slight bump to the complexity due to the case where an item has been swapped forward more than once and you have to rewrite cur_location multiple times, the algorithm as a whole should be reasonably close to O(n).
If you wanted you could build a full hash of original to current positions ahead of time, or keep a reverse hash of current to original, and modify the algorithm a bit to get it down to strictly O(n). It'd be a little more complicated and take a little more space, so this is the version I wrote out, but the modifications shouldn't be difficult.
EDIT: Actually, I'm fairly certain the time complexity is just O(n), since each ordinality can have at most one hop associated, and thus the maximum number of lookups is limited to n.
#!/usr/bin/env python
def rearrange(objects, permutation):
    """Rearrange `objects` inplace according to `permutation`.

       ``result = [objects[p] for p in permutation]``
    """
    seen = [False] * len(permutation)
    for i, already_seen in enumerate(seen):
        if not already_seen: # start permutation cycle
            first_obj, j = objects[i], i
            while True:
                seen[j] = True
                p = permutation[j]
                if p == i: # end permutation cycle
                    objects[j] = first_obj # [old] p -> j
                    break
                objects[j], j = objects[p], p # p -> j
The algorithm (as I noticed after I wrote it) is the same as the one from @meriton's answer in Java.
Here's a test function for the code:
def test():
    import itertools
    N = 9
    for perm in itertools.permutations(range(N)):
        L = list(range(N))  # list() is needed on Python 3, where range is lazy
        LL = L[:]
        rearrange(L, perm)
        assert L == [LL[i] for i in perm] == list(perm), (L, list(perm), LL)

    # test whether assertions are enabled
    try:
        assert 0
    except AssertionError:
        pass
    else:
        raise RuntimeError("assertions must be enabled for the test")

if __name__ == "__main__":
    test()
There's a histogram sort, though the running time is given as a bit higher than O(N) (N log log n).
I can do it given O(N) scratch space -- copy to new array and copy back.
EDIT: I am aware of the existence of an algorithm that will proceed through. The idea is to perform the swaps on the array of integers 1..N while at the same time mirroring the swaps on your array of large objects. I just cannot find the algorithm right now.
The problem is one of applying a permutation in place with minimal O(1) extra storage: "in-situ permutation".
It is solvable, but an algorithm is not obvious beforehand.
It is described briefly as an exercise in Knuth, and for work I had to decipher it and figure out how it worked. Look at 5.2 #13.
For some more modern work on this problem, with pseudocode:
http://www.fernuni-hagen.de/imperia/md/content/fakultaetfuermathematikundinformatik/forschung/berichte/bericht_273.pdf
I ended up writing a different algorithm for this, which first generates a list of swaps to apply an order and then runs through the swaps to apply it. The advantage is that if you're applying the ordering to multiple lists, you can reuse the swap list, since the swap algorithm is extremely simple.
#include <string>
#include <utility>
#include <vector>
using namespace std;
void make_swaps(vector<int> order, vector<pair<int,int>> &swaps)
{
// order[0] is the index in the old list of the new list's first value.
// Invert the mapping: inverse[0] is the index in the new list of the
// old list's first value.
vector<int> inverse(order.size());
for(int i = 0; i < order.size(); ++i)
inverse[order[i]] = i;
swaps.resize(0);
for(int idx1 = 0; idx1 < order.size(); ++idx1)
{
// Swap list[idx] with list[order[idx]], and record this swap.
int idx2 = order[idx1];
if(idx1 == idx2)
continue;
swaps.push_back(make_pair(idx1, idx2));
// list[idx1] is now in the correct place, but whoever wanted the value we moved out
// of idx2 now needs to look in its new position.
int idx1_dep = inverse[idx1];
order[idx1_dep] = idx2;
inverse[idx2] = idx1_dep;
}
}
// data is taken by reference so that the swaps are visible to the caller.
template<typename T>
void run_swaps(T &data, const vector<pair<int,int>> &swaps)
{
for(const auto &s: swaps)
{
int src = s.first;
int dst = s.second;
swap(data[src], data[dst]);
}
}
void test()
{
vector<int> order = { 2, 3, 1, 4, 0 };
vector<pair<int,int>> swaps;
make_swaps(order, swaps);
vector<string> data = { "a", "b", "c", "d", "e" };
run_swaps(data, swaps);
}

Most common substring of length X

I have a string s and I want to search for the substring of length X that occurs most often in s. Overlapping substrings are allowed.
For example, if s="aoaoa" and X=3, the algorithm should find "aoa" (which appears 2 times in s).
Does an algorithm exist that does this in O(n) time?
You can do this using a rolling hash in O(n) time (assuming good hash distribution). A simple rolling hash would be the xor of the characters in the string, you can compute it incrementally from the previous substring hash using just 2 xors. (See the Wikipedia entry for better rolling hashes than xor.) Compute the hash of your n-x+1 substrings using the rolling hash in O(n) time. If there were no collisions, the answer is clear - if collisions happen, you'll need to do more work. My brain hurts trying to figure out if that can all be resolved in O(n) time.
Update:
Here's a randomized O(n) algorithm. You can find the top hash in O(n) time by scanning the hashtable (keeping it simple, assume no ties). Find one X-length string with that hash (keep a record in the hashtable, or just redo the rolling hash). Then use an O(n) string searching algorithm to find all occurrences of that string in s. If you find the same number of occurrences as you recorded in the hashtable, you're done.
If not, that means you have a hash collision. Pick a new random hash function and try again. If your hash function has log(n)+1 bits and is pairwise independent [Prob(h(s) == h(t)) < 1/2^{log(n)+1} if s != t], then the probability that the most frequent x-length substring in s has a collision with the <=n other length x substrings of s is at most 1/2. So if there is a collision, pick a new random hash function and retry; you will need only a constant expected number of tries before you succeed.
Now we only need a randomized pairwise independent rolling hash algorithm.
Update2:
Actually, you need 2log(n) bits of hash to avoid all (n choose 2) collisions because any collision may hide the right answer. Still doable, and it looks like hashing by general polynomial division should do the trick.
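For what it's worth, here is a rough Python sketch of the count-then-verify loop described above. It swaps in a polynomial rolling hash modulo 2^61 - 1 with a random base instead of a formally pairwise-independent family, so the probabilistic analysis is not carried over exactly. The names `most_common_substring`, `count_overlapping` and `max_tries` are made up for this example, and `str.find` stands in for a guaranteed linear-time string search.

```python
import random


def count_overlapping(s, sub):
    """Count occurrences of sub in s, overlaps included."""
    count = 0
    pos = s.find(sub)
    while pos != -1:
        count += 1
        pos = s.find(sub, pos + 1)
    return count


def most_common_substring(s, x, max_tries=20):
    """Return (substring, count) for the most frequent length-x substring of s."""
    n = len(s)
    if x <= 0 or x > n:
        return None
    mod = (1 << 61) - 1
    for _ in range(max_tries):
        base = random.randrange(256, mod - 1)
        high = pow(base, x - 1, mod)  # weight of the character leaving the window
        h = 0
        for ch in s[:x]:
            h = (h * base + ord(ch)) % mod
        counts = {h: 1}
        first_pos = {h: 0}
        for i in range(1, n - x + 1):
            h = (h - ord(s[i - 1]) * high) % mod       # drop the oldest character
            h = (h * base + ord(s[i + x - 1])) % mod   # append the new character
            counts[h] = counts.get(h, 0) + 1
            first_pos.setdefault(h, i)
        best = max(counts, key=counts.get)
        candidate = s[first_pos[best]:first_pos[best] + x]
        actual = count_overlapping(s, candidate)
        # A true count can never exceed its bucket count, so if the top bucket
        # is collision-free its string really is the most frequent one.
        if actual == counts[best]:
            return candidate, actual
    raise RuntimeError("too many hash collisions; try again")


if __name__ == "__main__":
    print(most_common_substring("aoaoa", 3))
```

Only the top bucket is verified: every other substring occurs at most as often as its own bucket count, which is at most the top bucket count, so a collision elsewhere cannot hide a better answer once the top bucket checks out.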
I don't see an easy way to do this in strictly O(n) time, unless X is fixed and can be considered a constant. If X is a parameter to the algorithm, then most simple ways of doing this will actually be O(n*X), as you will need to do comparison operations, string copies, hashes, etc., on a substring of length X at every iteration.
(I'm imagining, for a minute, that s is a multi-gigabyte string, and that X is some number over a million, and not seeing any simple ways of doing string comparison, or hashing substrings of length X, that are O(1), and not dependent on the size of X)
It might be possible to avoid string copies during scanning, by leaving everything in place, and to avoid re-hashing the entire substring -- perhaps by using an incremental hash algorithm where you can add a byte at a time, and remove the oldest byte -- but I don't know of any such algorithms that wouldn't result in huge numbers of collisions that would need to be filtered out with an expensive post-processing step.
Update
Keith Randall points out that this kind of hash is known as a rolling hash. It still remains, though, that you would have to store the starting string position for each match in your hash table, and then verify after scanning the string that all of your matches were true. You would need to sort the hashtable, which could contain n-X entries, based on the number of matches found for each hash key, and verify each result -- probably not doable in O(n).
It should be O(n*m) where m is the average length of a string in the list. For very small values of m then the algorithm will approach O(n)
Build a hashtable of counts for each string length
Iterate over your collection of strings, updating the hashtable accordingly, storing the current most prevalent number as an integer variable separate from the hashtable
done.
Naive solution in Python
from collections import defaultdict
from operator import itemgetter

def naive(s, X):
    freq = defaultdict(int)
    for i in range(len(s) - X + 1):
        freq[s[i:i+X]] += 1
    return max(freq.items(), key=itemgetter(1))

print(naive("aoaoa", 3))
# -> ('aoa', 2)
In plain English
Create mapping: substring of length X -> how many times it occurs in the s string
for i in range(len(s) - X + 1):
    freq[s[i:i+X]] += 1
Find a pair in the mapping with the largest second item (frequency)
max(freq.items(), key=itemgetter(1))
Here is a version I did in C. Hope that it helps.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void)
{
    const char *string = "aoaoa";
    char *maxstring = NULL, *tmpstr = NULL, *tmpstr2 = NULL;
    size_t n = 3, len = strlen(string), i = 0, j = 0;
    unsigned int matchcount = 0, maxcount = 0;
    for (i = 0; i + n <= len; i++) {
        tmpstr = malloc(n + 1);
        strncpy(tmpstr, string + i, n);
        tmpstr[n] = '\0'; /* terminate inside the n + 1 byte buffer */
        for (j = 0; j + n <= len; j++) {
            tmpstr2 = malloc(n + 1);
            strncpy(tmpstr2, string + j, n);
            tmpstr2[n] = '\0';
            if (!strcmp(tmpstr, tmpstr2))
                matchcount++;
            free(tmpstr2);
        }
        if (matchcount > maxcount) {
            free(maxstring); /* release the previous best, if any */
            maxstring = tmpstr;
            maxcount = matchcount;
        } else {
            free(tmpstr);
        }
        matchcount = 0;
    }
    printf("max string: \"%s\", count: %u\n", maxstring, maxcount);
    free(maxstring);
    return 0;
}
You can build a tree of sub-strings. The idea is to organise your sub-strings like a telephone book. You then look up the sub-string and increase its count by one.
In your example above, the tree will have sections (nodes) starting with the letters: 'a' and 'o'. 'a' appears three times and 'o' appears twice. So those nodes will have a count of 3 and 2 respectively.
Next, under the 'a' node a sub-node of 'o' will appear corresponding to the sub-string 'ao'. This appears twice. Under the 'o' node 'a' also appears twice.
We carry on in this fashion until we reach the end of the string.
A representation of the tree for 'abac' might be (nodes on the same level are separated by a comma, sub-nodes are in brackets, counts appear after the colon).
a:2(b:1(a:1(c:1())),c:1()),b:1(a:1(c:1())),c:1()
If the tree is drawn out it will be a lot more obvious! What this all says, for example, is that the string 'aba' appears once and the string 'a' appears twice. Storage is greatly reduced and, more importantly, retrieval is greatly sped up (compare this to keeping a list of sub-strings).
To find out which sub-string is most repeated, do a depth-first search of the tree; every time a leaf node is reached, note the count and keep track of the highest one.
I am not sure about the exact running time. Looking up one sub-string takes time proportional to its length, and building the tree down to depth X takes about O(n*X) steps, which beats O(n^2) when X is much smaller than n.
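Here is a minimal Python sketch of the tree described above, not the original poster's code. It assumes nested dicts where each node maps a character to a `[count, children]` pair, and it caps the depth at the sub-string length X so that the counts at depth X are the ones compared. The names `build_substring_trie` and `most_common_at_depth` are made up for illustration.

```python
def build_substring_trie(s, max_depth):
    """Insert every sub-string of s (up to max_depth characters) into a trie.

    Each node is a dict: character -> [count, children].
    """
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:start + max_depth]:
            entry = node.setdefault(ch, [0, {}])
            entry[0] += 1          # one more sub-string passes through here
            node = entry[1]
    return root


def most_common_at_depth(root, depth):
    """Depth-first search returning (sub-string, count) with the top count."""
    best = ("", 0)
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        for ch, (count, children) in node.items():
            word = prefix + ch
            if len(word) == depth:
                if count > best[1]:
                    best = (word, count)
            else:
                stack.append((word, children))
    return best


trie = build_substring_trie("aoaoa", 3)
print(trie["a"][0], trie["o"][0])     # 3 2
print(most_common_at_depth(trie, 3))  # ('aoa', 2)
```

With the depth capped at X, building the tree costs roughly one dict operation per character of each window, so about n*X steps in total.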
Python-3 Solution:
from collections import Counter
string = "aoaoa"  # example input (assumed)
K = 3             # substring length (assumed)
# collect every substring of length K
substrings = [string[i:i + K] for i in range(len(string) - K + 1)]
# now find the most common value in this list
# you can do this natively, but I prefer using collections
most_frequent = Counter(substrings).most_common(1)[0][0]
print(most_frequent)
Here is the native way to get the most common (for those that are interested):
most_occurrences = 0
current_most = ""
for sub in substrings:
    frequency = substrings.count(sub)
    if frequency > most_occurrences:
        most_occurrences = frequency
        current_most = sub
print(f"{current_most}, Occurrences: {most_occurrences}")
Reference: Extract K length substrings (GeeksforGeeks), https://www.geeksforgeeks.org/python-extract-k-length-substrings/
LZW algorithm does this
This is exactly what the Lempel-Ziv-Welch (LZW) compression algorithm, used in the GIF image format, does. It finds prevalent repeated byte sequences and replaces them with something shorter.
LZW on Wikipedia
There's no way to do this in O(n).
Feel free to downvote me if you can prove me wrong on this one, but I've got nothing.

Fastest way to find most similar string to an input?

Given a query string Q of length N, and a list L of M sequences of length exactly N, what is the most efficient algorithm to find the string in L with the fewest mismatch positions to Q? For example:
Q = "ABCDEFG";
L = ["ABCCEFG", "AAAAAAA", "TTAGGGT", "ZYXWVUT"];
answer = L.query(Q); # Returns "ABCCEFG"
answer2 = L.query("AAAATAA"); #Returns "AAAAAAA".
The obvious way is to scan every sequence in L, making the search take O(M * N). Is there any way to do this in sublinear time? I don't care if there's a large upfront cost to organizing L into some data structure because it will be queried a lot of times. Also, handling tied scores arbitrarily is fine.
Edit: To clarify, I am looking for the Hamming distance.
All the answers except the one that mentions the best-first algorithm are very much off.
Locality-sensitive hashing is basically dreaming. This is the first time I have seen answers this far off on Stack Overflow.
First, this is a hard but standard problem that was solved many years ago in different ways.
One approach uses a trie such as the one presented by Sedgewick here:
http://www.cs.princeton.edu/~rs/strings/
Sedgewick also has sample C code.
I quote from the paper titled "Fast Algorithms for Sorting and Searching Strings" by Bentley and Sedgewick:
"'Near neighbor' queries locate all words within a given Hamming distance of a query word (for instance, code is distance 2 from soda). We give a new algorithm for near neighbor searching in strings, present a simple C implementation, and describe experiments on its efficiency."
A second approach is to use indexing. Split the strings into character n-grams and index them with an inverted index (google for Lucene spell checker to see how it's done).
Use the index to pull potential candidates and then run Hamming distance or edit distance on the candidates. This is the approach guaranteed to work best (and relatively simple).
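A rough Python sketch of the indexing idea, under assumptions of my own: since all strings have the same length and only Hamming distance matters, the n-grams are keyed by position, which is not how Lucene's spell checker does it. The function names are hypothetical. Note that this is a heuristic: a string that shares no positional n-gram with the query is only considered when no candidate is found at all, and the list of strings is assumed to be non-empty.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def build_ngram_index(strings, n=3):
    """Inverted index: (position, n-gram) -> set of string ids."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for pos in range(len(s) - n + 1):
            index[(pos, s[pos:pos + n])].add(sid)
    return index


def query_index(index, strings, q, n=3):
    """Pull candidates sharing a positional n-gram with q, then rank them."""
    candidates = set()
    for pos in range(len(q) - n + 1):
        candidates |= index.get((pos, q[pos:pos + n]), set())
    if not candidates:
        # Nothing shares an n-gram: fall back to scanning everything.
        candidates = range(len(strings))
    best = min(candidates, key=lambda sid: hamming(strings[sid], q))
    return strings[best]


L = ["ABCCEFG", "AAAAAAA", "TTAGGGT", "ZYXWVUT"]
index = build_ngram_index(L)
print(query_index(index, L, "ABCDEFG"))  # ABCCEFG
print(query_index(index, L, "AAAATAA"))  # AAAAAAA
```

Smaller n pulls more candidates (slower, more accurate); larger n pulls fewer.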
A third appears in the area of speech recognition. There the query is a wav signal, and the database is a set of strings. There is a "table" that matches pieces of the signal to pieces of words. The goal is to find the best match of words to signal. This problem is known as word alignment.
In the problem posted, there is an implicit cost of matching query parts to database parts.
For example one may have different costs for deletion/insertion/substitution and even
different costs for mismatching say "ph" with "f".
The standard solution in speech recognition uses a dynamic programming approach, which is made efficient via heuristics that direct pruning. In this way, only the best candidates (say, 50) are kept; hence the name best-first search. In theory, you may not get the best match, but usually one gets a good match.
Here is a reference to the latter approach:
http://amta2010.amtaweb.org/AMTA/papers/2-02-KoehnSenellart.pdf
Fast Approximate String Matching with Suffix Arrays and A* Parsing.
This approach applies not only to words but to sentences.
Locality sensitive hashing underlies what seems to be the asymptotically best method known, as I understand it from this review article in CACM. Said article is pretty hairy and I didn't read it all. See also nearest neighbor search.
To relate these references to your problem: they all deal with a set of points in a metric space, such as an n-dimensional vector space. In your problem, n is the length of each string, and the values on each coordinate are the characters that can appear at each position in a string.
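For what it's worth, the simplest locality-sensitive hash family for Hamming distance that I know of is position sampling: hash each string by the characters at a few randomly chosen positions, and repeat with several independent choices. Below is a hedged Python sketch of that idea, not the method from the review article. The class name `PositionSamplingLSH` and the parameters `k`, `tables`, and `seed` are invented for illustration, and the result is approximate: the true nearest string can be missed if it never lands in the same bucket as the query.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


class PositionSamplingLSH:
    """Approximate nearest neighbour under Hamming distance.

    Assumes a non-empty list of strings that all have the same length.
    """

    def __init__(self, strings, k=2, tables=4, seed=0):
        rng = random.Random(seed)
        n = len(strings[0])
        self.strings = strings
        # Each table hashes a string by the characters at k sampled positions.
        self.projections = [sorted(rng.sample(range(n), k))
                            for _ in range(tables)]
        self.tables = []
        for positions in self.projections:
            buckets = defaultdict(list)
            for sid, s in enumerate(strings):
                buckets[self._key(s, positions)].append(sid)
            self.tables.append(buckets)

    @staticmethod
    def _key(s, positions):
        return tuple(s[p] for p in positions)

    def query(self, q):
        candidates = set()
        for positions, buckets in zip(self.projections, self.tables):
            candidates.update(buckets.get(self._key(q, positions), ()))
        if not candidates:
            # No bucket collision: fall back to a full scan.
            candidates = range(len(self.strings))
        best = min(candidates, key=lambda sid: hamming(self.strings[sid], q))
        return self.strings[best]


L = ["ABCCEFG", "AAAAAAA", "TTAGGGT", "ZYXWVUT"]
lsh = PositionSamplingLSH(L)
print(lsh.query("ABCDEFG"))  # ABCCEFG
print(lsh.query("AAAATAA"))  # AAAAAAA
```

Larger `k` makes buckets more selective; more `tables` raises the chance that a close string collides with the query at least once.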
The "best" method will vary significantly depending on your input set and query set. Having a fixed message length will let you treat this problem in a classification context.
An information theoretic decision tree algorithm (like C4.5, for example) will provide the best overall guarantee on performance. In order to get optimal performance out of this method, you must first cluster the string indices into features based on mutual information. Note that you will need to modify the classifier to return all leaf nodes at the last branch, then compute a partial edit distance for each of them. The edit distance only needs to be calculated for the feature set represented by the last split of the tree.
Using this technique, querying should be ~ O(k log n), k << m, where k is the expectation of the feature size, m is the length of the string, and n is the number of comparison sequences.
The initial setup on this is guaranteed to be less than O(m^2 + n*t^2), t < m, t * k ~ m, where t is the feature count for an item. This is very reasonable and should not require any serious hardware.
These very nice performance numbers are possible because of the fixed m constraint. Enjoy!
I think you are looking for the Levenshtein edit distance.
There are a few questions here on SO about this already; I suppose you can find some good answers there.
You could treat each sequence as an N-dimensional coordinate, chunk the resulting space into blocks that know what sequences occur in them, then on a lookup first search the search sequence's block and all contiguous blocks, then expand outward as necessary. (Maintaining several scopes of chunking is probably more desirable than getting into searching really large groups of blocks.)
Are you looking for the Hamming distance between the strings (i.e. the number of different characters at equivalent locations)?
Or does the distance "between" characters (e.g. difference between ASCII values of English letters) matter to you as well?
Some variety of best-first search on the target sequences will do much better than O(M * N). The basic idea is that you compare the first character of your candidate sequence with the first character of each target sequence, then in the second iteration do the next-character comparison only with the sequences that have the fewest mismatches so far, and so on. In your first example, you'd wind up comparing against ABCCEFG and AAAAAAA the second time, ABCCEFG only the third and fourth times, all the sequences the fifth time, and only ABCCEFG thereafter. When you get to the end of your candidate sequence, the set of target sequences with the lowest mismatch count is your match set.
(Note: at each step you're comparing against the next character for that branch of the search. None of the progressive comparisons skip characters.)
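One way to sketch this in Python is with a priority queue keyed on the mismatch count so far; the heap and the name `best_first_match` are my own choices, not something from the description above. Each heap entry tracks how far one target has been compared, and the target with the fewest mismatches is always advanced next (ties go to the target that has been compared furthest). The worst case is still O(M * N), for example when every target is equally bad.

```python
import heapq


def best_first_match(query, targets):
    """Return (target, mismatches) for a target with the fewest mismatches.

    Assumes every target has the same length as query.
    """
    # Entries are (mismatches so far, -characters compared, target index).
    heap = [(0, 0, i) for i in range(len(targets))]
    heapq.heapify(heap)
    while heap:
        mismatches, neg_pos, idx = heapq.heappop(heap)
        pos = -neg_pos
        if pos == len(query):
            # The first target to reach the end cannot be beaten, because
            # mismatch counts never decrease as more characters are compared.
            return targets[idx], mismatches
        if targets[idx][pos] != query[pos]:
            mismatches += 1
        heapq.heappush(heap, (mismatches, -(pos + 1), idx))
    return None, None  # no targets given


L = ["ABCCEFG", "AAAAAAA", "TTAGGGT", "ZYXWVUT"]
print(best_first_match("ABCDEFG", L))  # ('ABCCEFG', 1)
print(best_first_match("AAAATAA", L))  # ('AAAAAAA', 1)
```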
I can't think of a general, exact algorithm which will be less than O(N * M), but if you have a small enough M and N you can make an algorithm which performs as (N + M) using bit-parallel operations.
For example, if N and M are both less than 16, you could use an N * M lookup table of 64 bit ints ( 16*log2(16) = 64), and perform all operations in one pass through the string, where each group of 4 bits in the counter counts 0-15 for one of the strings being matched. Obviously you need M log2(N+1) bits to store the counters, so you might need to update multiple values for each character, but often a single pass lookup can be faster than other approaches. So it's actually O( N * M log(N) ), just with a lower constant factor - using 64 bit ints introduces a 1/64 into it, so should be better if log2(N) < 64. If M log2(N+1) < 64, it works out as (N+M) operations. But that's still linear, rather than sub-linear.
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <inttypes.h>
size_t match ( const char* string, uint64_t table[][128] ) ;
int main ()
{
    const char* data[] = { "ABCCEFG", "AAAAAAA", "TTAGGGT", "ZYXWVUT" };
    const size_t N = 7;
    const size_t M = 4;
    // prepare a table: table[j][c] has a 1 in the 4-bit group i if data[i][j] == c
    // (assumes 7-bit ASCII input)
    uint64_t table[7][128] = { { 0 } };
    for ( size_t i = 0; i < M; ++i )
        for ( size_t j = 0; j < N; ++j )
            table[j][ (size_t)data[i][j] ] |= (uint64_t)1 << (i * 4);
    const char* examples[] = { "ABCDEFG", "AAAATAA", "TTAGQQT", "ZAAGVUT" };
    for ( size_t i = 0; i < 4; ++i ) {
        const char* q = examples[i];
        size_t result = match ( q, table );
        printf("Q(%s) -> %zu %s\n", q, result, data[result]);
    }
}
size_t match ( const char* string, uint64_t table[][128] )
{
    uint64_t count = 0;
    // scan through string once, updating all counters at once;
    // each 4-bit group ends up holding the number of matching positions
    for ( size_t i = 0; string[i]; ++i )
        count += table[i][ (size_t) string[i] ];
    // find greatest sub-count within count (most matches = fewest mismatches)
    size_t best = 0;
    size_t best_sub_count = count & 0xf;
    for ( size_t i = 1; i < 4; ++i ) {
        size_t sub_count = ( count >>= 4 ) & 0xf;
        if ( sub_count > best_sub_count ) {
            best_sub_count = sub_count;
            best = i;
        }
    }
    return best;
}
Sorry for bumping this old thread
Searching elementwise would mean a complexity of O(M*N*N): O(M) for searching and O(N*N) for calculating the Levenshtein distance.
The OP is looking for an efficient way to find the smallest Hamming distance (c), not the string itself. If you have an upper bound on c (say X), you can find the smallest c in O(log(X)*M*N).
As Stefan pointed out, you can quickly find strings within a given Hamming distance. This page http://blog.faroo.com/2015/03/24/fast-approximate-string-matching-with-large-edit-distances/ talks about one such way using tries. Modify this to just test whether such a string exists, and binary search on c from 0 to X.
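A rough Python sketch of the binary-search idea. It assumes a plain nested-dict trie and a simple bounded-mismatch walk, not the data structure from the linked page, and it assumes every stored string has the same length as the query. `build_trie`, `exists_within`, and `smallest_distance` are hypothetical names.

```python
def build_trie(words):
    """Nested-dict trie: each node maps a character to its child node."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
    return root


def exists_within(node, query, budget, pos=0):
    """True if some stored string differs from query in <= budget positions."""
    if pos == len(query):
        return True
    for ch, child in node.items():
        cost = 0 if ch == query[pos] else 1
        # Prune branches that would exceed the mismatch budget.
        if cost <= budget and exists_within(child, query, budget - cost, pos + 1):
            return True
    return False


def smallest_distance(root, query, upper):
    """Binary search for the smallest c in [0, upper]; None if none exists."""
    if not exists_within(root, query, upper):
        return None
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi) // 2
        if exists_within(root, query, mid):
            hi = mid          # a match exists within mid, try smaller
        else:
            lo = mid + 1      # nothing within mid, need a larger budget
    return lo


L = ["ABCCEFG", "AAAAAAA", "TTAGGGT", "ZYXWVUT"]
trie = build_trie(L)
print(smallest_distance(trie, "ABCDEFG", 7))  # 1
print(smallest_distance(trie, "ZAAGVUT", 7))  # 3
```

The binary search is valid because the test is monotone: if a string exists within distance c, one also exists within any larger distance.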
If the up-front cost doesn't matter, you could calculate the best match for every possible input and put the results in a hash map.
Of course, this won't work unless N is extremely small.

Resources