Is there an efficient solution to this coding interview question? - algorithm

We're given two strings that act as search queries. We need to determine if they're the same.
For example:
Query 1: stock price rate
Query 2: share cost rate
We're also given a list containing where each entry has two words that are synonyms. the words could be repeated meaning a transitive relation exists. Something like this:
[
[cost,price]
[rate,price]
[share,equity]
]
Goal is determine whether the queries mean the same thing.
I've proposed a solution where i group similar meaning words into lists and doing an exhaustive search until we find the word from query1 and then searching it's group for word from query 2. But the interviewer wanted a more efficient approach which i couldn't figure out. Is there a more efficient way to solve this issue?

Here is a solution that would allow to tell if 2 queries are similar in near constant time (O(size of queries)), with precomputing in O(number of words in database).
Precomputing: We assume that you have a list of lists of synonyms L
function build_hashmap(L):
H <- new Hashmap()
i <- 0
for each synonyms_list in L do:
for each word in synonyms_list do:
H[word] <- i
i <- i+1
return H
Now we can test if two words are synonyms using H
function is_synonym(w1, w2, H):
if H[w1] == H[w2]:
return true
else:
return False
From there it should be rather easy to tell if two queries have the same meaning.
Edit:
A fast solution could be to implement 'union-find' algorithm in order to build the hashmap.
Another way would be to first model the words as vertices of a graph, and to add edges for relations of synonymity.
Then you can build your hashmap by finding the connected components of the graph. Finding connected components in a graph can be done by traversing it.

Related

Algorithm to match sequential subset from a list

I am trying to remember the right algorithm to find a subset within a set that matches an element of a list of possible subsets. For example, given the input:
aehfaqptpzzy
and the subset list:
{ happy, sad, indifferent }
we can see that the word "happy" is a match because it is inside the input:
a e h f a q p t p z z y
I am pretty sure there is a specific algorithm to find all such matches, but I cannot remember what it is called.
UPDATE
The above example is not very good because it has letter repetitions, in fact in my problem both the dictionary entries and the input string are sortable sets. For example,
input: acegimnrqvy
dictionary:
{ cgn,
dfr,
lmr,
mnqv,
eg }
So in this example the algorithm would return cgn, mnqv and eg as matches. Also, I would like to find the best set of complementary matches where "best" means longest. So, in the example above the "best" answer would be "cgn mnqv", eg would not be a match because it conflicts with cgn which is a longer match.
I realize that the problem can be done by brute force scan, but that is undesirable because there could be thousands of entries in the dictionary and thousands of values in the input string. If we are trying to find the best set of matches, computability will become an issue.
You can use the Aho - Corrasick algorithm with more than one current states. For each of the input letters, you either stay (skip the letter) or move using the appropriate edge. If two or more "actors" meet at the same place, just merge them to one (if you're interested just in the presence and not counts).
About the complexity - this could be as slow as the naive O(MN) approach, because there can be up to size of dictionary actors. However, in practice, we can make a good use of the fact that many words are substrings of others, because there never won't be more than size of the trie actors, which - compared to the size of the dictionary - tends to be much smaller.

How to determine correspondence between two lists of names?

I have:
1 million university student names and
3 million bank customer names
I manage to convert strings into numerical values based on hashing (similar strings have similar hash values). I would like to know how can I determine correlation between these two sets to see if values are pairing up at least 60%?
Can I achieve this using ICC? How does ICC 2-way random work?
Please kindly answer ASAP as I need this urgently.
This kind of entity resolution etc is normally easy, but I am surprised by the hashing approach here. Hashing loses information that is critical to entity resolution. So, if possible, you shouldn't use hash, rather the original strings.
Assuming using original strings is an option, then you would want to do something like this:
List A (1M), List B (3M)
// First, match the entities that match very well, and REMOVE them.
for a in List A
for b in List B
if compare(a,b) >= MATCH_THRESHOLD // This may be 90% etc
add (a,b) to matchedList
remove a from List A
remove b from List B
// Now, match the entities that match well, and run bipartite matching
// Bipartite matching is required because each entity can match "acceptably well"
// with more than one entity on the other side
for a in List A
for b in List B
compute compare(a,b)
set edge(a,b) = compare(a,b)
If compare(a,b) < THRESHOLD // This seems to be 60%
set edge(a,b) = 0
// Now, run bipartite matcher and take results
The time complexity of this algorithm is O(n1 * n2), which is not very good. There are ways to avoid this cost, but they depend upon your specific entity resolution function. For example, if the last name has to match (to make the 60% cut), then you can simply create sublists in A and B that are partitioned by the first couple of characters of the last name, and just run this algorithm between corresponding list. But it may very well be that last name "Nuth" is supposed to match "Knuth", etc. So, some local knowledge of what your name comparison function is can help you divide and conquer this problem better.

Grouping similar sets algorithm

I have a search engine. The search engine generates results when is searched for a keyword. What I need is to find all other keywords which generate similar results.
For example keyword k1 gives result set R1 = { 1,2,3,4,5,...40 }, which contains up to 40 document ids. And I need to get a list of all other keywords K1 which generate results similar to what k1 generates.
The similarity S(R1, R2) between two result sets R1 and R2 is computed as follows:
2 * (number of same elements both in _R1_ and _R2_) / ( (total number of elements in _R1_) + (total number of elements in _R2_) ). Example: R1 = {1,2,3} and R2 = {2,3,4,5} gives S(R1, R2) = (2*|{2,3}|) / |{1,2,3}| + |{2,3,4,5}| = (2*2)/(3+4) = 4/7 = 0.57.
There are more than 100,000 keywords thus more than 100,000 result sets. So far I only was able to solve this problem the hard way O(N^2), where each result set is comprated to every other set. This takes a lot of time.
Is there someone with a better idea?
Some similar post which not solve the problem completely:
How to store sets, to find similar patterns fast?
efficient algorithm to compare similarity between sets of numbers?
One question are the results in sorted order?
Something which came to mind combine both the sets , sort it and find duplicates. It can be reduced to O(nlogn)
To make the problem be simple, it is supposed that all the key words have 10 results ans k1 is the key word to be compared. You remove 9 results from the set of each key word. Now compare the last result with k1's and the key words with the same last result is what you want. If a key word has 1 result in common with k1, there is only 1% probability that it will remain. A key word with 5 results in common with k1 will have 25% probability to remain. Maybe you will think that 1% is too big, then you can repeat the process above n times and the key word with 1 result in common will have 1%^n probability to remain.
The time is O(N).
Is your similarity criterium fixed, or can we apply a bit of variety to achieve faster search engine?
Alternative:
An alternative that came to my mind:
Given your result set R1, you could go through the documents and create a histogram over other keywords that those documents would be matched to. Then, if given alternative keyword gets, say, at least #R1/2 hits, you list it as "similar".
The big difference is, that you do not consider documents that are not in R1 at all.
Exact?
If you need a solution exact to your requirements, I believe it would suffice to compute R2 set only for those keywords that satisfy the above "alternative" criterium. I think (mathematical proof needed!) that if the "alternative" criterium is not satisfied, there is no chance that yours will be.

Finding dictionary words

I have a lot of compound strings that are a combination of two or three English words.
e.g. "Spicejet" is a combination of the words "spice" and "jet"
I need to separate these individual English words from such compound strings. My dictionary is going to consist of around 100000 words.
What would be the most efficient by which I can separate individual English words from such compound strings.
I'm not sure how much time or frequency you have to do this (is it a one-time operation? daily? weekly?) but you're obviously going to want a quick, weighted dictionary lookup.
You'll also want to have a conflict resolution mechanism, perhaps a side-queue to manually resolve conflicts on tuples that have multiple possible meanings.
I would look into Tries. Using one you can efficiently find (and weight) your prefixes, which are precisely what you will be looking for.
You'll have to build the Tries yourself from a good dictionary source, and weight the nodes on full words to provide yourself a good quality mechanism for reference.
Just brainstorming here, but if you know your dataset consists primarily of duplets or triplets, you could probably get away with multiple Trie lookups, for example looking up 'Spic' and then 'ejet' and then finding that both results have a low score, abandon into 'Spice' and 'Jet', where both Tries would yield a good combined result between the two.
Also I would consider utilizing frequency analysis on the most common prefixes up to an arbitrary or dynamic limit, e.g. filtering 'the' or 'un' or 'in' and weighting those accordingly.
Sounds like a fun problem, good luck!
If the aim is to find the "the largest possible break up for the input" as you replied, then the algorithm could be fairly straightforward if you use some graph theory. You take the compound word and make a graph with a vertex before and after every letter. You'll have a vertex for each index in the string and one past the end. Next you find all legal words in your dictionary that are substrings of the compound word. Then, for each legal substring, add an edge with weight 1 to the graph connecting the vertex before the first letter in the substring with the vertex after the last letter in the substring. Finally, use a shortest path algorithm to find the path with fewest edges between the first and the last vertex.
The pseudo code is something like this:
parseWords(compoundWord)
# Make the graph
graph = makeGraph()
N = compoundWord.length
for index = 0 to N
graph.addVertex(i)
# Add the edges for each word
for index = 0 to N - 1
for length = 1 to min(N - index, MAX_WORD_LENGTH)
potentialWord = compoundWord.substr(index, length)
if dictionary.isElement(potentialWord)
graph.addEdge(index, index + length, 1)
# Now find a list of edges which define the shortest path
edges = graph.shortestPath(0, N)
# Change these edges back into words.
result = makeList()
for e in edges
result.add(compoundWord.substr(e.start, e.stop - e.start + 1))
return result
I, obviously, haven't tested this pseudo-code, and there may be some off-by-one indexing errors, and there isn't any bug-checking, but the basic idea is there. I did something similar to this in school and it worked pretty well. The edge creation loops are O(M * N), where N is the length of the compound word, and M is the maximum word length in your dictionary or N (whichever is smaller). The shortest path algorithm's runtime will depend on which algorithm you pick. Dijkstra's comes most readily to mind. I think its runtime is O(N^2 * log(N)), since the max edges possible is N^2.
You can use any shortest path algorithm. There are several shortest path algorithms which have their various strengths and weaknesses, but I'm guessing that for your case the difference will not be too significant. If, instead of trying to find the fewest possible words to break up the compound, you wanted to find the most possible, then you give the edges negative weights and try to find the shortest path with an algorithm that allows negative weights.
And how will you decide how to divide things? Look around the web and you'll find examples of URLs that turned out to have other meanings.
Assuming you didn't have the capitals to go on, what would you do with these (Ones that come to mind at present, I know there are more.):
PenIsland
KidsExchange
TherapistFinder
The last one is particularly problematic because the troublesome part is two words run together but is not a compound word, the meaning completely changes when you break it.
So, given a word, is it a compound word, composed of two other English words? You could have some sort of lookup table for all such compound words, but if you just examine the candidates and try to match against English words, you will get false positives.
Edit: looks as if I am going to have to go to provide some examples. Words I was thinking of include:
accustomednesses != accustomed + nesses
adulthoods != adult + hoods
agreeabilities != agree + abilities
willingest != will + ingest
windlasses != wind + lasses
withstanding != with + standing
yourselves != yours + elves
zoomorphic != zoom + orphic
ambassadorships != ambassador + ships
allotropes != allot + ropes
Here is some python code to try out to make the point. Get yourself a dictionary on disk and have a go:
from __future__ import with_statement
def opendict(dictionary=r"g:\words\words(3).txt"):
with open(dictionary, "r") as f:
return set(line.strip() for line in f)
if __name__ == '__main__':
s = opendict()
for word in sorted(s):
if len(word) >= 10:
for i in range(4, len(word)-4):
left, right = word[:i], word[i:]
if (left in s) and (right in s):
if right not in ('nesses', ):
print word, left, right
It sounds to me like you want to store you dictionary in a Trie or a DAWG data structure.
A Trie already stores words as compound words. So "spicejet" would be stored as "spicejet" where the * denotes the end of a word. All you'd have to do is look up the compound word in the dictionary and keep track of how many end-of-word terminators you hit. From there you would then have to try each substring (in this example, we don't yet know if "jet" is a word, so we'd have to look that up).
It occurs to me that there are a relatively small number of substrings (minimum length 2) from any reasonable compound word. For example for "spicejet" I get:
'sp', 'pi', 'ic', 'ce', 'ej', 'je', 'et',
'spi', 'pic', 'ice', 'cej', 'eje', 'jet',
'spic', 'pice', 'icej', 'ceje', 'ejet',
'spice', 'picej', 'iceje', 'cejet',
'spicej', 'piceje', 'icejet',
'spiceje' 'picejet'
... 26 substrings.
So, find a function to generate all those (slide across your string using strides of 2, 3, 4 ... (len(yourstring) - 1) and then simply check each of those in a set or hash table.
A similar question was asked recently: Word-separating algorithm. If you wanted to limit the number of splits, you would keep track of the number of splits in each of the tuples (so instead of a pair, a triple).
Word existence could be done with a trie, or more simply with a set (i.e. a hash table). Given a suitable function, you could do:
# python-ish pseudocode
def splitword(word):
# word is a character array indexed from 0..n-1
for i from 1 to n-1:
head = word[:i] # first i characters
tail = word[i:] # everything else
if is_word(head):
if i == n-1:
return [head] # this was the only valid word; return it as a 1-element list
else:
rest = splitword(tail)
if rest != []: # check whether we successfully split the tail into words
return [head] + rest
return [] # No successful split found, and 'word' is not a word.
Basically, just try the different break points to see if we can make words. The recursion means it will backtrack until a successful split is found.
Of course, this may not find the splits you want. You could modify this to return all possible splits (instead of merely the first found), then do some kind of weighted sum, perhaps, to prefer common words over uncommon words.
This can be a very difficult problem and there is no simple general solution (there may be heuristics that work for small subsets).
We face exactly this problem in chemistry where names are composed by concatenation of morphemes. An example is:
ethylmethylketone
where the morphemes are:
ethyl methyl and ketone
We tackle this through automata and maximum entropy and the code is available on Sourceforge
http://www.sf.net/projects/oscar3-chem
but be warned that it will take some work.
We sometimes encounter ambiguity and are still finding a good way of reporting it.
To distinguish between penIsland and penisLand would require domain-specific heuristics. The likely interpretation will depend on the corpus being used - no linguistic problem is independent from the domain or domains being analysed.
As another example the string
weeknight
can be parsed as
wee knight
or
week night
Both are "right" in that they obey the form "adj-noun" or "noun-noun". Both make "sense" and which is chosen will depend on the domain of usage. In a fantasy game the first is more probable and in commerce the latter. If you have problems of this sort then it will be useful to have a corpus of agreed usage which has been annotated by experts (technically a "Gold Standard" in Natural Language Processing).
I would use the following algorithm.
Start with the sorted list of words
to split, and a sorted list of
declined words (dictionary).
Create a result list of objects
which should store: remaining word
and list of matched words.
Fill the result list with the words
to split as remaining words.
Walk through the result array and
the dictionary concurrently --
always increasing the least of the
two, in a manner similar to the
merge algorithm. In this way you can
compare all the possible matching
pairs in one pass.
Any time you find a match, i.e. a
split words word that starts with a
dictionary word, replace the
matching dictionary word and the
remaining part in the result list.
You have to take into account
possible multiples.
Any time the remaining part is empty,
you found a final result.
Any time you don't find a match on
the "left side", in other words,
every time you increment the result
pointer because of no match, delete
the corresponding result item. This
word has no matches and can't be
split.
Once you get to the bottom of the
lists, you will have a list of
partial results. Repeat the loop
until this is empty -- go to point 4.

Ordering a dictionary to maximize common letters between adjacent words

This is intended to be a more concrete, easily expressable form of my earlier question.
Take a list of words from a dictionary with common letter length.
How to reorder this list tto keep as many letters as possible common between adjacent words?
Example 1:
AGNI, CIVA, DEVA, DEWA, KAMA, RAMA, SIVA, VAYU
reorders to:
AGNI, CIVA, SIVA, DEVA, DEWA, KAMA, RAMA, VAYU
Example 2:
DEVI, KALI, SHRI, VACH
reorders to:
DEVI, SHRI, KALI, VACH
The simplest algorithm seems to be: Pick anything, then search for the shortest distance?
However, DEVI->KALI (1 common) is equivalent to DEVI->SHRI (1 common)
Choosing the first match would result in fewer common pairs in the entire list (4 versus 5).
This seems that it should be simpler than full TSP?
What you're trying to do, is calculate the shortest hamiltonian path in a complete weighted graph, where each word is a vertex, and the weight of each edge is the number of letters that are differenct between those two words.
For your example, the graph would have edges weighted as so:
DEVI KALI SHRI VACH
DEVI X 3 3 4
KALI 3 X 3 3
SHRI 3 3 X 4
VACH 4 3 4 X
Then it's just a simple matter of picking your favorite TSP solving algorithm, and you're good to go.
My pseudo code:
Create a graph of nodes where each node represents a word
Create connections between all the nodes (every node connects to every other node). Each connection has a "value" which is the number of common characters.
Drop connections where the "value" is 0.
Walk the graph by preferring connections with the highest values. If you have two connections with the same value, try both recursively.
Store the output of a walk in a list along with the sum of the distance between the words in this particular result. I'm not 100% sure ATM if you can simply sum the connections you used. See for yourself.
From all outputs, chose the one with the highest value.
This problem is probably NP complete which means that the runtime of the algorithm will become unbearable as the dictionaries grow. Right now, I see only one way to optimize it: Cut the graph into several smaller graphs, run the code on each and then join the lists. The result won't be as perfect as when you try every permutation but the runtime will be much better and the final result might be "good enough".
[EDIT] Since this algorithm doesn't try every possible combination, it's quite possible to miss the perfect result. It's even possible to get caught in a local maximum. Say, you have a pair with a value of 7 but if you chose this pair, all other values drop to 1; if you didn't take this pair, most other values would be 2, giving a much better overall final result.
This algorithm trades perfection for speed. When trying every possible combination would take years, even with the fastest computer in the world, you must find some way to bound the runtime.
If the dictionaries are small, you can simply create every permutation and then select the best result. If they grow beyond a certain bound, you're doomed.
Another solution is to mix the two. Use the greedy algorithm to find "islands" which are probably pretty good and then use the "complete search" to sort the small islands.
This can be done with a recursive approach. Pseudo-code:
Start with one of the words, call it w
FindNext(w, l) // l = list of words without w
Get a list l of the words near to w
If only one word in list
Return that word
Else
For every word w' in l do FindNext(w', l') //l' = l without w'
You can add some score to count common pairs and to prefer "better" lists.
You may want to take a look at BK-Trees, which make finding words with a given distance to each other efficient. Not a total solution, but possibly a component of one.
This problem has a name: n-ary Gray code. Since you're using English letters, n = 26. The Wikipedia article on Gray code describes the problem and includes some sample code.

Resources