Interview: How to verify that one string is included in another string? - algorithm

The general idea is to use two for loops: take each character from string 1 and compare it to every character of string 2; if all of them are found, that indicates inclusion.
So we need to loop over all the characters of string 1 and, for each one, look at all the characters of string 2, which gives O(n^2) running time.
The interviewer said this is not a good idea.
I have been thinking about it since, but I cannot come up with an idea that avoids the two loops.
Perhaps I could first take all the characters of string 1, convert them to ASCII codes, and build those numbers into a tree, so that the comparison against string 2 becomes a fast search.
Does anyone have a better idea?
For example, if string 1 is abc and string 2 is cbattt, then every character of string 1 is included in string 2.
I mean character inclusion, not substrings.

As iccthedral says, Boyer-Moore is probably what the interviewer was looking for.

Searching a text for a given pattern (pattern matching) is a very well-known problem. Known solutions:
KMP
witness table
Boyer-Moore
suffix tree
All the solutions vary in some minor aspects, such as whether they can be generalized to 2D pattern matching or more, whether they need pre-processing, whether they can be generalized to an unbounded alphabet, running time, etc.
EDIT:
If you just want to know whether all the letters of some string appear in some other string, why not use a table the size of your alphabet indicating whether a given char can be found in the string? If the alphabet is unbounded or extremely large (more than O(1) symbols), use a hash table.
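A minimal sketch of that idea in Java, using a HashSet as the "table" so it also covers the unbounded-alphabet case (the class and method names are just illustrative):

```java
import java.util.HashSet;
import java.util.Set;

public class CharInclusion {
    // Returns true if every character of needle occurs somewhere in haystack.
    // O(n + m) time instead of the naive O(n * m) double loop.
    static boolean includesAllChars(String haystack, String needle) {
        Set<Character> seen = new HashSet<>();
        for (char c : haystack.toCharArray()) {
            seen.add(c);
        }
        for (char c : needle.toCharArray()) {
            if (!seen.contains(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(includesAllChars("cbattt", "abc")); // true
        System.out.println(includesAllChars("cbttt", "abc"));  // false
    }
}
```

For a small fixed alphabet (e.g. lowercase ASCII), a boolean[26] works just as well as the set.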

Related

Matching a set of strings against a string to maximize the number of possible matches

I have a very interesting problem.
I have a set of strings, and I would like to know how to best match a combination of these strings inside another string against a maximization function.
An example: say I have the set
['aabbcaa', 'bbc']
and I have the string
'fgabbcdaabbcaaef'
and the possible matches for this are:
fga[bbc]daa[bbc]aaef
or
fga[bbc]d[aabbcaa]ef
Now, given a simple maximization function, I would say that fga[bbc]d[aabbcaa]ef is the winner due to the total number of characters matched. A different maximization function could give more weight to longer words matched rather than to total characters.
I would love it if someone could point me to some algorithms for this. What I'm stumped by is that after I find the set of potential matches, I'm not sure how to choose the optimal subset of words efficiently.
The dictionary, the words of the dictionary, and the word that’s being matched against, could be of any size.
Would appreciate any help I could get with this. Thank you!
Found the answer, and it works nicely. Pseudocode:
Loop over the set and find every place where each set string matches in the target string. Store the start_index and end_index, and give the match a score; I currently use the length of the string.
Then, using all the matches found, run them through the "Weighted Interval Scheduling" algorithm to find the optimal set of matches:
https://courses.cs.washington.edu/courses/cse521/13wi/slides/06dp-sched.pdf
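A sketch of that pseudocode in Java; the class and method names are mine, the score is the match length as described above, and the DP is the textbook weighted-interval-scheduling recurrence from the linked slides:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class BestMatchSet {
    // One occurrence of a set string in the target; end is exclusive.
    record Match(int start, int end, int score) {}

    // Step 1: collect every occurrence of every word, scored by its length.
    static List<Match> findMatches(String target, List<String> words) {
        List<Match> matches = new ArrayList<>();
        for (String w : words) {
            int i, from = 0;
            while ((i = target.indexOf(w, from)) >= 0) {
                matches.add(new Match(i, i + w.length(), w.length()));
                from = i + 1; // also catch overlapping occurrences
            }
        }
        return matches;
    }

    // Step 2: weighted interval scheduling. dp[k] is the best total score
    // using only the first k matches when sorted by end index.
    static int bestScore(List<Match> matches) {
        matches.sort(Comparator.comparingInt(Match::end));
        int n = matches.size();
        int[] dp = new int[n + 1];
        for (int k = 1; k <= n; k++) {
            Match m = matches.get(k - 1);
            // p = how many earlier matches end at or before m starts
            // (a binary search here makes the whole DP O(n log n))
            int p = 0;
            while (p < k - 1 && matches.get(p).end() <= m.start()) p++;
            dp[k] = Math.max(dp[k - 1], dp[p] + m.score());
        }
        return dp[n];
    }

    public static void main(String[] args) {
        List<Match> ms = findMatches("fgabbcdaabbcaaef", List.of("aabbcaa", "bbc"));
        System.out.println(bestScore(ms)); // 10 = "bbc" (3) + "aabbcaa" (7)
    }
}
```

Recovering the winning set itself (not just its score) is the usual traceback over the dp array.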

Comparing word to targeted words in dictionary

I'm trying to write a program in Java that stores a dictionary in a HashMap (each word under a different key) and compares a given word to the words in the dictionary, coming up with a spelling suggestion if it is not found in the dictionary -- basically a spell-check program.
I already came up with the comparison algorithm (i.e. Needleman-Wunsch, then Levenshtein distance), etc., but got stuck when it came to figuring out which words in the dictionary HashMap to compare the word to, e.g. "hellooo".
I cannot compare "ohelloo" (which should be corrected to "hello") to each word in the dictionary, because that would take too long, and I cannot compare it only to the dictionary words starting with 'o', because it's supposed to be "hello".
Any ideas?
The most common spelling mistakes are:
Delete a letter (smaller word OR word split)
Swap adjacent letters
Alter a letter (QWERTY-adjacent letters)
Insert a letter
Some reports say that 70-90% of mistakes fall into the above categories (edit distance 1).
Take a look at the link below, which provides a solution for single or double mistakes (edit distance 1 or 2). Almost everything you'll need is there!
How to write a spelling corrector
FYI: You can find implementations in various programming languages at the bottom of the aforementioned article. I've used it in some of my projects; practical accuracy is really good, sometimes 95%+ as claimed by the author.
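A Java sketch of the edit-distance-1 candidate generation that article describes (the class name Edits1 is mine; a real spell checker would keep only the candidates present in the dictionary):

```java
import java.util.HashSet;
import java.util.Set;

public class Edits1 {
    static final String LETTERS = "abcdefghijklmnopqrstuvwxyz";

    // All strings at edit distance 1 from word: deletions, adjacent
    // transpositions, replacements, and insertions.
    static Set<String> edits1(String word) {
        Set<String> edits = new HashSet<>();
        for (int i = 0; i < word.length(); i++)                   // delete
            edits.add(word.substring(0, i) + word.substring(i + 1));
        for (int i = 0; i < word.length() - 1; i++)               // transpose
            edits.add(word.substring(0, i) + word.charAt(i + 1)
                    + word.charAt(i) + word.substring(i + 2));
        for (int i = 0; i < word.length(); i++)                   // replace
            for (char c : LETTERS.toCharArray())
                edits.add(word.substring(0, i) + c + word.substring(i + 1));
        for (int i = 0; i <= word.length(); i++)                  // insert
            for (char c : LETTERS.toCharArray())
                edits.add(word.substring(0, i) + c + word.substring(i));
        return edits;
    }
}
```

For edit distance 2, apply edits1 to each result of edits1 and intersect with the dictionary again.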
--Based on OP's comment--
If you don't want to pre-compute every possible alteration and then search the map, I suggest you use a patricia trie (radix tree) instead of a HashMap. Unfortunately, you will again need to handle "first-letter mistakes" (e.g. remove the first letter, swap the first letter with the second, or just replace it with a QWERTY-adjacent one), but then you can limit your search with high probability.
You can even combine it with an extra index map or trie of "reversed" words, or an extra index that omits the first N characters (e.g. the first 2), so you can catch errors that occur in the prefix.

How to split a word into different ways such that it is a concatenation of two other words

I just came across this interesting question online and am quite stumped as to how to even progress on it.
Write a function that finds all the different ways you can split up a word into a
concatenation of two other words.
Is this something that Suffix Trees are used for?
I'm not looking for code, just conceptual way to move forward with this.
Some pseudocode:
foreach place you can split the word:
    split the word
    check if both sides are valid words
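A minimal runnable version of that pseudocode in Java, assuming a "valid word" simply means a word present in a dictionary set (names are illustrative):

```java
import java.util.Set;

public class WordSplits {
    // Prints every way to split word into two parts that are both in dict.
    static void printSplits(String word, Set<String> dict) {
        for (int i = 1; i < word.length(); i++) { // split points 1..len-1
            String left = word.substring(0, i);
            String right = word.substring(i);
            if (dict.contains(left) && dict.contains(right)) {
                System.out.println(left + " + " + right);
            }
        }
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("cat", "cats", "dog", "sdog");
        printSplits("catsdog", dict); // cat + sdog, cats + dog
    }
}
```

With a hash set, each of the O(n) splits costs O(n) for substring hashing, so the whole check is O(n^2) in the word length (with O(n) dictionary lookups).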
If you are looking for a nice answer then please let us know your definition of a valid word.
Assuming a word is a string defined over an alphabet and of length greater than zero, you can use suffix trees.
Below is a simplified algorithm that takes just O(n) splits, each needing one dictionary lookup:
Convert the word into a character array.
Traverse the length of the array, and for each i take the two strings (0 to i) and (i+1 to length of the array - 1).
Do remember to cover the base conditions, like length greater than zero.
The total number of different ways to do it can be greater than one if and only if this condition holds:
one of the two words must be a repetition (multiple) of the other, e.g. "abcd" and "abcdabcd".
Using these two words you can form the string "abcdabcdabcdabcd" in several different ways.
So first check this condition.
Then check whether the string can be written from the two words in any way; then simple math should give you the answer.

Find all (English word) substrings of a given string

This is an interview question: find all (English word) substrings of a given string (e.g. "every" contains every, ever, very).
Obviously, we can loop over all substrings and check each one against an English dictionary, organized as a set. I believe the dictionary is small enough to fit in RAM. How should the dictionary be organized? As far as I remember, the original spell command loaded the words file into a bitmap representing the set of the words' hash values. I would start from that.
Another solution is a trie built from the dictionary. Using the trie, we can loop over the string's characters and check the trie for each of them. I guess the complexity of this solution would be the same in the worst case (O(n^2)).
Does it make sense? Would you suggest other solutions?
Use the Aho-Corasick string matching algorithm, which "constructs a finite state machine that resembles a trie with additional links between the various internal nodes."
But all things considered, "build a trie from the English dictionary and do a simultaneous search on it for all suffixes of the given string" should be pretty good for an interview.
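A sketch of that trie-plus-suffixes idea in Java, with a toy dictionary standing in for the real one:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class WordSubstrings {
    // Plain trie node; isWord marks the end of a dictionary word.
    static class Node {
        Map<Character, Node> next = new HashMap<>();
        boolean isWord;
    }

    static Node buildTrie(Set<String> dict) {
        Node root = new Node();
        for (String w : dict) {
            Node cur = root;
            for (char c : w.toCharArray())
                cur = cur.next.computeIfAbsent(c, k -> new Node());
            cur.isWord = true;
        }
        return root;
    }

    // For each suffix of s, walk down the trie and report every word found.
    static void findWords(String s, Node root) {
        for (int start = 0; start < s.length(); start++) {
            Node cur = root;
            for (int i = start; i < s.length(); i++) {
                cur = cur.next.get(s.charAt(i));
                if (cur == null) break;
                if (cur.isWord) System.out.println(s.substring(start, i + 1));
            }
        }
    }

    public static void main(String[] args) {
        Node trie = buildTrie(Set.of("every", "ever", "very", "her"));
        findWords("every", trie); // ever, every, very
    }
}
```

Each walk stops as soon as the trie has no matching branch, so in practice it rarely scans the full remaining suffix.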
I'm not sure a trie will easily match sub-words that begin in the middle of the string.
Another solution with a similar concept is to use a state machine or a regular expression.
The regular expression is just word1|word2|...
I'm not sure whether standard regular expression engines can handle an expression covering the whole English language, but it shouldn't be hard to build the equivalent state machine given the dictionary.
Once the regular expression is compiled / the state machine is built, the complexity of analyzing a specific string is O(n).
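A small Java illustration of the regex variant with a toy dictionary; note the caveat in the comments about overlapping matches:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexWordScan {
    public static void main(String[] args) {
        // word1|word2|... as described above; alternation is ordered, so
        // longer words are listed first to win ties at the same position
        List<String> dict = List.of("every", "ever", "very");
        Pattern p = Pattern.compile(String.join("|", dict));
        Matcher m = p.matcher("every");
        int from = 0;
        while (m.find(from)) {             // restart just past each match
            System.out.println(m.group()); // start, to catch overlaps
            from = m.start() + 1;
        }
        // Prints: every, very. "ever" is shadowed by "every" at position 0;
        // reporting every overlapping match is exactly what Aho-Corasick's
        // extra links handle.
    }
}
```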
The first solution can be refined to use a different hash map for each word length (to reduce collisions), but other than that I can't think of anything significantly better.

Finding partial substrings within a string

I have two strings which must be compared for similarity. The algorithm must be designed to find the maximal similarity. In this instance, the ordering matters, but intervening (or missing) characters do not. Edit distance cannot be used in this case for various reasons.
The situation is basically as follows:
string 1: ABCDEFG
string 2: AFENBCDGRDLFG
the resulting algorithm would find the substrings A, BCD, FG
I currently have a recursive solution, but because this must be run on massive amounts of data, any improvements would be greatly appreciated.
Looking at your sole example, it looks like you want to find the longest common subsequence.
Take a look at LCS
"Is it just me, or is this NP-hard?" – David Titarenco (from a comment)
If you want the LCS of an arbitrary number of strings, it's NP-hard. But if the number of input strings is constant (as in this case, 2), it can be done in polynomial time.
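For the two-string case, a standard LCS dynamic-programming sketch in Java (this is the classic algorithm, not the OP's recursive solution):

```java
public class Lcs {
    // Classic O(n*m) dynamic program: dp[i][j] is the LCS length of
    // the first i chars of a and the first j chars of b.
    static String lcs(String a, String b) {
        int n = a.length(), m = b.length();
        int[][] dp = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++)
                dp[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
        // Walk back through the table to recover one optimal subsequence.
        StringBuilder sb = new StringBuilder();
        for (int i = n, j = m; i > 0 && j > 0; ) {
            if (a.charAt(i - 1) == b.charAt(j - 1)) {
                sb.append(a.charAt(i - 1)); i--; j--;
            } else if (dp[i - 1][j] >= dp[i][j - 1]) i--;
            else j--;
        }
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(lcs("ABCDEFG", "AFENBCDGRDLFG")); // ABCDFG
    }
}
```

The common segments A, BCD, FG from the question are exactly the runs of this common subsequence.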
