I am trying to find out whether an algorithm already exists that can do the following:
Given a list of strings:
{"56B99Z", "78K80F", "50B49J", "28F11F"}
And given an input string of:
"??B?9?"
Then the algorithm should output:
{"56B99Z", "50B49J"}
Where ? are unknown characters.
I think some sort of trie-tree with additional links between nodes could work, but I don't want to re-invent the wheel if this has been done before.
Your question is really vague and you need to be more specific. Do the strings all have the same size? If so, you can just check, for each candidate string, the positions that aren't question marks in your search string. Anyway, if you are looking for string matching algorithms, I suggest you read about the KMP algorithm, which has linear complexity for the given input => https://en.m.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
Use a regular expression that matches \w at positions 1, 2, 4 and 6.
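A minimal Python sketch of both suggestions above, assuming all strings have a fixed length and that ? matches exactly one character. The function names are made up for illustration, and the regex version uses "." (any character) rather than \w.

```python
import re

def match_wildcard(pattern, candidates, wildcard="?"):
    """Keep candidates of the same length that agree at every non-wildcard position."""
    return [
        s for s in candidates
        if len(s) == len(pattern)
        and all(p == wildcard or p == c for p, c in zip(pattern, s))
    ]

def match_with_regex(pattern, candidates, wildcard="?"):
    """Same result, by translating the pattern into a regular expression."""
    # Each wildcard becomes "." (any single character); everything else is escaped.
    regex = re.compile("".join("." if ch == wildcard else re.escape(ch) for ch in pattern))
    return [s for s in candidates if regex.fullmatch(s)]

words = ["56B99Z", "78K80F", "50B49J", "28F11F"]
print(match_wildcard("??B?9?", words))    # ['56B99Z', '50B49J']
print(match_with_regex("??B?9?", words))  # ['56B99Z', '50B49J']
```

This is a linear scan over the list, so it costs O(number of strings x string length) per query. A trie or per-position index would only pay off for very large lists.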
I have a very interesting problem.
I have a set of strings and I would like to know how to best match a combination of these strings in another string against a maximization function.
An example would be. Say I have the set:
['aabbcaa', 'bbc']
and I have the string
'fgabbcdaabbcaaef'
and the possible matches for this are:
fga[bbc]daa[bbc]aaef
or
fga[bbc]d[aabbcaa]ef
Now, given a simple maximization function, I would say that fga[bbc]d[aabbcaa]ef is the winner due to the total number of characters matched. A different maximization function could give more weight to larger words replaced instead of total characters.
I would love to know if someone could point me to some algos on how to do this. What I’m stumped by is after I find a set of potential matches I’m not sure how to maximize the set of words to choose in an efficient way.
The dictionary, the words of the dictionary, and the word that’s being matched against, could be of any size.
Would appreciate any help I could get with this. Thank you!
Found the answer and it works nicely. Pseudocode is:
Loop over the set and find everywhere the set strings match in the target string. For each match, store the start_index and end_index, and give it a score. I currently use the length of the string.
Then run all the matches found through the "Weighted Interval Scheduling" algorithm to find the optimal set of non-overlapping matches:
https://courses.cs.washington.edu/courses/cse521/13wi/slides/06dp-sched.pdf
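The pseudocode above could look roughly like this in Python. This is a sketch, not the poster's actual code: the function names are invented, the score is the match length as described, and end indices are treated as exclusive.

```python
from bisect import bisect_right

def find_matches(words, text):
    """Return (start, end, score) for every occurrence of every word; end is exclusive."""
    matches = []
    for w in words:
        if not w:
            continue
        i = text.find(w)
        while i != -1:
            matches.append((i, i + len(w), len(w)))  # score = length of the word
            i = text.find(w, i + 1)                  # allow overlapping occurrences
    return matches

def best_non_overlapping(matches):
    """Weighted interval scheduling: non-overlapping matches with maximum total score."""
    matches = sorted(matches, key=lambda m: m[1])    # sort by end index
    ends = [m[1] for m in matches]
    best = [0] * (len(matches) + 1)   # best[k] = best score using the first k matches
    take = [False] * len(matches)
    prev = [0] * len(matches)
    for k, (start, end, score) in enumerate(matches):
        # Number of earlier matches that end at or before this one starts.
        p = bisect_right(ends, start, 0, k)
        prev[k] = p
        if best[p] + score > best[k]:
            best[k + 1] = best[p] + score
            take[k] = True
        else:
            best[k + 1] = best[k]
    # Walk back through the table to recover the chosen matches.
    chosen = []
    k = len(matches)
    while k > 0:
        if take[k - 1]:
            chosen.append(matches[k - 1])
            k = prev[k - 1]
        else:
            k -= 1
    chosen.reverse()
    return best[-1], chosen

text = "fgabbcdaabbcaaef"
print(best_non_overlapping(find_matches(["aabbcaa", "bbc"], text)))
# (10, [(3, 6, 3), (7, 14, 7)])
```

Changing the maximization function only means changing the score stored in find_matches; the scheduling step stays the same. For a large dictionary, the naive find loop could be swapped for Aho-Corasick.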
The general idea is to use two nested loops: take every character of string1 and compare it to every character of string2; if all of them are found, string1 is included in string2.
So we need to loop over all the characters of string1 and compare each one against all the characters of string2, which gives O(n^2) running time.
The interviewer said that this is not a good idea.
I have been thinking about it since, but I cannot come up with an idea that avoids the two loops.
Perhaps I could first take all the characters of string1, convert them to their ASCII codes, and build those numbers into a tree, so that the comparison against string2 becomes very fast.
Or does anyone have a better idea?
For example, string1 is abc and string2 is cbattt; that means every character of string1 is included in string2.
It is not a substring search.
As iccthedral says, Boyer-Moore is probably what the interviewer was looking for.
Searching a text for a given pattern (pattern matching) is a well-known problem. Known solutions:
KMP
witness table
Boyer-Moore
suffix tree
The solutions differ in some aspects, such as whether they can be generalized to 2D pattern matching or beyond, whether they need pre-processing, whether they can be generalized to an unbounded alphabet, running time, etc.
EDIT:
If you just want to know whether all the letters of some string appear in some other string, why not use a table the size of your alphabet, indicating whether a given char can be found in the string? If the alphabet is unbounded or extremely large (more than O(1)), use a hash table.
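A short Python sketch of the table idea, with illustrative function names. The first version assumes a 128-entry ASCII alphabet, the second uses a hash set for arbitrary characters. Both ignore how many times a character occurs.

```python
def contains_all_chars_ascii(needle, haystack):
    """Table lookup; assumes every character is 7-bit ASCII (code point < 128)."""
    table = [False] * 128
    for ch in haystack:           # one pass over haystack fills the table
        table[ord(ch)] = True
    return all(table[ord(ch)] for ch in needle)   # one pass over needle

def contains_all_chars(needle, haystack):
    """Hash-based variant for a large or unbounded alphabet."""
    seen = set(haystack)
    return all(ch in seen for ch in needle)

print(contains_all_chars_ascii("abc", "cbattt"))  # True
print(contains_all_chars("abd", "cbattt"))        # False
```

Either way the work is one pass over each string, so O(n + m) instead of the O(n * m) of two nested loops.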
I'm looking for an algorithm (preferably with a Java implementation) for merging strings.
My problem is as follows:
Suppose I have an array/list of strings {"myString1" , "my String1" , "my-String-1" ... }
I'd like the algorithm to point out that there is a very high probability that
all of these values denote "myString1",
so that I can compact my list.
Maybe this can be done with KMP, or maybe there is something more suitable.
Thanks.
I think that edit distance is a good heuristic for merging strings.
EDIT:
You can modify the edit distance algorithm:
You can give a different value to d(-,c) for each character c.
So in the following example: "String1","String2", you can "punish" the score by letting d(1,2) be high, in contrast to "String 1","String1", which won't be punished because the score will be d(-,' ').
Alternatively, approximate string matching could be of some use. I don't believe KMP would suit the purpose, because it is designed for exact substring matching.
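Here is one way the modified edit distance could be sketched, in Python for brevity (the question asked for Java, and the same DP ports directly). The cost values are hypothetical and only for illustration: separators are nearly free to insert or delete, while digits are expensive to change.

```python
def weighted_edit_distance(a, b, indel_cost, sub_cost):
    """Levenshtein-style DP where insert/delete/substitute costs depend on the characters."""
    prev = [0.0]
    for ch in b:
        prev.append(prev[-1] + indel_cost(ch))        # build b from the empty string
    for ca in a:
        cur = [prev[0] + indel_cost(ca)]              # delete everything seen so far in a
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + indel_cost(ca),             # delete ca, i.e. d(ca,-)
                cur[j - 1] + indel_cost(cb),          # insert cb, i.e. d(-,cb)
                prev[j - 1] + (0.0 if ca == cb else sub_cost(ca, cb)),
            ))
        prev = cur
    return prev[-1]

def indel_cost(ch):
    # Hypothetical weights: d(-,' ') is tiny, digits cost more than letters.
    if ch in " -_":
        return 0.1
    return 3.0 if ch.isdigit() else 1.0

def sub_cost(x, y):
    # Hypothetical weights: replacing one digit with another, such as d(1,2), is punished.
    return 5.0 if x.isdigit() and y.isdigit() else 1.0

print(weighted_edit_distance("myString1", "my String1", indel_cost, sub_cost))
print(weighted_edit_distance("String1", "String2", indel_cost, sub_cost))
```

With these weights the first pair comes out close to 0 and the second pair far apart, so a threshold on the distance could decide which strings to merge. Note that the digit insert/delete cost has to be high as well, otherwise a delete plus an insert would be cheaper than the punished substitution.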
This is an interview question: Find all substrings of a given string that are English words (e.g. every = every, ever, very).
Obviously, we can loop over all substrings and check each one against an English dictionary, organized as a set. I believe the dictionary is small enough to fit in RAM. How should the dictionary be organized? As far as I remember, the original spell command loaded the words file into a bitmap that represented a set of word hash values. I would start from that.
Another solution is a trie built from the dictionary. Using the trie, we can loop over all string characters and check the trie for each character. I guess the complexity of this solution would be the same in the worst case (O(n^2)).
Does it make sense? Would you suggest other solutions?
Look at the Aho-Corasick string matching algorithm, which "constructs a finite state machine that resembles a trie with additional links between the various internal nodes."
But all things considered, "build a trie from the English dictionary and do a simultaneous search on it for all suffixes of the given string" should be pretty good for an interview.
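A rough Python sketch of the trie approach (not Aho-Corasick itself): build a trie from the dictionary, then walk it from every start position of the string. The three-word dictionary and the function names are just for illustration.

```python
END = object()  # end-of-word marker; cannot collide with any character key

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def words_in(text, trie):
    """Return every substring of text that is a dictionary word, ordered by start position."""
    found = []
    for start in range(len(text)):        # one trie walk per suffix of text
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:              # no dictionary word continues this way
                break
            if END in node:
                found.append(text[start:end + 1])
    return found

trie = build_trie(["every", "ever", "very"])
print(words_in("every", trie))  # ['ever', 'every', 'very']
```

Each walk stops as soon as no word can continue, so the inner loop is bounded by the longest dictionary word rather than by the length of the string. Aho-Corasick removes the restart per position by adding failure links to the same trie.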
I'm not sure a Trie will work easily to match sub words that begin in the middle of the string.
Another solution with a similar concept is to use a state machine or regular expression.
The regular expression is just word1|word2|....
I'm not sure if standard regular expression engines can handle an expression covering the whole English language, but it shouldn't be hard to build the equivalent state machine given the dictionary.
Once the regular expression is compiled / the state machine is built, the complexity of analyzing a specific string is O(n).
The first solution can be refined to have a different hash map for each word length (to reduce collisions) but other than that I can't think of anything significantly better.
Can anyone point me to the best algorithm for searching for a substring in another string?
Or for searching for a char array in another char array?
The best from what point of view? Knuth-Morris-Pratt is a good one. You can find more of them discussed on the Wikipedia entry for string searching algorithms.
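For reference, a compact Python sketch of Knuth-Morris-Pratt. The function name kmp_search is just for illustration; it returns every start index where the pattern occurs.

```python
def kmp_search(text, pattern):
    """Return the start indices of every occurrence of pattern in text."""
    if not pattern:
        return []
    # fail[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]       # fall back without moving backwards in text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]       # keep going to find overlapping matches
    return hits

print(kmp_search("fgabbcdaabbcaaef", "bbc"))  # [3, 9]
```

The text pointer never moves backwards, which is where the O(n + m) bound comes from. In practice, a language's built-in substring search is often fast enough and worth trying first.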
It depends on what types of searching you are doing. Specific substring over a specific string? Specific substring over many different strings? Many different substrings over a specific string?
Here's a popular algorithm for a specific substring over many different strings.
Boyer-Moore algorithm: http://en.wikipedia.org/wiki/Boyer–Moore_string_search_algorithm
This strstr() implementation seems pretty slick.