Can anyone point me to the best algorithm for searching for a substring within another string,
or for a char array within another char array?
The best from what point of view? Knuth-Morris-Pratt is a good one. You can find more of them discussed on the Wikipedia entry for string searching algorithms.
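For reference, a minimal KMP sketch in Java, assuming you just want the index of the first match (like String.indexOf):

```java
// Minimal Knuth-Morris-Pratt sketch: returns the index of the first
// occurrence of pattern in text, or -1 if absent. Runs in O(n + m).
static int kmpSearch(String text, String pattern) {
    if (pattern.isEmpty()) return 0;
    // failure[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of it.
    int[] failure = new int[pattern.length()];
    for (int i = 1, k = 0; i < pattern.length(); i++) {
        while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) k = failure[k - 1];
        if (pattern.charAt(i) == pattern.charAt(k)) k++;
        failure[i] = k;
    }
    for (int i = 0, k = 0; i < text.length(); i++) {
        while (k > 0 && text.charAt(i) != pattern.charAt(k)) k = failure[k - 1];
        if (text.charAt(i) == pattern.charAt(k)) k++;
        if (k == pattern.length()) return i - k + 1;   // full pattern matched
    }
    return -1;
}
```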
It depends on what types of searching you are doing. Specific substring over a specific string? Specific substring over many different strings? Many different substrings over a specific string?
Here's a popular algorithm for a specific substring over many different strings.
Boyer-Moore algorithm: http://en.wikipedia.org/wiki/Boyer–Moore_string_search_algorithm
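A full Boyer-Moore implementation (bad-character plus good-suffix rules) is fairly long; below is a sketch of the simplified Boyer-Moore-Horspool variant, which keeps only the bad-character skip table. Note this is the Horspool variant, not full Boyer-Moore:

```java
// Boyer-Moore-Horspool: scan the pattern right-to-left against the text,
// and on a mismatch skip ahead based on the text character aligned with
// the last pattern position. Returns the index of the first match or -1.
static int horspoolSearch(String text, String pattern) {
    int m = pattern.length(), n = text.length();
    if (m == 0) return 0;
    // Skip table over the full char range (a Map would be leaner).
    int[] shift = new int[Character.MAX_VALUE + 1];
    java.util.Arrays.fill(shift, m);
    for (int i = 0; i < m - 1; i++) shift[pattern.charAt(i)] = m - 1 - i;
    int pos = 0;
    while (pos <= n - m) {
        int j = m - 1;
        while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) j--;
        if (j < 0) return pos;                       // matched all of pattern
        pos += shift[text.charAt(pos + m - 1)];      // bad-character skip
    }
    return -1;
}
```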
This strstr() implementation seems pretty slick.
Related
I am trying to find out if there is an algorithm that exists that is capable of the following:
Given a list of strings:
{"56B99Z", "78K80F", "50B49J", "28F11F"}
And given an input string of:
"??B?9?"
Then the algorithm should output:
{"56B99Z", "50B49J"}
where each ? is an unknown character.
I think some sort of trie with additional links between nodes could work, but I don't want to reinvent the wheel if this has been done before.
Your question is really vague and you need to be more specific: do the strings all have the same length? If so, you can just compare the positions that aren't question marks against each candidate string. If you are looking for string-matching algorithms in general, I suggest reading about the KMP algorithm, which has linear complexity for the given input: https://en.m.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
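If they do all have the same length, that check is only a few lines; a minimal sketch:

```java
// Sketch of the fixed-position comparison, assuming pattern and
// candidates all have the same length (as asked above).
static boolean matchesPattern(String candidate, String pattern) {
    if (candidate.length() != pattern.length()) return false;
    for (int i = 0; i < pattern.length(); i++) {
        // '?' matches anything; every other position must match exactly.
        if (pattern.charAt(i) != '?' && pattern.charAt(i) != candidate.charAt(i)) {
            return false;
        }
    }
    return true;
}
```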
Use a regular expression that matches the 1st, 2nd, 4th, and 6th positions (the ? positions) with \w.
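For example, building the pattern from the input string (treating every non-? character as a literal; using \w assumes the unknowns are word characters, per the answer above):

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class WildcardFilter {
    public static void main(String[] args) {
        List<String> candidates = List.of("56B99Z", "78K80F", "50B49J", "28F11F");
        String input = "??B?9?";
        // Turn each '?' into \w; quote every other character as a literal.
        StringBuilder regex = new StringBuilder();
        for (char c : input.toCharArray()) {
            regex.append(c == '?' ? "\\w" : Pattern.quote(String.valueOf(c)));
        }
        Pattern p = Pattern.compile(regex.toString());
        List<String> matches = candidates.stream()
                .filter(s -> p.matcher(s).matches())
                .collect(Collectors.toList());
        System.out.println(matches); // [56B99Z, 50B49J]
    }
}
```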
I know a similar question has been asked (Prefix vs Suffix Trie in String Matching) but the accepted answer did not help me understand my query.
The question is: What advantage does a suffix trie have over a prefix trie?
A suffix trie lets you pick any starting position in the string and see how far the match from there extends. This is probably similar to the accepted answer on the original question, but that's the best I can do.
You can try looking at the Aho-Corasick algorithm. It is a finite state machine: essentially a prefix trie with failure links that point from each node to the node for the longest suffix that also occurs in the trie. The failure links are built with a breadth-first search of the trie. AC is used for fast multiple-pattern matching.
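A compact sketch of that structure in Java (map-based trie nodes, failure links filled in by a breadth-first pass):

```java
import java.util.*;

class AhoCorasick {
    static class Node {
        Map<Character, Node> next = new HashMap<>();
        Node fail;                                   // longest-suffix link
        List<String> outputs = new ArrayList<>();    // patterns ending here
    }

    private final Node root = new Node();

    void addPattern(String p) {
        Node cur = root;
        for (char c : p.toCharArray()) {
            cur = cur.next.computeIfAbsent(c, k -> new Node());
        }
        cur.outputs.add(p);
    }

    // BFS over the trie: each node's failure link points to the node for
    // the longest proper suffix of its path that is also in the trie.
    void build() {
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node cur = queue.poll();
            for (Map.Entry<Character, Node> e : cur.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = cur.fail;
                while (f != null && !f.next.containsKey(c)) f = f.fail;
                child.fail = (f == null) ? root : f.next.get(c);
                child.outputs.addAll(child.fail.outputs); // inherit matches
                queue.add(child);
            }
        }
    }

    // One O(n) pass over the text reports every occurrence of every pattern.
    List<String> search(String text) {
        List<String> hits = new ArrayList<>();
        Node cur = root;
        for (char c : text.toCharArray()) {
            while (cur != root && !cur.next.containsKey(c)) cur = cur.fail;
            cur = cur.next.getOrDefault(c, root);
            hits.addAll(cur.outputs);
        }
        return hits;
    }
}
```

Usage is just addPattern for each word, then build(), then search(text).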
Is there any Go module for doing fuzzy string matching? If I have an array of strings, I want to check whether a given string fuzzily matches any of the elements in the array.
Please help.
Thank you.
You are probably looking for a library that implements the Levenshtein distance algorithm.
This is also the algorithm used by Elasticsearch for fuzzy searching.
See here for a list of Go packages.
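If you want to see what those packages compute, the distance itself is a short dynamic program; a minimal sketch (shown in Java here, but a Go port is mechanical):

```java
// Classic Levenshtein distance: the minimum number of single-character
// insertions, deletions, and substitutions needed to turn a into b.
// Uses two rolling rows, so memory is O(min side) rather than O(n*m).
static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] cur = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;
    for (int i = 1; i <= a.length(); i++) {
        cur[0] = i;
        for (int j = 1; j <= b.length(); j++) {
            int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
            cur[j] = Math.min(prev[j - 1] + cost,              // substitute/match
                     Math.min(prev[j] + 1, cur[j - 1] + 1));   // delete/insert
        }
        int[] tmp = prev; prev = cur; cur = tmp;
    }
    return prev[b.length()];
}
```

For matching against an array, you would accept any element whose distance to the query falls below a threshold you choose.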
I'm looking for an algorithm (preferably with a Java implementation) for merging strings.
My problem is as follows:
Suppose I have an array/list of strings {"myString1", "my String1", "my-String-1", ...}.
I'd like the algorithm to point out that there is a very high probability that
all of these values denote "myString1",
so I would like to compact my list.
Maybe this can be done with KMP, or maybe there is something more suitable.
Thanks.
I think that edit distance is a good heuristic for merging strings.
EDIT:
You can modify the edit distance algorithm:
You can give a different cost d(-, c) for inserting or deleting each character c.
So in the example "String1" vs. "String2", you can "punish" the score by letting the substitution cost d(1, 2) be high, while "String 1" vs. "String1" won't be punished, because the only cost incurred is d(-, ' ') for the extra space.
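A sketch of that modified edit distance in Java; the concrete costs below (cheap gaps for spaces and hyphens, expensive digit substitutions) are made-up illustrations, not canonical values:

```java
// Edit distance with per-character costs, so "my String1" and
// "my-String-1" land close to "myString1" while "String2" stays far away.
static int editDistance(String a, String b) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 1; i <= a.length(); i++) d[i][0] = d[i - 1][0] + gapCost(a.charAt(i - 1));
    for (int j = 1; j <= b.length(); j++) d[0][j] = d[0][j - 1] + gapCost(b.charAt(j - 1));
    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            d[i][j] = Math.min(
                d[i - 1][j - 1] + substCost(a.charAt(i - 1), b.charAt(j - 1)),
                Math.min(d[i - 1][j] + gapCost(a.charAt(i - 1)),
                         d[i][j - 1] + gapCost(b.charAt(j - 1))));
        }
    }
    return d[a.length()][b.length()];
}

// d(-, c): deleting or inserting separator characters is nearly free.
static int gapCost(char c) {
    return (c == ' ' || c == '-') ? 1 : 10;
}

// d(c1, c2): identical characters cost nothing; differing digits are
// punished hard, since they likely denote genuinely different values.
static int substCost(char c1, char c2) {
    if (c1 == c2) return 0;
    return (Character.isDigit(c1) && Character.isDigit(c2)) ? 50 : 10;
}
```

You would then merge any pair of strings whose distance falls below a threshold.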
Alternatively, approximate string matching could be of some use. I don't believe KMP would suit the purpose, because it is designed for exact substring matching.
This is an interview question: find all substrings of a given string that are English words (e.g., "every" contains every, ever, very).
Obviously, we can loop over all substrings and check each one against an English dictionary, organized as a set. I believe the dictionary is small enough to fit in RAM. How should the dictionary be organized? As far as I remember, the original spell command loaded the words file into a bitmap representing a set of word hash values. I would start from that.
Another solution is a trie built from the dictionary. Using the trie, we can loop over all starting positions in the string and walk the trie from each one. I guess the complexity of this solution would be the same in the worst case, O(n^2).
Does it make sense? Would you suggest other solutions?
Use the Aho-Corasick string matching algorithm, which "constructs a finite state machine that resembles a trie with additional links between the various internal nodes."
But all things considered, "build a trie from the English dictionary and do a simultaneous search on it for all suffixes of the given string" should be pretty good for an interview.
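A sketch of that approach (a lowercase a-z trie; restarting the walk at every position covers all suffixes):

```java
import java.util.*;

public class WordSubstrings {
    // Minimal trie node over lowercase letters.
    static class Node {
        Node[] next = new Node[26];
        boolean isWord;
    }

    static Node buildTrie(Collection<String> dictionary) {
        Node root = new Node();
        for (String word : dictionary) {
            Node cur = root;
            for (char c : word.toCharArray()) {
                int i = c - 'a';
                if (cur.next[i] == null) cur.next[i] = new Node();
                cur = cur.next[i];
            }
            cur.isWord = true;
        }
        return root;
    }

    // Walk the trie from every starting position (i.e., for every suffix),
    // collecting each dictionary word found along the way.
    static Set<String> wordSubstrings(String s, Node root) {
        Set<String> found = new LinkedHashSet<>();
        for (int start = 0; start < s.length(); start++) {
            Node cur = root;
            for (int end = start; end < s.length(); end++) {
                cur = cur.next[s.charAt(end) - 'a'];
                if (cur == null) break;              // no word continues here
                if (cur.isWord) found.add(s.substring(start, end + 1));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Node trie = buildTrie(List.of("every", "ever", "very", "eve"));
        System.out.println(wordSubstrings("every", trie)); // [eve, ever, every, very]
    }
}
```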
I'm not sure a trie will easily match subwords that begin in the middle of the string.
Another solution with a similar concept is to use a state machine or regular expression.
The regular expression is just word1|word2|...
I'm not sure if standard regular expression engines can handle an expression covering the whole English language, but it shouldn't be hard to build the equivalent state machine given the dictionary.
Once the regular expression is compiled / the state machine is built, the complexity of analyzing a specific string is O(n).
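A tiny demo of the alternation idea. One caveat: Java's find() reports non-overlapping matches and | is first-alternative-wins, so recovering every overlapping subword would require re-scanning from each position, or building a proper automaton as suggested above:

```java
import java.util.regex.*;

public class RegexSubwords {
    public static void main(String[] args) {
        // Dictionary joined into one alternation; here just four words.
        Pattern dict = Pattern.compile("every|ever|very|eve");
        Matcher m = dict.matcher("every");
        while (m.find()) {
            // Prints "every at index 0"; the overlapping eve/ever/very
            // are missed, illustrating the caveat above.
            System.out.println(m.group() + " at index " + m.start());
        }
    }
}
```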
The first solution can be refined to use a different hash map for each word length (to reduce collisions), but other than that I can't think of anything significantly better.
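A sketch of that refinement, with the dictionary bucketed by word length and substrings generated only up to the longest dictionary word:

```java
import java.util.*;

class SubwordFinder {
    static List<String> wordSubstrings(String s, Collection<String> dictionary) {
        // Bucket the dictionary by word length, and remember the longest
        // word so we never generate substrings that cannot match.
        Map<Integer, Set<String>> byLength = new HashMap<>();
        int maxLen = 0;
        for (String w : dictionary) {
            byLength.computeIfAbsent(w.length(), k -> new HashSet<>()).add(w);
            maxLen = Math.max(maxLen, w.length());
        }
        List<String> found = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            for (int len = 1; len <= maxLen && i + len <= s.length(); len++) {
                Set<String> bucket = byLength.get(len);
                if (bucket != null && bucket.contains(s.substring(i, i + len))) {
                    found.add(s.substring(i, i + len));
                }
            }
        }
        return found;
    }
}
```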