substring search - algorithm

I've recently been trying to investigate the various ways to do substring searches and have stumbled upon the following article http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_string_search_algorithm. I was wondering if there are any other common/efficient algorithms out there that anyone can suggest/show?
Thanks much

The most obvious would be Boyer-Moore or some variant such as Boyer-Moore-Horspool. For some situations, it's also worth considering Knuth-Morris-Pratt.

KMP algorithim is efficient in substring search if the text is small.
complexity O(n).
for easy understanding
http://jakeboxer.com/blog/2009/12/13/the-knuth-morris-pratt-algorithm-in-my-own-words/

In my opinion by far the most intuitave and easy to understand is Robin Karp Algorithm
Here is a simple python implementation
def computeHash(p):
return sum ([ value*10**index for (index,value) in enumerate(p[::-1]) ])
def getPosition(string,subString):
kh=computeHash(subString)
lk=len(subString)
ans=[]
for i in enumerate(string):
if len(string[i[0]:i[0]+lk])<lk:
break
else:
if computeHash(string[i[0]:i[0]+lk])==kh:
ans.append((i[0],i[0]+lk))
return ans
def main():
s="hello world" #string
ss="wor" #sub string
print getPosition(map(ord,s),map(ord,ss))
if __name__=="__main__":
main()

Related

Concise (one line?) binary search in Raku

Many common operations aren't built in to Raku because they can be concisely expressed with a combination of (meta) operators and/or functions. It feels like binary search of a sorted array ought to be expressable in that way (maybe with .rotor? or …?) but I haven't found a particularly good way to do so.
For example, the best I've come up with for searching a sorted array of Pairs is:
sub binary-search(#a, $target) {
when +#a ≤ 1 { #a[0].key == $target ?? #a[0] !! Empty }
&?BLOCK(#a[0..^*/2, */2..*][#a[*/2].key ≤ $target], $target)
}
That's not awful, but I can't shake the feeling that it could be an awfully lot better (both in terms of concision and readability). Can anyone see what elegant combo of operations I might be missing?
Here's one approach that technically meets my requirements (in that the function body it fits on a single normal-length line). [But see the edit below for an improved version.]
sub binary-search(#a, \i is copy = my $=0, :target($t)) {
for +#a/2, */2 … *≤1 {#a[i] cmp $t ?? |() !! return #a[i] with i -= $_ × (#a[i] cmp $t)}
}
# example usage (now slightly different, because it returns the index)
my #a = ((^20 .pick(*)) Z=> 'a'..*).sort;
say #a[binary-search(#a».key, :target(17))];
say #a[binary-search(#a».key, :target(1))];
I'm still not super happy with this code, because it loses a bit of readability – I still feel like there could/should be a concise way to do a binary sort that also clearly expresses the underlying logic. Using a 3-way comparison feels like it's on that track, but still isn't quite there.
[edit: After a bit more thought, I came up with an more readable version of the above using reduce.
sub binary-search(#a, :target(:$t)) {
(#a/2, */2 … *≤.5).reduce({ $^i - $^pt×(#a[$^i] cmp $t || return #a[$^i]) }) && Nil
}
In English, that reads as: for a sequence starting at the midpoint of the array and dropping by 1/2, move your index $^i by the value of the next item in the sequence – with the direction of the move determined by whether the item at that index is greater or lesser than the target. Continue until you find the target (in which case, return it) or you finish the sequence (which means the target wasn't present; return Nil)]

Need Help in solving Binary Search

I am trying to solve another Question
I have implemented the binary search function required to solve this problem but i am unable to return the correct value which would help me in solving this question.Here's my code.
Please help me in solving this question.
def BinarySearch(arr,low,high,search):
while(low<=high):
middle=(low+high)/2
int_mid=int(middle)
print(int_mid)
if arr[int_mid]==search:
return int_mid
elif arr[int_mid]>search:
low=int_mid+1
elif arr[int_mid]<search:
high=int_mid-1
return arr[int_mid]
tc=int(input())
while(tc>0):
size=int(input())
A=list(map(int,input().split()))
B=list(map(int,input().split()))
monkiness=[]
for i in range(len(A)):
for j in range(len(B)):
if (j>=i):
y=BinarySearch(B,0,len(B),B[j])
z=BinarySearch(A,0,len(A),A[i])
print(y,z)
if y>=z:
m=j-i
monkiness.append(m)
if len(monkiness)==0:
print(0)
else:
print(monkiness)
maxiumum=max(monkiness)
print(maxiumum)
tc-=1
You don't need binary search here ('here' means code you provided, not problem solution)
Even if you required binary search, BinarySearch is for non-decreasing array, but you have non-increasing array.
The goal is to find solution, that runs in less, than N^2 time, by exploiting the fact, that arrays are sorted.

Creating an anagram with a string of characters using Backtracking

I was looking at this solution here
Finding anagrams for a given word
Also if it helps here is the interview question
An anagram is a rearrangement of the letters in a given string into a sequence
of dictionary words, like Steven Skiena into Vainest Knees. Propose an algorithm
to construct all the anagrams of a given string.
What is the time complexity of this? Is this a valid backtracking algoirthm or is this more taking advantage of space to create a solution?
I am looking at the solution from Daniel, and that looks like it may take up a lot of space with a tree?
My real question though is, if I was asked this on an interview, what time of explanation do I give? A pseudo code answer or a real program?
A real solution probably takes too long to write in an interview. Here is how you can implement the back-tracking part. It's just pseudo-code:
function getAnagrams(str) {
helper('', str.replace(' ', '').toLowerCase().sort());
}
function helper(prefix, str) {
foreach (word in dictionary) {
if (word.isShuffleOf(str)) {
solution.add(prefix == '' ? word : prefix + ' ' + word);
} else if (str.contains(word)) {
helper(prefix == '' ? word : prefix + ' ' + word, str.subtract(word));
}
}
}

Breaking a string apart into a sequence of words

I recently came across the following interview question:
Given an input string and a dictionary of words, implement a method that breaks up the input string into a space-separated string of dictionary words that a search engine might use for "Did you mean?" For example, an input of "applepie" should yield an output of "apple pie".
I can't seem to get an optimal solution as far as complexity is concerned. Does anyone have any suggestions on how to do this efficiently?
Looks like the question is exactly my interview problem, down to the example I used in the post at The Noisy Channel. Glad you liked the solution. Am quite sure you can't beat the O(n^2) dynamic programming / memoization solution I describe for worst-case performance.
You can do better in practice if your dictionary and input aren't pathological. For example, if you can identify in linear time the substrings of the input string are in the dictionary (e.g., with a trie) and if the number of such substrings is constant, then the overall time will be linear. Of course, that's a lot of assumptions, but real data is often much nicer than a pathological worst case.
There are also fun variations of the problem to make it harder, such as enumerating all valid segmentations, outputting a best segmentation based on some definition of best, handling a dictionary too large to fit in memory, and handling inexact segmentations (e.g., correcting spelling mistakes). Feel free to comment on my blog or otherwise contact me to follow up.
This link describes this problem as a perfect interview question and provides several methods to solve it. Essentially it involves recursive backtracking. At this level it would produce an O(2^n) complexity. An efficient solution using memoization could bring this problem down to O(n^2).
Using python, we can write two function, the first one segment returns the first segmentation of a piece of contiguous text into words given a dictionary or None if no such segmentation is found. Another function segment_all returns a list of all segmentations found. Worst case complexity is O(n**2) where n is the input string length in chars.
The solution presented here can be extended to include spelling corrections and bigram analysis to determine the most likely segmentation.
def memo(func):
'''
Applies simple memoization to a function
'''
cache = {}
def closure(*args):
if args in cache:
v = cache[args]
else:
v = func(*args)
cache[args] = v
return v
return closure
def segment(text, words):
'''
Return the first match that is the segmentation of 'text' into words
'''
#memo
def _segment(text):
if text in words: return text
for i in xrange(1, len(text)):
prefix, suffix = text[:i], text[i:]
segmented_suffix = _segment(suffix)
if prefix in words and segmented_suffix:
return '%s %s' % (prefix, segmented_suffix)
return None
return _segment(text)
def segment_all(text, words):
'''
Return a full list of matches that are the segmentation of 'text' into words
'''
#memo
def _segment(text):
matches = []
if text in words:
matches.append(text)
for i in xrange(1, len(text)):
prefix, suffix = text[:i], text[i:]
segmented_suffix_matches = _segment(suffix)
if prefix in words and len(segmented_suffix_matches):
for match in segmented_suffix_matches:
matches.append('%s %s' % (prefix, match))
return matches
return _segment(text)
if __name__ == "__main__":
string = 'cargocultscience'
words = set('car cargo go cult science'.split())
print segment(string, words)
# >>> car go cult science
print segment_all(string, words)
# >>> ['car go cult science', 'cargo cult science']
One option would be to store all valid English words in a trie. Once you've done this, you could start walking the trie from the root downward, following the letters in the string. Whenever you find a node that's marked as a word, you have two options:
Break the input at this point, or
Continue extending the word.
You can claim that you've found a match once you have broken the input up into a set of words that are all legal and have no remaining characters left. Since at each letter you either have one forced option (either you are building a word that isn't valid and should stop -or- you can keep extending the word) or two options (split or keep going), you could implement this function using exhaustive recursion:
PartitionWords(lettersLeft, wordSoFar, wordBreaks, trieNode):
// If you walked off the trie, this path fails.
if trieNode is null, return.
// If this trie node is a word, consider what happens if you split
// the word here.
if trieNode.isWord:
// If there is no input left, you're done and have a partition.
if lettersLeft is empty, output wordBreaks + wordSoFar and return
// Otherwise, try splitting here.
PartitinWords(lettersLeft, "", wordBreaks + wordSoFar, trie root)
// Otherwise, consume the next letter and continue:
PartitionWords(lettersLeft.substring(1), wordSoFar + lettersLeft[0],
wordBreaks, trieNode.child[lettersLeft[0])
In the pathologically worst case this will list all partitions of the string, which can t exponentially long. However, this only occurs if you can partition the string in a huge number of ways that all start with valid English words, and is unlikely to occur in practice. If the string has many partitions, we might spend a lot of time finding them, though. For example, consider the string "dotheredo." We can split this many ways:
do the redo
do the red o
doth ere do
dot here do
dot he red o
dot he redo
To avoid this, you might want to institute a limit of the number of answers you report, perhaps two or three.
Since we cut off the recursion when we walk off the trie, if we ever try a split that doesn't leave the remainder of the string valid, we will detect this pretty quickly.
Hope this helps!
import java.util.*;
class Position {
int indexTest,no;
Position(int indexTest,int no)
{
this.indexTest=indexTest;
this.no=no;
} } class RandomWordCombo {
static boolean isCombo(String[] dict,String test)
{
HashMap<String,ArrayList<String>> dic=new HashMap<String,ArrayList<String>>();
Stack<Position> pos=new Stack<Position>();
for(String each:dict)
{
if(dic.containsKey(""+each.charAt(0)))
{
//System.out.println("=========it is here");
ArrayList<String> temp=dic.get(""+each.charAt(0));
temp.add(each);
dic.put(""+each.charAt(0),temp);
}
else
{
ArrayList<String> temp=new ArrayList<String>();
temp.add(each);
dic.put(""+each.charAt(0),temp);
}
}
Iterator it = dic.entrySet().iterator();
while (it.hasNext()) {
Map.Entry pair = (Map.Entry)it.next();
System.out.println("key: "+pair.getKey());
for(String str:(ArrayList<String>)pair.getValue())
{
System.out.print(str);
}
}
pos.push(new Position(0,0));
while(!pos.isEmpty())
{
Position position=pos.pop();
System.out.println("position index: "+position.indexTest+" no: "+position.no);
if(dic.containsKey(""+test.charAt(position.indexTest)))
{
ArrayList<String> strings=dic.get(""+test.charAt(position.indexTest));
if(strings.size()>1&&position.no<strings.size()-1)
pos.push(new Position(position.indexTest,position.no+1));
String str=strings.get(position.no);
if(position.indexTest+str.length()==test.length())
return true;
pos.push(new Position(position.indexTest+str.length(),0));
}
}
return false;
}
public static void main(String[] st)
{
String[] dic={"world","hello","super","hell"};
System.out.println("is 'hellworld' a combo: "+isCombo(dic,"superman"));
} }
I have done similar problem. This solution gives true or false if given string is combination of dictionary words. It can be easily converted to get space-separated string. Its average complexity is O(n), where n: no of dictionary words in given string.

Recursively Searching in a Tree

I'm working with the Stanford Parser in ruby and want to search for all nodes of a tree with a particular label name.
This is the recursive method i have coded so far
def searchTreeWithLabel(tree,lablename,listOfNodes)
if tree.instance_of?(StanfordParser::Tree)
if tree.lable.toString == lablename then
listOfNodes << tree
else
tree.children.each { |c| searchTreeWithLabel(c, lablename, listOfNodes)}
end
end
listOfNodes
end
i want the method to return a list of Tree nodes that have the label as labelname
I'm not familiar StanfordParser but I imagine you need to take the descent part of the traversal out of the inner conditional and always do it.
Also, did they really implement a toString method? Seriously? It's not .to_s? I mean, I enjoyed Java, before I found Ruby... :-)
my original code was correct but ruby was having some problem with the
if tree.lable.toString == lablename then
statement, turns out tree.value works as well, so now i'm checking
if tree.value == lablename then
and it works.

Resources