Feasibility of a bit modified version of Rabin Karp algorithm

I am trying to implement a slightly modified version of the Rabin-Karp algorithm. My idea is this: if I compute a hash value for the pattern using a positional weight for each letter, then I don't have to worry about anagrams. I can just take a window of the string, calculate its hash value, and compare it with the pattern's hash value, unlike the traditional approach where, after the hash values of the window and the pattern match, they are compared character by character to rule out a false positive such as an anagram. Here is my code:
string = "AABAACAADAABAABA"
pattern = "AABA"
#string = "gjdoopssdlksddsoopdfkjdfoops"
#pattern = "oops"
#get hash value of the pattern
def gethashp(pattern):
    sum = 0
    #I multiply each letter of the pattern with a weight
    #So for e.g. CAT will be C*1 + A*2 + T*3 and the resulting
    #value will be unique for the word CAT and won't match if the
    #letters are rearranged
    for i in range(len(pattern)):
        sum = sum + ord(pattern[i]) * (i + 1)
    return sum % 101 #some prime number 101

def gethashst(string):
    sum = 0
    for i in range(len(string)):
        sum = sum + ord(string[i]) * (i + 1)
    return sum % 101

hashp = gethashp(pattern)
i = 0

def checkMatch(string, pattern, hashp):
    global i
    #check that we actually have a full pattern-length window (comes
    #handy when you are nearing the end of the string)
    if len(string[:len(pattern)]) == len(pattern):
        #assign the substring to string2
        string2 = string[:len(pattern)]
        #get the hash value of the substring
        hashst = gethashst(string2)
        #if both the hash values match
        if hashst == hashp:
            #print the index of the first character of the match
            print("Pattern found at {}".format(i))
        #delete the first character of the string
        string = string[1:]
        #increment the index
        i += 1 #keep a count of the index
        checkMatch(string, pattern, hashp)
    else:
        #end of string, return
        return

checkMatch(string, pattern, hashp)
The code is working just fine. My question is: is this a valid way of doing it? Can there be any instance where the logic might fail? All the Rabin-Karp implementations I have come across don't use this logic; instead, on every hash match they further check the strings character by character to make sure the match is not a false positive such as an anagram. So is it wrong if I do it this way? My view is that with this code, as soon as the hash values match you never have to compare the two strings character by character and can just move on to the next position.

It's not necessary that only anagrams collide with the hash value of the pattern; any other string with the same hash value can collide too. A matching hash value can lie, so a character-by-character check is required.
For example, in your case you are taking the sum mod 101, so there are only 101 possible hash values. Take any 102 distinct patterns: by the pigeonhole principle, at least two of them must have the same hash. If you use one of them as the pattern, the presence of the other string will corrupt your output if you skip the character-by-character check.
Moreover, even with the positional weights you used, two anagrams can have the same hash value; such pairs can be found by solving a simple linear equation.
For example, with letter values A = 1, B = 2, C = 3, ...:
DCE = 4*1 + 3*2 + 5*3 = 25
CED = 3*1 + 5*2 + 4*3 = 25
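The same collision carries over to the ord-based weights in the question's gethashp, so it is easy to verify directly (assuming the definitions above):

# ord('C') + ord('E') == 2 * ord('D') (67 + 69 == 2 * 68), so the
# positional weighting cannot distinguish these two anagrams:
print(gethashp("DCE"))                     # 5, i.e. 409 % 101
print(gethashp("CED"))                     # 5 as well
print(gethashp("DCE") == gethashp("CED"))  # True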

Related

Finding all the shortest unique substrings which are of the same length?

Given a string sequence which contains only four letters, ['a','g','c','t'],
for example: agggcttttaaaatttaatttgggccc.
Find all the shortest unique sub-strings of the sequence, which are of equal length (the length should be the minimum over all unique sub-strings).
For example: aaggcgccttt
answer: ['aa', 'ag', 'gg', 'cg', 'cc', 'ct']
explanation: the shortest unique sub-strings are of length 2
I have tried using suffix arrays coupled with longest common prefix, but I am unable to work the solution out completely.
I'm not sure what you mean by "minimum unique sub-string", but looking at your example I assume you mean "shortest runs of a single letter". If this is the case, you just need to iterate through the string once (character by character) and count all the shortest runs you find. You should keep track of the length of the minimum run found so far (infinity at start) and the length of the current run.
If you need to find the exact runs, you can add all the minimum runs you find to e.g. a list as you iterate through the string (and modify that list accordingly if a shorter run is found).
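For illustration, here is a minimal Python sketch of that run-counting interpretation (the function name and structure are mine, not the asker's):

def shortest_runs(s):
    # Collect (letter, length) for each maximal run of a single letter.
    if not s:
        return []
    runs = []
    current, length = s[0], 1
    for ch in s[1:]:
        if ch == current:
            length += 1
        else:
            runs.append((current, length))
            current, length = ch, 1
    runs.append((current, length))
    # Keep only the runs whose length equals the minimum run length.
    min_len = min(run_len for _, run_len in runs)
    return [letter * run_len for letter, run_len in runs if run_len == min_len]

print(shortest_runs("aaggcgccttt"))  # ['c', 'g']: the two runs of length 1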
EDIT:
I thought more about the problem and came up with the following solution.
We find all the unique sub-strings of length i (in ascending order). So, first we consider all sub-strings of length 1, then all sub-strings of length 2, and so on. If we find any, we stop, since the sub-string length can only increase from this point.
You will have to use a list to keep track of the sub-strings you've seen so far, and a list to store the actual sub-strings. You will also have to maintain them accordingly as you find new sub-strings.
Here's the Java code I came up with, in case you need it:
String str = "aaggcgccttt";
String curr = "";
ArrayList<String> uniqueStrings = new ArrayList<String>();
ArrayList<String> alreadySeen = new ArrayList<String>();

for (int i = 1; i < str.length(); i++) {
    for (int j = 0; j < str.length() - i + 1; j++) {
        curr = str.substring(j, j + i);
        if (!alreadySeen.contains(curr)) { // Sub-string hasn't been seen yet
            uniqueStrings.add(curr);
            alreadySeen.add(curr);
        }
        else // Repeated sub-string found
            uniqueStrings.remove(curr);
    }
    if (!uniqueStrings.isEmpty()) // We have found non-repeating sub-string(s)
        break;
    alreadySeen.clear();
}

// Output
if (uniqueStrings.isEmpty())
    System.out.println(str);
else {
    for (String s : uniqueStrings)
        System.out.println(s);
}
The uniqueStrings list contains all the unique sub-strings of minimum length (used for output). The alreadySeen list keeps track of all the sub-strings that have already been seen (used to exclude repeating sub-strings).
I'll write some code in Python, because that's what I find the easiest.
I actually wrote both the overlapping and the non-overlapping variants. As a bonus, the code also checks that the input is valid.
You seem to be interested only in the overlapping variant:
import itertools


def find_all(
        text,
        pattern,
        overlap=False):
    """
    Find all occurrences of the pattern in the text.

    Args:
        text (str|bytes|bytearray): The input text.
        pattern (str|bytes|bytearray): The pattern to find.
        overlap (bool): Detect overlapping patterns.

    Yields:
        position (int): The position of the next finding.
    """
    len_text = len(text)
    offset = 1 if overlap else (len(pattern) or 1)
    i = 0
    while i < len_text:
        i = text.find(pattern, i)
        if i >= 0:
            yield i
            i += offset
        else:
            break


def is_valid(text, tokens):
    """
    Check if the text only contains the specified tokens.

    Args:
        text (str|bytes|bytearray): The input text.
        tokens (str|bytes|bytearray): The valid tokens for the text.

    Returns:
        result (bool): The result of the check.
    """
    return set(text).issubset(set(tokens))


def shortest_unique_substr(
        text,
        tokens='acgt',
        overlapping=True,
        check_valid_input=True):
    """
    Find the shortest unique substring.

    Args:
        text (str|bytes|bytearray): The input text.
        tokens (str|bytes|bytearray): The valid tokens for the text.
        overlapping (bool): Detect overlapping patterns.
        check_valid_input (bool): Check if the input is valid.

    Returns:
        result (set): The set of the shortest unique substrings.
    """
    def add_if_single_match(text, pattern, result, overlapping):
        match_gen = find_all(text, pattern, overlapping)
        try:
            next(match_gen)  # first match
        except StopIteration:
            # the pattern is not found, nothing to do
            pass
        else:
            try:
                next(match_gen)
            except StopIteration:
                # the pattern was found only once, so add it to the results
                result.add(pattern)
            else:
                # the pattern is found twice, nothing to do
                pass

    # just some sanity check
    if check_valid_input and not is_valid(text, tokens):
        raise ValueError('Input text contains invalid tokens.')
    result = set()
    # shortest sequence cannot be longer than this
    if overlapping:
        max_lim = len(text) // 2 + 1
        for n in range(1, max_lim + 1):
            for pattern_gen in itertools.product(tokens, repeat=n):
                pattern = ''.join(pattern_gen)
                add_if_single_match(text, pattern, result, overlapping)
            if len(result) > 0:
                break
    else:
        max_lim = len(tokens)
        for n in range(1, max_lim + 1):
            for i in range(len(text) - n + 1):
                pattern = text[i:i + n]
                add_if_single_match(text, pattern, result, overlapping)
            if len(result) > 0:
                break
    return result
After a sanity check for the correctness of the outputs:

import functools

shortest_unique_substr_ovl = functools.partial(shortest_unique_substr, overlapping=True)
shortest_unique_substr_ovl.__name__ = 'shortest_unique_substr_ovl'
shortest_unique_substr_not = functools.partial(shortest_unique_substr, overlapping=False)
shortest_unique_substr_not.__name__ = 'shortest_unique_substr_not'

funcs = shortest_unique_substr_ovl, shortest_unique_substr_not

test_inputs = (
    'aaa',
    'aaaa',
    'aaggcgccttt',
    'agggcttttaaaatttaatttgggccc',
)

for func in funcs:
    print('Func:', func.__name__)
    for test_input in test_inputs:
        print(func(test_input))
    print()
Func: shortest_unique_substr_ovl
set()
set()
{'cg', 'ag', 'gg', 'ct', 'aa', 'cc'}
{'tg', 'ag', 'ct'}
Func: shortest_unique_substr_not
{'aa'}
{'aaa'}
{'cg', 'tt', 'ag', 'gg', 'ct', 'aa', 'cc'}
{'tg', 'ag', 'ct', 'cc'}
It is wise to benchmark how fast we actually are.
Below are some benchmarks, produced using some template code (the overlapping variant is shown in blue in the plots, which are not reproduced here).
And the rest of the code, for completeness:
import random

def gen_input(n, tokens='acgt'):
    return ''.join([tokens[random.randint(0, len(tokens) - 1)] for _ in range(n)])

def equal_output(a, b):
    return a == b

input_sizes = tuple(2 ** (1 + i) for i in range(16))

# benchmark() and plot_benchmarks() come from the template code mentioned above
runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=gen_input, equal_output=equal_output,
    input_sizes=input_sizes)

plot_benchmarks(runtimes, input_sizes, labels, units='ms')
plot_benchmarks(runtimes, input_sizes, labels, units='μs', zoom_fastest=2)
As far as the asymptotic time-complexity analysis is concerned, considering only the overlapping case: let N be the input size and let K be the number of tokens (4 in your case). find_all() is O(N), and the body of shortest_unique_substr() is O(K²) (+ O((K - 1)²) + ...).
So this is overall O(N*K²) or O(N*(Σk²)) (for k = 1, ..., K); since K is fixed, this is O(N), as the benchmarks seem to indicate.

How to affect only letters, not punctuation in Caesar Cipher code

I am trying to write a Caesar cipher in Ruby, and I hit a snag when trying to convert only the letters to numerical values, and not the punctuation marks.
Here is my script so far:
def caesar_cipher(phrase, key)
  array = phrase.split("")
  number = array.map {|n| n.upcase.ord - (64 - key)}
  puts number
end
puts "Script running"
caesar_cipher("Hey what's up", 1)
I tried to use select but I couldn't figure out how to select only the punctuation marks or only the letters.
Use String#gsub to match only the characters that you want to replace. In this case it's the letters of the alphabet, so you'll use the regular expression /[a-z]/i.
You can pass a block to gsub which will be called for each match in the string, and the return value of the block will be used as the replacement. For example:
"Hello, world!".gsub(/[a-z]/i) {|chr| (chr.ord + 1).chr }
# => Ifmmp, xpsme!"
Here's a version of your Caesar cipher method that works pretty well:
BASE_ORD = 'A'.ord

def caesar_cipher(phrase, key)
  phrase.gsub(/[a-z]/i) do |letter|
    orig_pos = letter.upcase.ord - BASE_ORD
    new_pos = (orig_pos + key) % 26
    (new_pos + BASE_ORD).chr
  end
end
caesar_cipher("Hey, what's up?", 1) # => "IFZ, XIBU'T VQ?"
Edit:
% is the modulo operator. Here it's used to make new_pos "wrap around" to the beginning of the alphabet if it's greater than 25.
For example, suppose letter is "Y" and key is 5. The position of "Y" in the alphabet is 24 (assuming "A" is 0), so orig_pos + key will be 29, which is past the end of the alphabet.
One solution would be this:
new_pos = orig_pos + key
if new_pos > 25
  new_pos = new_pos - 26
end
This would make new_pos 3, which corresponds to the letter "D", the correct result. We can get the same result more efficiently, however, by taking "29 modulo 26" (there are 26 letters in the alphabet), expressed in Ruby and many other languages as 29 % 26, which returns the remainder of 29 ÷ 26. That gives 3, the same result as above.
In addition to constraining a number to a certain range, as we do here, the modulo operator is also often used to test whether a number is divisible by another number. For example, you can check if n is divisible by 3 by testing n % 3 == 0.
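A quick illustration of both uses of modulo described above (Python here, though the answer's code is Ruby; the values are just for checking the arithmetic):

orig_pos = ord('Y') - ord('A')   # 24, the position of 'Y' when 'A' is 0
new_pos = (orig_pos + 5) % 26    # 29 % 26 == 3
print(chr(new_pos + ord('A')))   # 'D', wrapped around past 'Z'
print(9 % 3 == 0, 10 % 3 == 0)   # True False: the divisibility test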

Printing the n most frequent words in a file (string)

The goal:
Write a function that takes two parameters: (1) a String representing a text document and (2) an integer providing the number of items to return. Implement the function such that it returns a list of Strings ordered by word frequency, the most frequently occurring word first. Use your best judgement to decide how words are separated. Your solution should run in O(n) time where n is the number of characters in the document.
My thought was that, in the worst case, the number of items to return could equal the total number of words in the document, reducing the problem to sorting the words by their frequencies. This made me think the lower bound for time complexity would be O(n log n) if I used a comparison-based sort, so the best approach seemed to be a counting sort. Here is my code.
I would like you to tell me whether my analysis is correct; I've annotated the code with my idea of the time complexity, but it could definitely be wrong. What are the actual time and space complexity of this code? I would also like to hear whether this is in fact a good approach, and whether there are alternative approaches that would be used in practice.
### n is number of characters in string, k is number of words ###
def word_frequencies(string, n)
  words = string.split(/\s/) # O(n)
  max = 0
  min = Float::INFINITY
  frequencies = words.inject(Hash.new(0)) do |hash, word| # O(k)
    occurrences = hash[word] += 1 # O(1)
    max = occurrences if occurrences > max # O(1)
    min = occurrences if occurrences < min # O(1)
    hash # O(1)
  end

  ### perform a counting sort ###
  sorted = Array.new(max + words.length)
  delta = 0
  frequencies.each do |word, frequency| # O(k)
    p word + "--" + frequency.to_s
    index = frequency
    if sorted[index]
      sorted[index] = sorted[index].push(word) # ??? I think O(1).
    else
      sorted[index] = [word] # O(1)
    end
  end
  return sorted.compact.flatten[-n..-1].reverse
  ### Compact is O(k). Flatten is O(k). Reverse is O(k). So O(3k)
end
### Total --- O(n + 5k) = O(n). Correct?
### And the space complexity is O(n) for the hash + O(2k) for the sorted array.
### So total O(n).

text = "hi hello hi my name is what what hi hello hi this is a test test test test hi hi hi what hello these are some words these these"
p word_frequencies(text, 4)
Two ways:

def word_counter(string, max)
  string.split(/\s+/)
        .group_by{|x| x}
        .map{|x, y| [x, y.size]}
        .sort_by{|_, size| size} # Have to sort =/
        .last(max)
end

def word_counter(string, max)
  # Create a Hash and a List to store values in.
  word_counter, max_storage = Hash.new(0), []
  # Split the string and add each word to the hash:
  string.split(/\s+/).each{|word| word_counter[word] += 1}
  # Take each word and add it to the list (so that the list index = word count).
  # I also add the count, but that is not really needed.
  word_counter.each{|key, val| max_storage[val] = [*max_storage[val]] << [key, val]}
  # Higher counts will always be at the end; remove nils and take the last "max" elements.
  max_storage.compact.flatten(1).last(max)
end
One idea is the following:
You are already constructing a hash map that gives the frequency of a given word.
Now iterate through this hash map and create a reverse "hash set": that is, the set of words for a given frequency.
Find the maximum frequency and output the set of words for that frequency.
Decrement it, and check for words in the hash set.
Keep doing this until the required number of words is reached.
The running time of this algorithm is O(f), where f is the maximum frequency of any word, and the maximum frequency is at most n, where n is the number of characters, as required.
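A rough Python sketch of that reverse hash-set idea (the names and structure are mine, purely illustrative):

def top_n_words(text, n):
    # Frequency of each word.
    freq = {}
    for word in text.split():
        freq[word] = freq.get(word, 0) + 1
    # Reverse map: frequency -> set of words with that frequency.
    by_count = {}
    for word, count in freq.items():
        by_count.setdefault(count, set()).add(word)
    # Walk down from the maximum frequency until n words are collected.
    result = []
    for count in range(max(freq.values(), default=0), 0, -1):
        for word in by_count.get(count, ()):
            if len(result) == n:
                return result
            result.append(word)
    return result

print(top_n_words("hi hello hi what what hi", 2))  # ['hi', 'what']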
Sample, quick way :)

#assuming you read from the file and get it to a string called str
h = {}
arr = str.split("\n")
arr.each do |i|
  i.split(" ").each do |w|
    if h.has_key?(w)
      h[w] += 1
    else
      h[w] = 1
    end
  end
end
Hash[h.sort_by{|k, v| v}.reverse]

This works, but could be improved.

Generate same unique hash code for all anagrams

Recently, I attended an interview and faced a good question regarding hash collisions.
Question : Given a list of strings, print out the anagrams together.
Example : i/p : {act, god, animal, dog, cat}
o/p : act, cat, dog, god
I want to create a hash map, with the word as the key and the list of its anagrams as the value.
To avoid collisions, I want to generate the same unique hash code for all anagrams, instead of sorting each word and using the sorted word as the key.
I am looking for a hash algorithm that takes care of collisions by some means other than chaining. I want the algorithm to generate the same hash code for both act and cat, so that the next word gets added to the value list.
Can anyone suggest a good algorithm?
Hashing with the sorted string is pretty nice; I'd probably have done that, but it could indeed be slow and cumbersome. Here's another thought (not sure if it works): pick a set of prime numbers, as small as you like, the same size as your character set, and build a fast mapping function from your chars to those primes. Then, for a given word, map each character to its matching prime and multiply. Finally, hash using the result.
This is very similar to what Heuster suggested, only with less collisions (actually, I believe there will be no false collisions, given the uniqueness of the prime decomposition of any number).
A simple example:

int primes[] = {2, 3, 5, 7, ...}; // can be auto-generated with a simple loop

inline int prime_map(char c) {
    // check that c is within the legal char set bounds
    return primes[c - first_char];
}

...

char* word = get_next_word();
char* ptr = word;
int key = 1;
while (*ptr != '\0') {  // iterate until the string's null terminator
    key *= prime_map(*ptr);
    ptr++;
}
hash[key].add_to_list(word);
[edit]
A few words about the uniqueness: any integer has a single decomposition into prime factors, so given an integer key in the hash you can actually reconstruct all possible strings that would hash to it, and only those words. Just factor the key into primes, p1^n1 * p2^n2 * ..., and convert each prime back to its matching char; the char for p1 appears n1 times, and so on.
You can't get any prime you didn't explicitly use: being prime means it cannot be produced by multiplying other primes.
This brings another possible improvement: if you can reconstruct the string, you only need to mark which permutations you saw when populating the hash. Since the permutations can be ordered lexicographically, you can replace each one with a number. This saves the space of storing the actual strings in the hash, but requires more computation, so it's not necessarily a good design choice. Still, it makes a nice extension of the original question for interviews :)
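To make the reconstruction argument concrete, here is a small hypothetical Python snippet that factors a key back into its character multiset (the prime-to-char mapping is assumed, not part of the answer's code):

def reconstruct(key, prime_to_char):
    # Repeatedly divide out each prime; every division recovers one character.
    letters = []
    for p, ch in prime_to_char.items():
        while key % p == 0:
            letters.append(ch)
            key //= p
    return ''.join(sorted(letters))

print(reconstruct(42, {2: 'a', 3: 'c', 7: 't'}))  # 'act' (i.e. any anagram of it)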
Hash function: assign prime numbers to each character. While calculating the hash code, look up the prime number assigned to that character and multiply it into the existing value. Now all anagrams produce the same hash value.
For example:
a - 2
c - 3
t - 7
hash value of cat = 3*2*7 = 42
hash value of act = 2*3*7 = 42
Print all strings that have the same hash value (anagrams will have the same hash value).
The other posters suggested converting characters into prime numbers and multiplying them together. If you do this modulo a large prime, you get a good hash function that won't overflow. I tested the following Ruby code against the Unix word list of most English words and found no hash collisions between words that are not anagrams of one another. (On Mac OS X, this file is located at /usr/share/dict/words.)
My word_hash function takes the ordinal value of each character mod 32. This makes sure that uppercase and lowercase letters get the same code. The large prime I use is 2^58 - 27. Any large prime will do, so long as it is less than 2^64 / A, where A is my alphabet size. I am using 32 as my alphabet size, so this means I can't use a number larger than about 2^59 - 1. Since Ruby uses one bit for the sign and a second bit to indicate whether the value is a number or an object, I lose a bit compared with other languages.
def word_hash(w)
  # 32 prime numbers so we can use x.ord % 32. Doing this, 'A' and 'a' get the same
  # hash value, 'B' matches 'b', etc. for all the upper and lower cased characters.
  # Punctuation gets assigned values that overlap the letters, but we don't care about that much.
  primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53,
            59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131]
  # Use a large prime number as modulus. It must be small enough that it will not
  # overflow if multiplied by 32 (2^5). 2^64 / 2^5 equals 2^59, so we go a little lower.
  prime_modulus = (1 << 58) - 27
  w.chars.reduce(1) { |memo, letter| memo * primes[letter.ord % 32] % prime_modulus }
end
words = (IO.readlines "/usr/share/dict/words").map{|word| word.downcase.chomp}.uniq
wordcount = words.size
anagramcount = words.map { |w| w.chars.sort.join }.uniq.count

whash = {}
inverse_hash = {}
words.each do |w|
  h = word_hash(w)
  whash[w] = h
  x = inverse_hash[h]
  if x && x.each_char.sort.join != w.each_char.sort.join
    puts "Collision between #{w} and #{x}"
  else
    inverse_hash[h] = w
  end
end
hashcount = whash.values.uniq.size
puts "Unique words (ignoring capitalization) = #{wordcount}. Unique anagrams = #{anagramcount}. Unique hash values = #{hashcount}."
A small practical optimization I would suggest for the above hash method: assign the smallest primes to the vowels, then to the most frequently occurring consonants.
For example:
e : 2
a : 3
i : 5
o : 7
u : 11
t : 13
and so on...
Also, the average word length for English is about 6, and the first 26 primes are all at most 101 [2, 3, 5, 7, ..., 101].
Hence, on average your hash would generate values around 100^6 = 10^12, so there is very little chance of collision if you take a prime modulus bigger than 10^12.
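A possible Python sketch of that assignment (the letter-frequency order and the modulus 2^61 - 1 are my own choices, not from the answer):

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
FREQ_ORDER = "etaoinshrdlcumwfgypbvkjxqz"  # rough English letter-frequency order
PRIME_OF = {ch: p for ch, p in zip(FREQ_ORDER, PRIMES)}

def anagram_key(word, modulus=(1 << 61) - 1):  # 2^61 - 1 is a Mersenne prime
    key = 1
    for ch in word.lower():
        key = key * PRIME_OF[ch] % modulus
    return key

print(anagram_key("cat") == anagram_key("act"))  # True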
The complexity above seems very misplaced! You don't need prime numbers or hashes. It's just three simple ops:
Map each OriginalWord to a Tuple of (SortedWord, OriginalWord). Example: "cat" becomes ("act", "cat"); "dog" becomes ("dgo", "dog"). This is a simple sort on the chars of each OriginalWord.
Sort the Tuples by their first element. Example: ("dgo", "dog"), ("act", "cat") sorts to ("act", "cat"), ("dgo", "dog"). This is a simple sort on the entire collection.
Iterate through the tuples (in order), emitting the OriginalWord. Example: ("act", "cat"), ("dgo", "dog") emits "cat" "dog". This is a simple iteration.
Two iterations and two sorts are all it takes!
In Scala, it's exactly one line of code:
val words = List("act", "animal", "dog", "cat", "elvis", "lead", "deal", "lives", "flea", "silent", "leaf", "listen")
words.map(w => (w.toList.sorted.mkString, w)).sorted.map(_._2)
// Returns: List(animal, act, cat, deal, lead, flea, leaf, dog, listen, silent, elvis, lives)
Or, if, as the original question implies, you only want cases where the count > 1, it's just a bit more:
scala> words.map(w => (w.toList.sorted.mkString, w)).groupBy(_._1).filter({case (k,v) => v.size > 1}).mapValues(_.map(_._2)).values.toList.sortBy(_.head)
res64: List[List[String]] = List(List(act, cat), List(elvis, lives), List(flea, leaf), List(lead, deal), List(silent, listen))
The solution using a product of primes is brilliant, and here's a Java implementation in case anyone needs one.
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class HashUtility {
    private int n;
    private Map<Character, Integer> primeMap;

    public HashUtility(int n) {
        this.n = n;
        this.primeMap = new HashMap<>();
        constructPrimeMap();
    }

    /**
     * Utility to check if the passed {@code number} is a prime.
     *
     * @param number The number which is checked to be prime.
     * @return {@code boolean} value representing the prime nature of the number.
     */
    private boolean isPrime(int number) {
        if (number <= 2)
            return number == 2;
        else
            return (number % 2) != 0
                    && IntStream.rangeClosed(3, (int) Math.sqrt(number))
                            .filter(i -> i % 2 != 0)
                            .noneMatch(i -> (number % i == 0));
    }

    /**
     * Maps the first {@code n} primes to the letters of the given language.
     */
    private void constructPrimeMap() {
        List<Integer> primes = IntStream.range(2, Integer.MAX_VALUE)
                .filter(this::isPrime)
                .limit(this.n) // Limit the number of primes here
                .boxed()
                .collect(Collectors.toList());
        int curAlphabet = 0;
        for (int i : primes) {
            this.primeMap.put((char) ('a' + curAlphabet++), i);
        }
    }

    /**
     * We calculate the hash code of a word as the product of the primes mapped to
     * each character. This works since the prime decomposition of a number is
     * unique, so two products match only for the same multiset of characters.
     * <p>
     * Since the hash code can be huge, we return it modulo a large prime.
     *
     * @param word The {@code String} to be hashed.
     * @return {@code int} representing the prime hash code associated with the {@code word}.
     */
    public int hashCode(String word) {
        long primeProduct = 1;
        long mod = 100000007;
        for (char currentCharacter : word.toCharArray()) {
            // Reduce the running product (not the individual prime) so the
            // intermediate value never overflows a long.
            primeProduct = primeProduct * this.primeMap.get(currentCharacter) % mod;
        }
        return (int) primeProduct;
    }
}
Please let me know if/how I can improve this.
We can use a binary representation of the character-count array. This code snippet assumes all characters are lowercase Latin letters.
public int hashCode() {
    // Each set of anagrams generates the same hashCode.
    int sLen = s.length();
    int[] ref = new int[26];
    for (int i = 0; i < sLen; i++) {
        ref[s.charAt(i) - 'a'] += 1;
    }
    int hashCode = 0;
    for (int i = 0; i < ref.length; i++) {
        hashCode += (1 << i) * ref[i]; // weight the i-th letter count by 2^i
    }
    return hashCode;
}
Create the hash code in the following way:
String hash(String s) {
    char[] hashValue = new char[26];
    for (char c : s.toCharArray()) {
        hashValue[c - 'a']++;
    }
    return new String(hashValue);
}
Here hashValue will be initialized with the default char value \u0000, and each increment moves an entry to the next Unicode code point. Since it's a char array, we can convert it to a String and use it as the key.

How do I modify the Damerau-Levenshtein algorithm so that it also includes the start index and the end index of the match within the larger string?

Here is my code:
# http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
# used for fuzzy matching of two strings
# for indexing, seq2 must be the parent string
def dameraulevenshtein(seq1, seq2)
  oneago = nil
  min = 100000000000 # index
  max = 0 # index
  thisrow = (1..seq2.size).to_a + [0]
  seq1.size.times do |x|
    twoago, oneago, thisrow = oneago, thisrow, [0] * seq2.size + [x + 1]
    seq2.size.times do |y|
      delcost = oneago[y] + 1
      addcost = thisrow[y - 1] + 1
      subcost = oneago[y - 1] + ((seq1[x] != seq2[y]) ? 1 : 0)
      thisrow[y] = [delcost, addcost, subcost].min
      if (x > 0 and y > 0 and seq1[x] == seq2[y - 1] and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y])
        thisrow[y] = [thisrow[y], twoago[y - 2] + 1].min
      end
    end
  end
  return thisrow[seq2.size - 1], min, max
end
There has to be some way to get the starting and ending index of the substring, seq1, within the parent string, seq2, right?
I'm not entirely sure how this algorithm works, even after reading the wiki article on it. I mean, I understand the highest-level explanation, as it finds the insertion, deletion, and transposition differences (the lines in the second loop), but beyond that I'm a bit lost.
Here is an example of something that I want to be able to do with this (^):
substring = "hello there"
search_string = "uh,\n\thello\n\t there"
the indexes should be:
start: 5
end: 18 (last char of string)
Ideally, the search_string will never be modified. But I guess I could take out all the whitespace characters (since there are only three: \n, \r and \t), store the index of each whitespace character, get the indexes of my substring, and then re-add the whitespace characters, compensating the substring's indexes for the whitespace that was originally there. But if this could all be done in the same method, that would be amazing, since the algorithm is already O(n^2). =(
At some point, I'd like to only allow whitespace characters to split up the substring (s1)... but one thing at a time.
I don't think this algorithm is the right choice for what you want to do. The algorithm is simply calculating the distance between two strings in terms of the number of modifications you need to make to turn one string into another. If we rename your function to dlmatch for brevity and only return the distance, then we have:
dlmatch("hello there", "uh, \n\thello\n\t there"
=> 7
meaning that you can convert one string into the other in 7 steps (effectively by removing seven characters from the second). The problem is that 7 steps is a pretty big difference:
dlmatch("hello there", "panda here"
=> 6
This would actually imply that "hello there" and "panda here" are closer matches than the first example.
If what you are trying to do is "find a substring that mostly matches", I think you are stuck with an O(n^3) algorithm: feed the first string together with a series of substrings of the second string, and then select the substring that gives you the closest match.
Alternatively, you may be better off trying to do pre-processing on the search string and then doing regexp matching with the substring. For example, you could strip off all special characters and then build a regexp that looks for words in the substring that are case insensitive and can have any amount of whitespace between them.
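As a sketch of that pre-processing idea (Python used for illustration, although the question's code is Ruby; the function name is mine): split the pattern into words, escape them, and join them with a whitespace-tolerant regular expression.

import re

def fuzzy_find(substring, search_string):
    # Allow any run of whitespace between the words of the pattern.
    words = map(re.escape, substring.split())
    pattern = re.compile(r"\s+".join(words), re.IGNORECASE)
    m = pattern.search(search_string)
    # Return (start, inclusive end) of the match, or None if not found.
    return (m.start(), m.end() - 1) if m else None

print(fuzzy_find("hello there", "uh,\n\thello\n\t there"))  # (5, 17)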
