How to do fuzzy string matching of a bigger-than-memory dictionary in an ordered key-value store? - fuzzy-search

I am looking for an algorithm and storage schema to do string matching over a bigger-than-memory dictionary.
My initial attempt, inspired by https://swtch.com/~rsc/regexp/regexp4.html, was to store trigrams of every word of the dictionary at index time; for instance, the word apple is split into $ap, app, ppl, ple and le$. Each of those trigrams is associated with the word it came from.
Then at query time, I do the same for the input string that must be matched. I look up each of those trigrams in the database and store the candidate words in a mapping associated with the number of matching trigrams. Then I compute the Levenshtein distance between the query and every candidate and apply the following formula:
score(query, candidate) = common_trigram_number(query, candidate) - abs(levenshtein(query, candidate))
There are two problems with this approach: first, the candidate selection is too broad; second, the Levenshtein distance is too slow to compute.
Fixing the first could make optimizing the second unnecessary.
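For reference, here is a minimal in-memory sketch of the trigram indexing and scoring described above. A plain dict stands in for the OKVS, the tiny word list is made up, and the levenshtein helper is a straightforward DP rather than an optimized one:
from collections import defaultdict

def trigrams(word):
    word = "$" + word + "$"
    return {word[i:i+3] for i in range(len(word) - 2)}

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j-1] + 1, prev[j-1] + (ca != cb)))
        prev = cur
    return prev[-1]

# index time: trigram -> set of words (a dict stands in for the OKVS)
index = defaultdict(set)
for word in ("apple", "apply", "ape", "maple"):
    for gram in trigrams(word):
        index[gram].add(word)

# query time: count shared trigrams, then score candidates
def search(query):
    counts = defaultdict(int)
    for gram in trigrams(query):
        for word in index[gram]:
            counts[word] += 1
    scored = [(common - levenshtein(query, word), word)
              for word, common in counts.items()]
    return sorted(scored, reverse=True)

print(search("aple"))  # e.g. [(2, 'maple'), (2, 'apple'), (0, 'ape'), (-1, 'apply')]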
I thought about another approach: at index time, instead of storing trigrams, I would store words (possibly associated with frequency). At query time, I could look up successive prefixes of the query string and score using Levenshtein and frequency.
In particular, I am not looking for an algorithm that gives me strings at a distance of 1, 2, etc. I would just like a paginated list of more-or-less relevant words from the dictionary. The actual selection is made by the user.
Also, it must be possible to represent it in terms of an ordered key-value store (OKVS) like RocksDB or WiredTiger.

simhash captures similarity between (small) strings, but it does not really solve the problem of querying the most similar strings in a bigger-than-RAM dataset. I think the original paper recommends indexing some permutations, which requires a lot of memory and does not take advantage of the ordered nature of an OKVS.
I think I found a hash that captures similarity in its prefix:
In [1]: import fuzz
In [2]: hello = fuzz.bbkh("hello")
In [3]: helo = fuzz.bbkh("helo")
In [4]: hellooo = fuzz.bbkh("hellooo")
In [5]: salut = fuzz.bbkh("salut")
In [6]: len(fuzz.lcp(hello.hex(), helo.hex())) # Longest Common Prefix
Out[6]: 213
In [7]: len(fuzz.lcp(hello.hex(), hellooo.hex()))
Out[7]: 12
In [8]: len(fuzz.lcp(hello.hex(), salut.hex()))
Out[8]: 0
After a small test over Wikidata labels, it seems to give good results:
$ time python fuzz.py query 10 france
* most similar according to bbk fuzzbuzz
** france 0
** farrance -2
** freande -2
** defrance -2
real 0m0.054s
$ time python fuzz.py query 10 frnace
* most similar according to bbk fuzzbuzz
** farnace -1
** france -2
** fernacre -2
real 0m0.060s
$ time python fuzz.py query 10 beglium
* most similar according to bbk fuzzbuzz
** belgium -2
real 0m0.047s
$ time python fuzz.py query 10 belgium
* most similar according to bbk fuzzbuzz
** belgium 0
** ajbelgium -2
real 0m0.059s
$ time python fuzz.py query 10 begium
* most similar according to bbk fuzzbuzz
** belgium -1
** beijum -2
real 0m0.047s
Here is an implementation:
from string import ascii_lowercase
from itertools import product

HASH_SIZE = 2**10
BBKH_LENGTH = int(HASH_SIZE * 2 / 8)

chars = ascii_lowercase + "$"
ONE_HOT_ENCODER = sorted([''.join(x) for x in product(chars, chars)])


def ngram(string, n):
    return [string[i:i+n] for i in range(len(string)-n+1)]


def integer2booleans(integer):
    return [x == '1' for x in bin(integer)[2:].zfill(HASH_SIZE)]


def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]


def merkletree(booleans):
    """OR adjacent pairs level by level, storing the leaves at the end of the
    output and the higher levels toward the front."""
    assert len(booleans) == HASH_SIZE
    length = (2 * len(booleans) - 1)
    out = [False] * length
    index = length - 1
    booleans = list(reversed(booleans))
    while len(booleans) > 1:
        for boolean in booleans:
            out[index] = boolean
            index -= 1
        new = []
        for (right, left) in chunks(booleans, 2):
            value = right or left
            new.append(value)
        booleans = new
    return out


def bbkh(string):
    # one-hot encode the bigrams of the padded string into a single integer
    integer = 0
    string = "$" + string + "$"
    for gram in ngram(string, 2):
        hotbit = ONE_HOT_ENCODER.index(gram)
        hotinteger = 1 << hotbit
        integer = integer | hotinteger
    # fold the one-hot vector into a tree so that similar bigram sets
    # share a long prefix, then serialize as bytes
    booleans = integer2booleans(integer)
    tree = merkletree(booleans)
    fuzz = ''.join('1' if x else '0' for x in tree)
    buzz = int(fuzz, 2)
    hash = buzz.to_bytes(BBKH_LENGTH, 'big')
    return hash


def lcp(a, b):
    """Longest Common Prefix between a and b"""
    out = []
    for x, y in zip(a, b):
        if x == y:
            out.append(x)
        else:
            break
    return ''.join(out)
Evaluation: using Wikipedia's list of common misspelled words (around 8k words), considering the top 10 nearest words yields 77% success, with each query taking around 20 ms. Considering the top 100 yields 94% success, with each query taking less than 200 ms. Most mistakes come from joined words (e.g. "abouta" instead of "about a") or words separated with a dash.
Check out the code at https://github.com/amirouche/fuzzbuzz/blob/master/typofix.py
Note: computing a simhash of the input string only works with a bag of lemmas or stems; it is really meant for finding similar documents.
Using a bytes encoding is an optimization over keeping the binary string representation (e.g. 0b001).
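To exploit the ordered nature of the OKVS, the idea is to key each word by its bbkh hash and, at query time, seek to the query's hash and scan the neighbouring keys. Below is a minimal in-memory sketch: a sorted Python list and bisect stand in for the OKVS, it reuses the bbkh and lcp functions above, and it ranks candidates by shared hash prefix only, so it is a toy stand-in rather than what typofix.py actually does:
import bisect

# index time: one key per word, sorted by hash (stand-in for OKVS keys)
WORDS = ["france", "belgium", "farnace", "defrance", "salut"]
KEYS = sorted((bbkh(word), word) for word in WORDS)

def nearest(query, limit=10, scan=100):
    """Seek to the query hash, scan `scan` keys on each side,
    and rank the candidates by longest common hash prefix."""
    target = bbkh(query)
    start = bisect.bisect_left(KEYS, (target, ""))
    lo, hi = max(0, start - scan), min(len(KEYS), start + scan)
    scored = [(len(lcp(target.hex(), key.hex())), word) for key, word in KEYS[lo:hi]]
    scored.sort(reverse=True)
    return [word for _, word in scored[:limit]]

print(nearest("frnace"))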

Related

Finding sum of all integer substring using dynamic programming

I was solving the Sam and substrings problem from HackerRank. It basically asks for the sum of all substrings, interpreted as integers, of a string of digits.
Samantha and Sam are playing a numbers game. Given a number as a string, no leading zeros, determine the sum of all integer values of substrings of the string.
Given an integer as a string, sum all of its substrings cast as integers. As the number may become large, return the value modulo 10⁹ + 7.
Example: n = '42'
Here n is a string that has three integer substrings: 4, 2, and 42. Their sum is 48, and 48 modulo 10⁹ + 7 = 48.
Function Description
Complete the substrings function in the editor below.
substrings has the following parameter(s):
string n: the string representation of an integer
Returns
int: the sum of the integer values of all substrings in n, modulo (10⁹ + 7)
I tried the following recursive top-down dynamic programming solution with memoization:
from functools import cache

def substrings(n):
    @cache
    def substrSum(curIndex):
        if curIndex == 0: return int(n[0])
        return substrSum(curIndex-1)*10 + int(n[curIndex]) * (curIndex+1)

    totalSum = 0
    for i in range(len(n)-1, -1, -1):
        totalSum += substrSum(i)
    return totalSum % (10 ** 9 + 7)
I also tried a recursive bottom-up dynamic programming solution with memoization (this simply involves changing the for-loop counting direction):
from functools import cache

def substrings(n):
    @cache
    def substrSum(curIndex):
        if curIndex == 0: return int(n[0])
        return substrSum(curIndex-1)*10 + int(n[curIndex]) * (curIndex+1)

    totalSum = 0
    for i in range(len(n)):
        totalSum += substrSum(i)
    return totalSum % (10 ** 9 + 7)
The top-down solution gives a runtime error in 8 out of 13 test cases, whereas the bottom-up solution gives a runtime error in 6 out of 13 test cases. Where am I making a mistake?
Your algorithm is correct (both versions), but HackerRank will test with strings that have many thousands of digits, and as you perform a recursive call for each digit, your first code runs into a maximum recursion depth exceeded error, and the second one runs into a memory error (think of the cache).
It should be noted that they phrased the constraint wrong. It is not the value of n "cast to integer" that is limited by 2 × 10⁵, but the number of digits in n. I checked this, and one of their tests concerns a string of about 199000 digits.
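One way to keep the same recurrence but avoid both the recursion depth limit and the unbounded cache is to run it as a plain loop, carrying only the previous value. A minimal sketch using the same formula as the code above:
def substrings(n):
    MOD = 10 ** 9 + 7
    total = 0
    prev = 0  # sum of the values of all substrings ending at the previous index
    for i, digit in enumerate(n):
        # substrings ending at i are the substrings ending at i-1, shifted left (*10)
        # with the new digit appended, plus the single-digit substring itself:
        # the new digit is therefore added (i+1) times in total
        prev = (prev * 10 + int(digit) * (i + 1)) % MOD
        total = (total + prev) % MOD
    return total

print(substrings('42'))  # 48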

Generating unique non-similar codes with validation

I know there are similar questions so please bear with me.
I wish to generate approximately 50K codes for people to place orders - ideally no longer than 10 chars, and they can include letters and digits. They are not discount codes, so I am not worried about people trying to guess codes. What I am worried about is somebody accidentally entering a wrong character (i.e. 1 instead of l, or 0 instead of O) and the system accepting it because by chance it is also a valid code.
As the codes are constantly being generated, ideally I don't want a table-lookup validation but a formula (e.g. if it contains an A, the number element should be divisible by 13, or some such).
Select some alphabet (made of digits and letters) of size B such that there are no easy confusions. Assign every symbol a value from 0 to B-1, preferably in random order. Now you can use sequential integers, convert them to base B and assign the symbols accordingly.
For improved safety, you can append one or two checksum symbols for error detection.
With B = 34 (ten digits and twenty-four letters: 9ABHC0FVW3YGJKL1N2456XRTS78DMPQEUZ), 50K codes require a length of only four symbols (34³ = 39,304 < 50,000 ≤ 34⁴).
If you don't want the generated codes to be consecutive, you can scramble the bits before the change of base.
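A minimal sketch of that scheme, using the 34-symbol alphabet above. The encode/decode names and the simple weighted checksum are purely illustrative (the checksum is weaker than a proper Luhn or Damm check, and no bit scrambling is applied):
ALPHABET = "9ABHC0FVW3YGJKL1N2456XRTS78DMPQEUZ"  # 34 symbols, no easy confusions
B = len(ALPHABET)

def encode(n, length=4):
    """Convert a sequential integer into a fixed-length base-B code
    plus one checksum symbol (weighted sum of the digit values mod B)."""
    digits = []
    for _ in range(length):
        digits.append(n % B)
        n //= B
    digits.reverse()
    checksum = sum((i + 1) * d for i, d in enumerate(digits)) % B
    return ''.join(ALPHABET[d] for d in digits) + ALPHABET[checksum]

def decode(code, length=4):
    """Return the original integer, or None if a symbol or the checksum is wrong."""
    if len(code) != length + 1 or any(c not in ALPHABET for c in code):
        return None
    digits = [ALPHABET.index(c) for c in code[:length]]
    if ALPHABET.index(code[-1]) != sum((i + 1) * d for i, d in enumerate(digits)) % B:
        return None
    value = 0
    for d in digits:
        value = value * B + d
    return value

assert all(decode(encode(i)) == i for i in range(50000))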
Before you start generating random combinations of characters, there are a couple of things you need to bear in mind:
1. Profanity
If your codes include every possible combination of four letters from the alphabet, they will inevitably include every four-letter word. You need to be absolutely sure that you never ask customers to enter anything foul or offensive.
2. Human error
People often make mistakes when entering codes. Confusing similar characters like O and 0 is only part of the problem. Other common mistakes include transposing adjacent characters (e.g. the → teh) and hitting the wrong key on the keyboard (e.g. and → amd).
To avoid these issues, I would recommend that you generate codes from a restricted alphabet that has no possibility of spelling out anything unfortunate, and use the Luhn algorithm or something similar to catch accidental data entry errors.
For example, here's some Python code that generates hexadecimal codes using an alphabet of 16 characters with no vowels. It uses a linear congruential generator step to avoid outputting sequential numbers, and includes a base-16 Luhn checksum to detect input errors. The code2int() function will return −1 if the checksum is incorrect. Otherwise it will return an integer. If this integer is less than your maximum input value (e.g., 50,000), then you can assume the code is correct.
def int2code(n):
    # Generates a 7-character code from an integer value (n > 0)
    alph = 'BCDFGHJKMNPRTWXZ'
    mod = 0xfffffd  # Highest 24-bit prime
    mul = 0xc36572  # Randomly selected multiplier
    add = 0x5d48ca  # Randomly selected addend
    # Convert the input number `n` into a non-sequential 6-digit
    # hexadecimal code by means of a linear congruential generator
    c = "%06x" % ((n * mul + add) % mod)
    # Replace each hex digit with the corresponding character from alph,
    # and generate a base-16 Luhn checksum at the same time
    luhn_sum = 0
    code = ''
    for i in range(6):
        d = int(c[i], 16)
        code += alph[d]
        if i % 2 == 1:
            t = d * 15
            luhn_sum += (t & 0x0f) + (t >> 4)
        else:
            luhn_sum += d
    # Append the checksum
    checksum = (16 - (luhn_sum % 16)) % 16
    code += alph[checksum]
    return code


def code2int(code):
    # Converts a 7-character code back into an integer value
    # Returns -1 if the input is invalid
    alph = 'BCDFGHJKMNPRTWXZ'
    mod = 0xfffffd  # Highest 24-bit prime
    inv = 0x111548  # Modular multiplicative inverse of 0xc36572
    sub = 0xa2b733  # = 0xfffffd - 0x5d48ca
    if len(code) != 7:
        return -1
    # Treating each character as a hex digit, convert the code back into
    # an integer value. Also make sure the Luhn checksum is correct
    luhn_sum = 0
    c = 0
    for i in range(7):
        if code[i] not in alph:
            return -1
        d = alph.index(code[i])
        c = c * 16 + d
        if i % 2 == 1:
            t = d * 15
            luhn_sum += (t & 0x0f) + (t >> 4)
        else:
            luhn_sum += d
    if luhn_sum % 16 != 0:
        return -1
    # Discard the last digit (corresponding to the Luhn checksum), and undo
    # the LCG calculation to retrieve the original input value
    c = (((c >> 4) + sub) * inv) % mod
    return c
# Test
>>> print('\n'.join([int2code(i) for i in range(10)]))
HWGMTPX
DBPXFZF
XGCFRCN
PKKNDJB
JPWXNRK
DXGGCBR
ZCPNMDD
RHBXZKN
KMKGJTZ
FRWNXCH
>>> print(all([code2int(int2code(i)) == i for i in range(50000)]))
True

How to use the trained char-rnn to generate words?

When the char-rnn is trained, the weights of the network are fixed. If I use the same first char, how can I get different sentences? For example, the two sentences "What is wrong?" and "What can I do for you?"
have the same first character "W". Can the char-rnn generate these two different sentences?
Yes, you can get different results from the same state by sampling. Take a look at min-char-rnn by Andrej Karpathy. The sample code is at line 63:
def sample(h, seed_ix, n):
    """
    sample a sequence of integers from the model
    h is memory state, seed_ix is seed letter for first time step
    """
    x = np.zeros((vocab_size, 1))
    x[seed_ix] = 1
    ixes = []
    for t in xrange(n):
        h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
        y = np.dot(Why, h) + by
        p = np.exp(y) / np.sum(np.exp(y))
        ix = np.random.choice(range(vocab_size), p=p.ravel())
        x = np.zeros((vocab_size, 1))
        x[ix] = 1
        ixes.append(ix)
    return ixes
Starting from the same hidden vector h and seed char seed_ix, you get a deterministic distribution p over the next char. But the result is random, because the code uses np.random.choice instead of np.argmax. If the distribution is highly peaked at some char, you'll still get the same outcome most of the time, but usually several next chars have non-negligible probability and will get sampled, thus changing the whole generated sequence.
Note that this isn't the only possible sampling procedure: temperature-based sampling is more popular. You can take a look at, for instance, this post for an overview.
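For reference, a minimal sketch of temperature-based sampling, independent of min-char-rnn (the function name and interface are illustrative only); it could be applied to the unnormalized scores y from the sample() excerpt above:
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample an index from unnormalized scores.
    temperature < 1 sharpens the distribution (closer to argmax);
    temperature > 1 flattens it (more diverse output)."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64).ravel() / temperature
    scaled -= scaled.max()  # numerical stability before exp
    p = np.exp(scaled)
    p /= p.sum()
    return rng.choice(len(p), p=p)

# e.g. ix = sample_with_temperature(y, temperature=0.8)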

Feasibility of a bit modified version of Rabin Karp algorithm

I am trying to implement a slightly modified version of the Rabin-Karp algorithm. My idea is that if I compute the hash value of the given pattern as a weighted sum of its letters, I don't have to worry about anagrams, so I can just pick a part of the string, calculate its hash value and compare it with the hash value of the pattern, unlike the traditional approach where, after the hash values of the substring and the pattern match, a character-by-character check is still done to make sure they are actually equal and not, say, anagrams of each other. Here is my code:
string = "AABAACAADAABAABA"
pattern = "AABA"
#string = "gjdoopssdlksddsoopdfkjdfoops"
#pattern = "oops"

# get hash value of the pattern
def gethashp(pattern):
    sum = 0
    # I multiply each letter of the pattern by a weight.
    # So for e.g. CAT will be C*1 + A*2 + T*3 and the resulting
    # value will be unique for the letters CAT and won't match if the
    # letters are rearranged
    for i in range(len(pattern)):
        sum = sum + ord(pattern[i]) * (i + 1)
    return sum % 101  # some prime number 101

def gethashst(string):
    sum = 0
    for i in range(len(string)):
        sum = sum + ord(string[i]) * (i + 1)
    return sum % 101

hashp = gethashp(pattern)
i = 0

def checkMatch(string, pattern, hashp):
    global i
    # check that we actually get a full pattern-length substring (comes handy
    # when you are nearing the end of the string)
    if len(string[:len(pattern)]) == len(pattern):
        # assign the substring to string2
        string2 = string[:len(pattern)]
        # get the hash value of the substring
        hashst = gethashst(string2)
        # if both the hash values match
        if hashst == hashp:
            # print the index of the first character of the match
            print("Pattern found at {}".format(i))
        # delete the first character of the string
        string = string[1:]
        # increment the index
        i += 1  # keep a count of the index
        checkMatch(string, pattern, hashp)
    else:
        # if no match or end of string, return
        return

checkMatch(string, pattern, hashp)
The code is working just fine. My question is: is this a valid way of doing it? Can there be any instance where the logic might fail? All the Rabin-Karp implementations that I have come across don't use this logic; instead, for every hash match, they further check character by character to ensure it's not an anagram or other false match. So is it wrong if I do it this way? My opinion is that with this code, as soon as the hash value matches, you never have to check both strings character by character and you can just move on to the next position.
It's not necessary that only anagrams collide with the hash value of the pattern; any other string with the same hash value can also collide. A matching hash value can act as a liar, so a character-by-character check is required.
For example, in your case you are taking the sum mod 101, so there are only 101 possible hash values. Take any 102 distinct patterns; by the pigeonhole principle, at least two of them must have the same hash. If you use one of them as the pattern, the presence of the other string will corrupt your output if you skip the character-by-character check.
Moreover, even with the weighted hash you used, two anagrams can have the same hash value, which can be found by solving two linear equations.
For example,
DCE = 4*1 + 3*2 + 5*3 = 25
CED = 3*1 + 5*2 + 4*3 = 25
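For contrast, here is a minimal sketch of the standard Rabin-Karp scheme: a rolling polynomial hash over a sliding window, with an explicit character comparison whenever the hashes agree, precisely because a hash match alone can lie:
def rabin_karp(text, pattern, base=256, mod=10**9 + 7):
    """Return the start indices of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    ph = th = 0
    for i in range(m):
        ph = (ph * base + ord(pattern[i])) % mod
        th = (th * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        # verify character by character to rule out a hash collision
        if ph == th and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # roll the window: drop text[i], append text[i + m]
            th = ((th - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches

print(rabin_karp("AABAACAADAABAABA", "AABA"))  # [0, 9, 12]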

Optimized way of finding similar strings

Suppose I have a large list of words (about 4-5 thousand and increasing). Someone searched for a string, but unfortunately the string was not found in the word list. Now what would be the best and most optimized way to find words similar to the input string? The first thing that came to my mind was calculating the Levenshtein distance between each entry of the word list and the input string, but is that really the optimized way to do it?
(Note that, this is not language-specific question)
EDIT: new solution
Yes, calculating Levenshtein distances between your input and the word list can be a reasonable approach, but takes a lot of time. BK-trees can improve this, but they become slow quickly as the Levenshtein distance becomes bigger. It seems we can speed up the Levenshtein distance calculations using a trie, as described in this excellent blog post:
Fast and Easy Levenshtein distance using a Trie
It relies on the fact that the dynamic programming lookup table for Levenshtein distance has common rows in different invocations i.e. levenshtein(kate,cat) and levenshtein(kate,cats).
Running the Python program given on that page with the TWL06 dictionary gives:
> python dict_lev.py HACKING 1
Read 178691 words into 395185 nodes
('BACKING', 1)
('HACKING', 0)
('HACKLING', 1)
('HANKING', 1)
('HARKING', 1)
('HAWKING', 1)
('HOCKING', 1)
('JACKING', 1)
('LACKING', 1)
('PACKING', 1)
('SACKING', 1)
('SHACKING', 1)
('RACKING', 1)
('TACKING', 1)
('THACKING', 1)
('WHACKING', 1)
('YACKING', 1)
Search took 0.0189998 s
That's really fast, and would be even faster in other languages. Most of the time is spent in building the trie, which is irrelevant as it needs to be done just once and stored in memory.
The only minor downside to this is that tries take up a lot of memory (which can be reduced with a DAWG, at the cost of some speed).
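Here is a minimal sketch of that idea (not the blog post's exact code): a trie over the dictionary, with one row of the Levenshtein DP table computed per trie node, so shared prefixes share work and whole subtrees are pruned once every cell in a row exceeds the allowed distance:
class TrieNode:
    def __init__(self):
        self.word = None       # set on nodes that terminate a dictionary word
        self.children = {}

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word


def search(root, word, max_cost):
    """Return (word, distance) pairs within max_cost edits of `word`."""
    first_row = list(range(len(word) + 1))
    results = []
    for ch, child in root.children.items():
        _search(child, ch, word, first_row, results, max_cost)
    return results


def _search(node, ch, word, previous_row, results, max_cost):
    # build one DP row for this trie node from the parent's row
    current_row = [previous_row[0] + 1]
    for column in range(1, len(word) + 1):
        insert_cost = current_row[column - 1] + 1
        delete_cost = previous_row[column] + 1
        replace_cost = previous_row[column - 1] + (word[column - 1] != ch)
        current_row.append(min(insert_cost, delete_cost, replace_cost))
    if node.word is not None and current_row[-1] <= max_cost:
        results.append((node.word, current_row[-1]))
    # prune: if no cell is within max_cost, no descendant can match either
    if min(current_row) <= max_cost:
        for next_ch, child in node.children.items():
            _search(child, next_ch, word, current_row, results, max_cost)


root = TrieNode()
for w in ("backing", "hacking", "packing", "whacking"):
    root.insert(w)
print(search(root, "hacking", 1))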
Another approach: Peter Norvig has a great article (with complete source code) on spelling correction:
http://norvig.com/spell-correct.html
The idea is to build possible edits of the words, and then choose the most likely spelling correction of that word.
I think that something better than this exists, but BK-trees are at least a good optimization over brute force.
It uses the property that Levenshtein distance is a metric, and hence if you get a Levenshtein distance of d between your query and an arbitrary string s from the dict, then all your results must be at a distance between (d-n) and (d+n) from s. Here n is the maximum Levenshtein distance from the query you want to output.
It's explained in detail here: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
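A minimal BK-tree sketch along those lines (the class layout is illustrative only, and the plain DP levenshtein is included just to keep it self-contained); the triangle-inequality pruning is the line restricting child edges to [d-n, d+n]:
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class BKTree:
    def __init__(self, words):
        words = iter(words)
        self.root = [next(words), {}]      # node = [word, {distance: child}]
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return                     # word already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = [word, {}]
                return
            node = child

    def search(self, query, n):
        """Return (distance, word) pairs within distance n of query."""
        results, stack = [], [self.root]
        while stack:
            word, children = stack.pop()
            d = levenshtein(query, word)
            if d <= n:
                results.append((d, word))
            # triangle inequality: only edges labelled in [d-n, d+n] can lead to matches
            for edge, child in children.items():
                if d - n <= edge <= d + n:
                    stack.append(child)
        return sorted(results)


tree = BKTree(["hacking", "backing", "whack", "pack", "salut"])
print(tree.search("hackin", 2))  # e.g. [(1, 'hacking'), (2, 'backing')]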
If you are interested in the code itself: I implemented an algorithm for finding the optimal alignment between two strings. It basically shows how to transform one string into the other with k operations (where k is the Levenshtein/edit distance of the strings). It could be simplified a bit for your needs (as you only need the distance itself). By the way, it works in O(mn) time, where m and n are the lengths of the strings. My implementation is based on: this and this.
# optimization: using ints instead of strings:
# 1  ~ "left", "insertion"
# 7  ~ "up", "deletion"
# 17 ~ "up-left", "match/mismatch"
def GetLevenshteinDistanceWithBacktracking(sequence1, sequence2):
    distances = [[0 for y in range(len(sequence2)+1)] for x in range(len(sequence1)+1)]
    backtracking = [[1 for y in range(len(sequence2)+1)] for x in range(len(sequence1)+1)]
    for i in range(1, len(sequence1)+1):
        distances[i][0] = i
    for i in range(1, len(sequence2)+1):
        distances[0][i] = i
    for j in range(1, len(sequence2)+1):
        for i in range(1, len(sequence1)+1):
            if sequence1[i-1] == sequence2[j-1]:
                distances[i][j] = distances[i-1][j-1]
                backtracking[i][j] = 17
            else:
                deletion = distances[i-1][j] + 1
                substitution = distances[i-1][j-1] + 1
                insertion = distances[i][j-1] + 1
                distances[i][j] = min(deletion, substitution, insertion)
                if distances[i][j] == deletion:
                    backtracking[i][j] = 7
                elif distances[i][j] == insertion:
                    backtracking[i][j] = 1
                else:
                    backtracking[i][j] = 17
    return (distances[len(sequence1)][len(sequence2)], backtracking)


def Alignment(sequence1, sequence2):
    cost, backtracking = GetLevenshteinDistanceWithBacktracking(sequence1, sequence2)
    alignment1 = alignment2 = ""
    i = len(sequence1)
    j = len(sequence2)
    # from the backtracking matrix we recover an optimal alignment
    while i > 0 or j > 0:
        if i > 0 and j > 0 and backtracking[i][j] == 17:
            alignment1 = sequence1[i-1] + alignment1
            alignment2 = sequence2[j-1] + alignment2
            i -= 1
            j -= 1
        elif i > 0 and backtracking[i][j] == 7:
            alignment1 = sequence1[i-1] + alignment1
            alignment2 = "-" + alignment2
            i -= 1
        elif j > 0 and backtracking[i][j] == 1:
            alignment2 = sequence2[j-1] + alignment2
            alignment1 = "-" + alignment1
            j -= 1
        elif i > 0:
            alignment1 = sequence1[i-1] + alignment1
            alignment2 = "-" + alignment2
            i -= 1
    return (cost, (alignment1, alignment2))
It depends on the broader context and how accurate you want to be, but here is what I would (probably) start with:
Only consider the subset of words that start with the same character as the query word. That alone would decrease the amount of work by a factor of roughly 20 for a single query.
I would also categorize words according to their lengths, and for each category allow a different maximal distance. With 4 categories, for example: 0 if the length is between 0 and 2; 1 if between 3 and 5; 2 if between 6 and 8; 3 if 9+. Then, based on the query length, you only check the words from the relevant categories. Moreover, it should not be hard to make the algorithm stop as soon as the maximal distance has been exceeded.
If needed, I would then start to think about some machine learning approach.
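A minimal sketch of the first-letter and length-bucket filtering described above (the bucket layout and thresholds mirror the four categories suggested; ranking the surviving candidates with an edit-distance routine, such as the ones earlier in this thread, is left out):
from collections import defaultdict

def build_buckets(words):
    """Group words by (first letter, length) so a query scans only a small subset."""
    buckets = defaultdict(list)
    for word in words:
        buckets[(word[0], len(word))].append(word)
    return buckets

def max_distance(length):
    # the length-based thresholds from the four categories above
    if length <= 2:
        return 0
    if length <= 5:
        return 1
    if length <= 8:
        return 2
    return 3

def candidates(query, buckets):
    """Yield only words whose length is close enough to possibly be within range."""
    n = max_distance(len(query))
    for length in range(len(query) - n, len(query) + n + 1):
        yield from buckets.get((query[0], length), [])

buckets = build_buckets(["apple", "apply", "ample", "banana", "app"])
print(list(candidates("aple", buckets)))  # ['app', 'apple', 'apply', 'ample']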
