How to fuzzily search for a dictionary word? - algorithm

I have read a lot of threads here discussing edit-distance based fuzzy-searches, which tools like Elasticsearch/Lucene provide out of the box, but my problem is a bit different. Suppose I have a dictionary of words, {'cat', 'cot', 'catalyst'}, and a character similarity relation f(x, y)
f(x, y) = 1, if characters x and y are similar
f(x, y) = 0, otherwise
(These "similarities" can be specified by the programmer)
such that, say,
f('t', 'l') = 1
f('a', 'o') = 1
f('f', 't') = 1
but,
f('a', 'z') = 0
etc.
Now if we have a query 'cofatyst', the algorithm should report the following matches:
('cot', 0)
('cat', 0)
('catalyst', 0)
where the number is the 0-based starting index of the match found. I have tried the Aho-Corasick algorithm, and while it works great for exact matching and when a character has relatively few "similar" characters, its performance degrades exponentially as we increase the number of similar characters per character. Can anyone point me to a better way of doing this? Fuzziness is an absolute necessity, and it must take into account character similarities (i.e., not blindly depend on edit distance alone).
One thing to note is that in the wild, the dictionary is going to be really large.
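To make the problem concrete, here is a minimal brute-force sketch of the behaviour I want (the similarity table and names are only for illustration); it is far too slow for a large dictionary, which is exactly why I am asking:

# Brute-force sketch of the desired matching (illustrative only).
# SIMILAR is the programmer-specified character-similarity relation f(x, y).
SIMILAR = {('t', 'l'), ('a', 'o'), ('f', 't')}

def similar(x, y):
    return x == y or (x, y) in SIMILAR or (y, x) in SIMILAR

def fuzzy_matches(dictionary, query):
    # Yield (word, start_index) for every dictionary word whose characters
    # are all pairwise similar to a window of the query.
    for word in dictionary:
        for start in range(len(query) - len(word) + 1):
            window = query[start:start + len(word)]
            if all(similar(a, b) for a, b in zip(word, window)):
                yield (word, start)

print(sorted(fuzzy_matches({'cat', 'cot', 'catalyst'}, 'cofatyst')))
# [('cat', 0), ('catalyst', 0), ('cot', 0)] -- but this is
# O(dictionary size * query length * word length)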

I might try to use the cosine similarity, using the position of each character as a feature and mapping the product between features with a match function based on your character relations.
Not very specific advice, I know, but I hope it helps you.
edited: Expanded answer.
With the cosine similarity, you would compute how similar two vectors are. In your case the normalisation might not make sense, so what I would do is something very simple (I might be oversimplifying the problem): first, view the C x C matrix as a dependency matrix holding the probability that two characters are related (e.g., P('t' | 'l') = 1). This also lets you specify partial dependencies, to differentiate between perfect and partial matches. After that, compute, for each position, the probability that the letters from the two words are not the same (using the complement of P(t_i, t_j)), and then aggregate the results with a sum.
This counts the number of positions that differ for a specific pair of words, while still allowing partial dependencies. Furthermore, the implementation is very simple and should scale well, which is why I am not sure whether I have misunderstood your question.
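A minimal sketch of that per-position scoring, assuming a programmer-supplied table P of pairwise character match probabilities (all names and values here are illustrative):

# Per-position aggregation sketch (illustrative only).
# P[(x, y)] is the probability that characters x and y are related;
# unlisted pairs default to 0, identical characters to 1.
P = {('t', 'l'): 1.0, ('a', 'o'): 1.0, ('f', 't'): 0.5}  # 0.5 = partial match

def p_related(x, y):
    if x == y:
        return 1.0
    return max(P.get((x, y), 0.0), P.get((y, x), 0.0))

def mismatch_score(word_a, word_b):
    # Sum, over aligned positions, of the probability that the two
    # characters are NOT related (lower means a better match).
    return sum(1.0 - p_related(a, b) for a, b in zip(word_a, word_b))

print(mismatch_score('cat', 'cof'))  # 0.5: only the t/f pair is a partial match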

I am using the Fuse JavaScript library for a project of mine. It is a JavaScript library that works on JSON datasets, and it is quite fast. Have a look at it.
It implements the full Bitap algorithm, leveraging a modified version of Google's Diff, Match & Patch tool (from its site).
The code is simple, so it is easy to follow how the algorithm is implemented.
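For reference, the exact-match core of Bitap (the shift-and variant) is only a few lines; Fuse builds error tolerance on top of this, but the skeleton looks roughly like the sketch below (not Fuse's actual code):

# Shift-and (exact-match) skeleton of the Bitap algorithm (illustrative sketch).
def bitap_search(text, pattern):
    m = len(pattern)
    # Per-character bitmask: bit i is set if pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    state = 0
    for j, c in enumerate(text):
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & (1 << (m - 1)):
            yield j - m + 1  # a match starts here

print(list(bitap_search('the catalog of cats', 'cat')))  # [4, 15]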

Related

Hashing algorithms for data summary

I am searching for a non-cryptographic hashing algorithm with a given set of properties, but I do not know how to describe it in Google-able terms.
Problem space: I have a vector of 64-bit integers which are mostly linearly distributed throughout that space. There are two exceptions to this rule: (1) the number 0 occurs considerably frequently, and (2) if a number x occurs, it is more likely to occur again than the 2^-64 baseline would suggest. The goal is, given two vectors A and B, to have a convenient mechanism for quickly detecting whether A and B are not the same. Not all vectors are of fixed size, but any vector I wish to compare to another will have the same size (i.e., a size check is trivial).
The only special requirement I have is that I would like the ability to "back out" a piece of data. In other words, given A[i] = x and hash(A), it should be cheap to compute hash(A) for A[i] = y. Put differently, I want a non-cryptographic hash.
The most reasonable thing I have come up with is this (in Python-ish):
# Imagine this uses a Mersenne Twister or some other seeded RNG...
NUMS = generate_numbers(seed)

def hash(a):
    out = 0
    for idx in range(len(a)):
        out ^= a[idx] ^ NUMS[idx]
    return out

def hash_replace(orig_hash, idx, orig_val, new_val):
    return orig_hash ^ (orig_val ^ NUMS[idx]) ^ (new_val ^ NUMS[idx])
It is an exceedingly simple algorithm and it probably works okay. However, all my experience with writing hashing algorithms tells me somebody else has already solved this problem in a better way.
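A quick sanity check of the two functions above (the NUMS values below are just a stand-in for generate_numbers):

# Sanity check for hash()/hash_replace() above (illustrative usage).
import random

rng = random.Random(42)                      # stand-in for generate_numbers(seed)
NUMS = [rng.getrandbits(64) for _ in range(16)]

a = [0, 17, 0, 123456789]
h = hash(a)

# Swap a[1] from 17 to 99 and patch the hash without rehashing everything.
h_patched = hash_replace(h, 1, 17, 99)
a[1] = 99
assert h_patched == hash(a)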
I think what you are looking for is called a homomorphic hashing algorithm, and it has already been discussed in connection with the Paillier cryptosystem.
As far as I can see from that discussion, there are no practical implementations available nowadays.
The most interesting feature, and the one that I guess fits your needs, is that:
H(x*y) = H(x)*H(y)
Because of that, you can freely define the lower limit of your unit and rely on that property.
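As a toy illustration only (this is not the scheme under discussion): any "hash" of the form H(x) = x mod n already has that multiplicative property, which shows what it buys you:

# Toy multiplicatively homomorphic "hash" (NOT the real scheme; illustration only).
N = 2**61 - 1          # any modulus; a large prime keeps collisions rare

def H(x):
    return x % N

x, y = 123456789, 987654321
# The hash of a product can be computed from the hashes of the factors alone.
assert H(x * y) == (H(x) * H(y)) % N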
I used the Paillier cryptosystem a few years ago during my studies (there was a Java implementation somewhere, but I no longer have the link), but it's far more complex than what you are looking for.
It has an interesting feature under certain constraints, such as the following one:
n*C(x) = C(n*x)
Again, it looks to me similar to what you are looking for, so maybe you should search for this family of hashing algorithms. I'll have a go at Googling for a more specific link.
References:
This one is quite interesting, but maybe it is not a viable solution because your space is [0, 2^64) (unless you accept dealing with big numbers).

Finding all matches from a large set of Strings in a smaller String

I have a large set of Words and Phrases (a lexicon or dictionary) which includes wildcards. I need to find all instances of those Words and Phrases within a much smaller String (~150 characters at the moment).
Initially, I wanted to run the operation in reverse; that is, to check whether each word in my smaller string exists within the Lexicon, which could be implemented as a hash table. The problem is that some of the values in my Lexicon are not single words, and many are wildcards (e.g. substri*).
I'm thinking of using the Rabin-Karp algorithm but I'm not sure this is the best choice.
What's an efficient algorithm or method to perform this operation?
Sample Data:
The dictionary contains hundreds of words and can potentially expand. These words may end with wildcard characters (asterisks). Here are some random examples:
good
bad
freed*
careless*
great loss
The text we are analyzing (at this point) consists of short, informal (grammar-wise) English statements. The prime example of such text (again, at this point in time) would be a Twitter tweet, which is limited to roughly 140 characters. For example:
Just got the Google nexus without a contract. Hands down its the best phone
I've ever had and the only thing that could've followed my N900.
It may be helpful to note that we are performing very simple sentiment analysis on this text; however, our sentiment analysis technique is not my concern. I'm simply migrating an existing solution to a "real-time" processing system and need to perform some optimizations.
I think this is an excellent use case for the Aho-Corasick string-matching algorithm, which is specifically designed to find all matches of a large set of strings in a single string. It runs in two phases: a first phase in which a matching automaton is created (which can be done in advance and requires only linear time), and a second phase in which the automaton is used to find all matches (which requires only linear time, plus time proportional to the total number of matches). The algorithm can also be adapted to support wildcard searching.
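If you don't want to build the automaton yourself, the third-party pyahocorasick package exposes it directly; here is a rough sketch (API names as I recall them, handling literal entries only, not the wildcard adaptation mentioned above):

# Sketch using the pyahocorasick package (pip install pyahocorasick).
import ahocorasick

lexicon = ['good', 'bad', 'great loss']

automaton = ahocorasick.Automaton()
for word in lexicon:
    automaton.add_word(word, word)   # store the word itself as the payload
automaton.make_automaton()           # phase 1: build the matching automaton

text = 'no great loss, just a bad day'
for end_index, word in automaton.iter(text):   # phase 2: scan the text
    start = end_index - len(word) + 1
    print(word, start)               # 'great loss' 3, then 'bad' 22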
Hope this helps!
One answer I wanted to throw out there was the Boyer-Moore search algorithm, which is the algorithm that grep uses. Grep is probably one of the fastest search tools available. In addition, you could use something like GNU Parallel to run grep in parallel, really speeding up the search.
In addition, here is a good article that you might find interesting.
You might still use your original idea of checking every word in the text against the dictionary. However, in order to run it efficiently you need to index the dictionary to make lookups really fast. A trick used in information retrieval systems is to store a so-called permuterm index (http://nlp.stanford.edu/IR-book/html/htmledition/permuterm-indexes-1.html).
Basically what you want to do is to store in the index every rotation of each word (with an end marker appended), e.g. for house:
house$
ouse$h
use$ho
...
e$hous
This index can then be used to quickly answer wildcard queries. If, for example, you have the query ho*e, you can look in the permuterm index for terms beginning with e$ho, and you will quickly find house.
The search itself is usually done with some logarithmic search strategy (binary search or B-trees) and is thus usually very fast.
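A minimal sketch of a permuterm index, with a sorted in-memory list plus bisect standing in for the B-tree (names and data are illustrative):

# Minimal permuterm-index sketch (illustrative; a B-tree or sorted file
# would replace the in-memory list for a large lexicon).
import bisect

def rotations(word):
    w = word + '$'
    return [w[i:] + w[:i] for i in range(len(w))]

def build_permuterm(lexicon):
    index = [(rot, word) for word in lexicon for rot in rotations(word)]
    index.sort()
    return index

def wildcard_lookup(index, query):
    # Single-wildcard query like 'ho*e': rotate it so the '*' ends up at the
    # end (e$ho*), then do a prefix search over the sorted rotations.
    prefix_part, suffix_part = query.split('*')
    key = suffix_part + '$' + prefix_part
    pos = bisect.bisect_left(index, (key, ''))
    matches = set()
    while pos < len(index) and index[pos][0].startswith(key):
        matches.add(index[pos][1])
        pos += 1
    return matches

index = build_permuterm(['house', 'hose', 'horse', 'cat'])
print(wildcard_lookup(index, 'ho*e'))   # {'house', 'hose', 'horse'}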
So long as the patterns are for complete words (you don't want "or" to match inside "storage"; spaces and punctuation are match anchors), a simple way to go is to translate your lexicon into scanner generator input (for example, you could use flex), generate a scanner, and then run it over your input.
Scanner generators are designed to identify occurrences of tokens in input, where each token type is described by a regular expression. Flex and similar programs create scanners rapidly. Flex handles up to 8k rules (in your case lexicon entries) by default, and this can be expanded. The generated scanners run in linear time and in practice are very fast.
Internally, the token regular expressions are transformed in a standard "Kleene's theorem" pipeline: first to an NFA, then to a DFA. The DFA is then transformed to its unique minimum form. This is encoded in an HLL table, which is emitted inside a wrapper that implements the scanner by referring to the table. This is what flex does, but other strategies are possible. For example, the DFA can be translated to goto code where the DFA state is represented implicitly by the instruction pointer as the code runs.
The reason for the spaces-as-anchors caveat is that scanners created by programs like flex are normally incapable of identifying overlapping matches: the input strangers could not match both the pattern strangers and the pattern range, for example.
Here is a flex scanner that matches the example lexicon you gave:
%option noyywrap
%%
"good" { return 1; }
"bad" { return 2; }
"freed"[[:alpha:]]* { return 3; }
"careless"[[:alpha:]]* { return 4; }
"great"[[:space:]]+"loss" { return 5; }
. { /* match anything else except newline */ }
"\n" { /* match newline */ }
<<EOF>> { return -1; }
%%
int main(int argc, char *argv[])
{
    yyin = argc > 1 ? fopen(argv[1], "r") : stdin;
    for (;;) {
        int found = yylex();
        if (found < 0) return 0;
        printf("matched pattern %d with '%s'\n", found, yytext);
    }
}
And to run it:
$ flex -i foo.l
$ gcc lex.yy.c
$ ./a.out
Good men can only lose freedom to bad
matched pattern 1 with 'Good'
matched pattern 3 with 'freedom'
matched pattern 2 with 'bad'
through carelessness or apathy.
matched pattern 4 with 'carelessness'
This doesn't answer the algorithm question exactly, but check out the re2 library. There are good interfaces in Python, Ruby and assorted other programming languages. In my experience, it's blindingly fast, and it removed quite similar bottlenecks from my code with little fuss and virtually no extra code on my part.
The only complexity comes with overlapping patterns. If you want the patterns to start at word boundaries, you should be able to partition the dictionary into a set of regular expressions r_1, r_2, ..., r_k of the form \b(foobar|baz baz\S*|...) where each group in r_{i+1} has a prefix in r_i. You can then short-circuit evaluations, since if r_{i+1} matches then r_i must have matched.
Unless you're implementing your algorithm in highly-optimized C, I'd bet that this approach will be faster than any of the (algorithmically superior) answers elsewhere in this thread.
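A rough sketch of that short-circuiting idea, using Python's standard re module as a stand-in for re2 (the grouping below is purely illustrative):

# Prefix-partitioned short-circuit sketch (illustrative), with the stdlib
# re module standing in for re2.
import re

# r1 holds the prefixes; every alternative in r2 has a prefix in r1.
r1 = re.compile(r'\b(care|free)')
r2 = re.compile(r'\b(careless\S*|freed\S*)')

def matches(text):
    # If the cheap prefix pattern misses, the full pattern cannot match.
    if not r1.search(text):
        return []
    return [m.group(0) for m in r2.finditer(text)]

print(matches('freedom through carelessness'))  # ['freedom', 'carelessness']
print(matches('nothing to see here'))           # []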
Let me get this straight. You have a large set of queries and one small string and you want to find instances of all those queries in that string.
In that case I suggest you index that small document like crazy so that your search time is as short as possible. Hell. With that document size I would even consider doing small mutations (to match wildcards and so on) and would index them as well.
I had a very similar task.
Here's how I solved it; the performance is unbelievable:
http://www.foibg.com/ijita/vol17/ijita17-2-p03.pdf

most readable way in XPath to write "is value X a member of sequence S"?

XPath 2.0 has some new functions and syntax, relative to 1.0, that work with sequences. Some of these don't really add to what the language could already do in 1.0 (with node sets), but they make it easier to express the desired logic in ways that are more readable. This increases the chances of the programmer getting the code correct -- and keeping it that way. For example,
empty(s) is equivalent to not(s), but its intent is much clearer when you want to test whether a sequence is empty.
Correction: the effective boolean value of a sequence is in general more complicated than that. E.g. empty((0)) != not((0)). This applies to exists(s) vs. s in a boolean context as well. However, there are domains of s where empty(s) is equivalent to not(s), so the two could be used interchangeably within those domains. But this goes to show that the use of empty() can make a non-trivial difference in making code easier to understand.
Similarly, exists(s) is equivalent to boolean(s) that already existed in XPath 1.0 (or just s in a boolean context), but again is much clearer about the intent.
Quantified expressions; e.g. "some $x in expression satisfies test($x)" would be equivalent to boolean(expression[test(.)]) (although the new syntax is more flexible, in that you don't need to worry about losing the context item because you have the variable to refer to it by).
Similarly, "every $x in expression satisfies test($x)" would be equivalent to not(expression[not(test(.))]) but is more readable.
These functions and syntax were evidently added at no small cost, solely to serve the goal of writing XPath that is easier to map to how humans think. This implies, as experienced developers know, that understandable code is significantly superior to code that is difficult to understand.
Given all that ... what would be a clear and readable way to write an XPath test expression that asks
Does value X occur in sequence S?
Some ways to do it: (Note: I used X and S notation here to indicate the value and the sequence, but I don't mean to imply that these subexpressions are element name tests, nor that they are simple expressions. They could be complicated.)
X = S: This would be one of the most unreadable, since it requires the reader to
think about which of X and S are sequences vs. single values
understand general comparisons, which are not obvious from the syntax
However, one advantage of this form is that it allows us to put the topic (X) before the comment ("is a member of S"), which, I think, helps in readability.
See also CMS's good point about readability, when the syntax or names make the "cardinality" of X and S obvious.
index-of(S, X): This one is clear about what's intended as a value and what as a sequence (if you remember the order of arguments to index-of()). But it expresses more than we need to: it asks for the index, when all we really want to know is whether X occurs in S. This is somewhat misleading to the reader. An experienced developer will figure out what's intended, with some effort and with understanding of the context. But the more we rely on context to understand the intent of each line, the more understanding the code becomes a circular (spiral) and potentially Sisyphean task! Also, since index-of() is designed to return a list of all the indexes of occurrences of X, it could be more expensive than necessary: a smart processor, in order to evaluate X = S, wouldn't necessarily have to find all the contents of S, nor enumerate them in order; but for index-of(S, X), correct order would have to be determined, and all contents of S must be compared to X. One other drawback of using index-of() is that it's limited to using eq for comparison; you can't, for example, use it to ask whether a node is identical to any node in a given sequence.
Correction: This form, used as a conditional test, can result in a runtime error: Effective boolean value is not defined for a sequence of two or more items starting with a numeric value. (But at least we won't get wrong boolean values, since index-of() can't return a zero.) If S can have multiple instances of X, this is another good reason to prefer form 3 or 6.
exists(index-of(S, X)): makes the intent clearer, and would help the processor eliminate the performance penalty if the processor is smart enough.
some $m in S satisfies $m eq X: This one is very clear, and matches our intent exactly. It seems long-winded compared to 1, and that in itself can reduce readability. But maybe that's an acceptable price for clarity. Keep in mind that X and S could potentially be complex expressions themselves -- they're not necessarily just variable references. An advantage is that since the eq operator is explicit, you can replace it with is or any other comparison operator.
S[. eq X]: clearer than 1, but shares the semantic drawbacks of 2: it computes all members of S that are equal to X. Actually, this could return a false negative (incorrect effective boolean value), if X is falsy. E.g. (0, 1)[. eq 0] returns 0 which is falsy, even though 0 occurs in (0, 1).
exists(S[. eq X]): Clearer than 1, 2, 3, and 5. Not as clear as 4, but shorter. Avoids the drawbacks of 5 (or at least most of them, depending on the processor smarts).
I'm kind of leaning toward the last one, at this point: exists(S[. eq X])
What about you... As a developer coming to a complex, unfamiliar XSLT or XQuery or other program that uses XPath 2.0, and wanting to figure out what that program is doing, which would you find easiest to read?
Apologies for the long question. Thanks for reading this far.
Edit: I changed = to eq wherever possible in the above discussion, to make it easier to see where a "value comparison" (as opposed to a general comparison) was intended.
For what it's worth, if names or context make clear that X is a singleton, I'm happy to use your first form, X = S -- for example when I want to check an attribute value against a set of possible values:
<xsl:when test="#type = ('A', 'A+', 'A-', 'B+')" />
or
<xsl:when test="#type = $magic-types"/>
If I think there is a risk of confusion, then I like your sixth formulation. The less frequently I have to remember the rules for calculating an effective boolean value, the less frequently I make a mistake with them.
I prefer this one:
count(distinct-values($seq)) eq count(distinct-values(($x, $seq)))
When $x is itself a sequence, this expression implements the (value-based) subset-of relation between two sets of values represented as sequences. This implementation of subset-of has just linear time complexity, vs. many other ways of expressing it that have O(N^2) time complexity.
To summarize, the question whether a single value belongs to a set of values is a special case of the question whether one set of values is a subset of another. If we have a good implementation of the latter, we can simply use it for answering the former.
The functx library has a nice implementation of this function, so you can use
functx:is-node-in-sequence($X, $Y)
(this particular function can be found at http://www.xqueryfunctions.com/xq/functx_is-node-in-sequence.html)
The whole functx library is available for both XQuery (http://www.xqueryfunctions.com/) and XSLT (http://www.xsltfunctions.com/)
Marklogic ships the functx library with their core product; other vendors may also.
Another possibility, when you want to know whether node X occurs in sequence S, is
exists((X) intersect S)
I think that's pretty readable, and concise. But it only works when X and the values in S are nodes; if you try to ask
exists(('bob') intersect ('alice', 'bob'))
you'll get a runtime error.
In the program I'm working on now, I need to compare strings, so this isn't an option.
As Dimitri notes, the occurrence of a node in a sequence is a question of identity, not of value comparison.

Algorithms to represent a set of integers with only one integer

This may not be a programming question, but it's a problem that arose recently at work. Some background: big C development with a special interest in performance.
I have a set of integers and want to test another given integer for membership. I would love to implement an algorithm that can check this with a minimal set of algebraic functions, using only a single integer to represent the whole space of integers contained in the first set.
I've tried a composite Cantor pairing function, for instance, but with a 30-element set it seems too complicated, and with a focus on performance it makes no sense. I played with some operations, like XORing and negating, but they give me poor estimates of membership. Then I tried successions of additions and finally got lost.
Any ideas?
For sets of unsigned long of size 30, the following is one fairly obvious way to do it:
store each set as a sorted array, 30 * sizeof(unsigned long) bytes per set.
to look up an integer, do a few steps of a binary search, followed by a linear search (profile in order to figure out how many steps of binary search is best -- my wild guess is 2 steps, but you might find out differently, and of course if you test bsearch and it's fast enough, you can just use it); see the sketch below.
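Here is a sketch of that lookup (in Python for brevity; the real thing would be a plain C array and a hand-rolled loop):

# "A few binary-search steps, then a linear scan" over a small sorted set
# (illustrative sketch; tune binary_steps by profiling).
def contains(sorted_set, x, binary_steps=2):
    lo, hi = 0, len(sorted_set)
    for _ in range(binary_steps):      # a couple of halving steps...
        if lo >= hi:
            break
        mid = (lo + hi) // 2
        if sorted_set[mid] < x:
            lo = mid + 1
        else:
            hi = mid
    # ...then finish with a linear scan over the small remaining range.
    return any(v == x for v in sorted_set[lo:hi + 1])

s = sorted([3, 7, 11, 19, 23, 42, 57, 61, 99])
print(contains(s, 42), contains(s, 13))   # True False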
So the next question is why you want a big-maths solution, which will tell me what's wrong with this solution other than "it is insufficiently pleasing".
I suspect that any big-math solution will be slower than this. A single arithmetic operation on an N-digit number takes at least linear time in N. A single number to represent a set can't be very much smaller than the elements of the set laid end to end with a separator in between. So even a linear search in the set is about as fast as a single arithmetic operation on a big number. With the possible exception of a Goedel representation, which could do it in one division once you've found the nth prime number, any clever mathematical representation of sets is going to take multiple arithmetic operations to establish membership.
Note also that there are two different reasons you might care about the performance of "look up an integer in a set":
You are looking up lots of different integers in a single set, in which case you might be able to go faster by constructing a custom lookup function for that data. Of course in C that means you need either (a) a simple virtual machine to execute that "function", or (b) runtime code generation, or (c) to know the set at compile time. None of which is necessarily easy.
You are looking up the same integer in lots of different sets (to get a sequence of all the sets it belongs to), in which case you might benefit from a combined representation of all the sets you care about, rather than considering each set separately.
I suppose that very occasionally, you might be looking up lots of different integers, each in a different set, and so neither of the reasons applies. If this is one of them, you can ignore that stuff.
One good start is to try Bloom Filters.
Basically, it's a probabilistic data structure that gives you no false negatives, but some false positives. So when an integer matches a Bloom filter, you then have to check whether it really is in the set, but it's a big speedup because it greatly reduces the number of sets to check.
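A minimal Bloom filter sketch (the sizes and hash choices below are arbitrary, chosen only to show the shape of the idea):

# Minimal Bloom filter sketch (parameters are arbitrary, for illustration).
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # an int doubles as a bit array

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f'{i}:{item}'.encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means "definitely not in the set"; True means "maybe".
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
for x in (3, 7, 11, 19, 23, 42):
    bf.add(x)
print(bf.might_contain(42), bf.might_contain(1000))  # True, (almost surely) False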
If I've understood you correctly, here's a Python example:
>>> a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
>>> len_a = len(a)
>>> b = [1]
>>> if len(set(a) - set(b)) < len_a:
...     print('this integer exists in set')
...
this integer exists in set
math base: http://en.wikipedia.org/wiki/Euler_diagram

Will this obfuscation algorithm for a URL shortener work?

DISCLAIMER: I am not asking how to make a URL shortener (I have already implemented the "bijective function" answer found HERE that uses a base-62 encoded string). Instead, I want to expand this implementation to obfuscate the generated string so that it is both:
A) not an easily guessable sequence, and
B) still bijective.
You can easily randomize your base-62 character set, but the problem is that it still increments like any other number in any other base. For example, one possible incremental progression might be {aX9fgE, aX9fg3, aX9fgf, aX9fgR, …}
I have come up with an obfuscation technique that I am pleased with in terms of requirement A), but I'm only partially sure that it satisfies B). The idea is this:
The only thing that is guaranteed to change in the incremental approach is the "1's place" (I'll use decimal terminology for practicality reasons). In the sample progression I gave earlier, that would be {E, 3, f, R, …}. So if each character in the base-62 set had its own unique offset number (say, its distance from the "zero character"), then you could apply the offset of the "1's place" character to the rest of the string.
For instance, let's assume a base-6 set with the characters {A, f, 9, p, Z, 3} (in ascending order from 0 to 5). Each one would then have a unique offset of 0 to 5, respectively. Counting would look like {A, f, 9, p, Z, 3, fA, ff, f9, fp, …} and so on. So the algorithm, when given a value of fZ3p, would look at the p and, it having an offset of +3, would permute the string into Zf9p (treating the character set as a circular array). The next incremental number would be fZ3Z, and with Z's offset being +4, the algorithm returns 39pZ. These permuted results would be handed off to the user as his/her "unique URL", and the user would never see the actual base-62 encoded string.
This approach certainly seems reversible; just look at the last character, and perform the same permutation with the negative offset. And I'm thinking that for this reason, it has to still be bijective. But I don't know if this is necessarily true? Are there any edge/corner cases I'm not considering?
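Here is a quick sketch of the scheme described above (the alphabet and helper names are only for this example); it round-trips, which is what makes me think it is reversible:

# Sketch of the offset-permutation described above (illustrative only).
ALPHABET = ['A', 'f', '9', 'p', 'Z', '3']          # "zero character" first
OFFSET = {c: i for i, c in enumerate(ALPHABET)}
BASE = len(ALPHABET)

def shift(s, delta):
    # Rotate every character except the last one by delta positions.
    head = ''.join(ALPHABET[(OFFSET[c] + delta) % BASE] for c in s[:-1])
    return head + s[-1]

def obfuscate(s):
    return shift(s, OFFSET[s[-1]])

def deobfuscate(s):
    return shift(s, -OFFSET[s[-1]])

print(obfuscate('fZ3p'))                # Zf9p, as in the example above
print(deobfuscate(obfuscate('fZ3Z')))   # fZ3Z: the mapping round-trips

Since the last character is left in place and fully determines the shift, and shifting over a circular alphabet is invertible, each fixed-length string maps to a unique string of the same length.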
EDIT: My intentions are weighted more heavily towards the length of the shortened URL than the security of the pattern. I realize there are plenty of solutions involving cryptographic functions, block ciphers, etc. But I would like to emphasize that I am not asking for the best way to achieve A), but rather, "does my offset approach satisfy B)?"
Any holes you can find would be appreciated.
If you honestly want them to be hard to guess, keep it simple.
Start with a normal encryption algorithm running in counter mode. When you get a URL to shorten, increment your counter, encrypt it, convert the result to something using printable characters (e.g., base 64) and put the original URL and the shortened version into your table so you can get the original URL from the shortened version when needed.
The only real question at that point is what encryption algorithm to use. That, in turn, depends on your threat model. I don't see exactly what you gain by making shortened URLs hard to guess, so I'm a bit uncertain about the threat model.
If you want to make it mildly difficult to guess, you could use something like a 40-bit version of RC4. This is pretty easy to break, but enough to keep most people from bothering.
If you want a bit more security, you could step up to DES. That's been broken, but even at this late date breaking it is quite a bit of work.
If you want more security than that, you can use AES.
Note that as you increase the security, the shortened URL gets longer, though. RC4-40, being a stream cipher, produces output only as long as the counter you feed it (a few bytes), but DES works on 8-byte blocks and AES on 16-byte blocks. Depending on how you convert to printable text, that's going to expand at least a little.
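For concreteness, here is a sketch of the AES variant of this idea (it assumes a recent version of the third-party cryptography package; key handling and the lookup table are omitted):

# "Encrypt an incrementing counter" sketch with AES (illustrative only).
# Assumes the third-party 'cryptography' package: pip install cryptography
import base64
import os

from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

KEY = os.urandom(16)      # in reality, a fixed secret key you keep somewhere safe
counter = 0

def shorten_token():
    global counter
    counter += 1
    block = counter.to_bytes(16, 'big')                    # exactly one AES block
    encryptor = Cipher(algorithms.AES(KEY), modes.ECB()).encryptor()
    ciphertext = encryptor.update(block) + encryptor.finalize()
    return base64.urlsafe_b64encode(ciphertext).rstrip(b'=').decode()

print(shorten_token())    # 22 printable characters, hard to predict from the last one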
Another option is to use the Luby-Rackoff construction (see also here), which is a way to generate a pseudo-random permutation from a pseudo-random function.
You just have to pick a "round function" F. F must take as input a key K and a block of bits half the size of what you are encoding. F must produce as output a block of bits also half the size of whatever you are encoding.
Then you just run the Luby-Rackoff construction (aka. "Feistel network") for four rounds, each round using a different K.
The construction guarantees that the result is a bijective map, and it will be hard to invert provided that F is hard to invert.
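A small sketch of a four-round Feistel network over 32-bit values; the round function F below is an arbitrary keyed mix chosen only for illustration:

# Four-round Feistel network over 32-bit values (illustrative sketch).
def F(half, key):
    # Arbitrary keyed round function producing a 16-bit output.
    return (half * 0x9E3779B1 + key) & 0xFFFF

KEYS = [0x3A7F, 0x1C45, 0x7B21, 0x5ED3]      # one subkey per round

def encrypt(n):
    left, right = n >> 16, n & 0xFFFF
    for k in KEYS:
        left, right = right, left ^ F(right, k)
    return (left << 16) | right

def decrypt(n):
    left, right = n >> 16, n & 0xFFFF
    for k in reversed(KEYS):
        left, right = right ^ F(left, k), left
    return (left << 16) | right

for i in range(1000):
    assert decrypt(encrypt(i)) == i          # bijective: every input round-trips
print([encrypt(i) for i in range(3)])        # scrambled, non-sequential outputs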
I tried to solve the same problem (in PHP) and ended up with these functions:
hashing the row id with a kind of Feistel algorithm
applying a bijective function to compress the integer
So for A): it's not easily guessable (to me), as you can't just increment a string to get the next record without knowing the algorithm
And for B): as far as I understand, it's 100% bijective.
Thanks to @Nemo for naming the Feistel network, which led me to the first function I have linked to.
If you're trying to avoid people crawling the URLs, I think Nick Johnson has the right idea, that you need to make sure your URL space is not dense.
Here's a simple idea: take your URL, and prepend a few random characters to it. Then run it through a compression algorithm -- I'd try range encoding (you can probably specify the basis if you find a good library). This should be decompressible to the original form, and should both impact locality and make the encoded space more sparse.
That said, I imagine nearly all URL shorteners out there keep a hash table with state on the server side. How else are you going to losslessly compress a hundred-character URL into 5 or 6 characters?
