Finding all matches from a large set of Strings in a smaller String - algorithm

I have a large set of Words and Phrases (a lexicon or dictionary) which includes wildcards. I need to find all instances of those Words and Phrases within a much smaller String (~150 characters at the moment).
Initially, I wanted to run the operation in reverse; that is to check to see if each word in my smaller string exists within the Lexicon, which could be implemented as a Hash Table. The problem is that some of these values in my Lexicon are not single words and many are wildcards (e.g. substri*).
I'm thinking of using the Rabin-Karp algorithm but I'm not sure this is the best choice.
What's an efficient algorithm or method to perform this operation?
Sample Data:
The dictionary contains hundreds of words and can potentially expand. These words may end with wildcard characters (asterisks). Here are some random examples:
good
bad
freed*
careless*
great loss
The text we are analyzing (at this point) are short, informal (grammar-wise) English statements. The prime example of text (again, at this point in time) would be a Twitter Tweet. These are limited to roughly 140 Characters. For example:
Just got the Google nexus without a contract. Hands down its the best phone
I've ever had and the only thing that could've followed my N900.
While it may be helpful to note that we are performing very simple Sentiment Analysis on this text; our Sentiment Analysis technique is not my concern. I'm simply migrating an existing solution to a "real-time" processing system and need to perform some optimizations.

I think this is an excellent use case for the Aho-Corasick string-matching algorithm, which is specifically designed to find all matches of a large set of strings in a single string. It runs in two phases - a first phase in which a matching automaton is created (which can be done in advance and requires only linear time), and a second phase in which the automaton is used to find all matches (which requires only linear time, plus time proportional to the total number of matches). The algorithm can also be adapted to support wildcard searching as well.
Hope this helps!

One answer I wanted to throw out there was the Boyer-Moore search algorithm. It is the algorithm that grep uses. Grep is probably one of the fastest search tools available. In addition, you could use something like GNU Parallel to have grep run in parallel, thus really speeding up the algorithm.
In addition, here is a good article that you might find interesting.

You might still use your original idea, of checking every word in the text against the dictionary. However, in order to run it efficently you need to index the dictionary, to make lookups really fast. A trick used in information retrival systems is to store a so called permuterm index (http://nlp.stanford.edu/IR-book/html/htmledition/permuterm-indexes-1.html).
Basically what you want to do is to store in a dictionary every possible permutation of the words (e.g. for house):
house$
ouse$h
use$ho
...
e$hous
This index can then be used to quickly check against wildcard queries. If for example you have ho*e you can look in the permuterm index for a term beginning with e$ho and you would quickly find a match with house.
The search itself is usually done with some logarithmic search strategy (binary search or b-trees) and thus is usally very fast.

So long as the patterns are for complete words: you don't want or to match storage; spaces and punctuation are match anchors, then a simple way to go is translate your lexicon into a scanner generator input (for example you could use flex), generate a scanner, and then run it over your input.
Scanner generators are designed to identify occurrences of tokens in input, where each token type is described by a regular expression. Flex and similar programs create scanners rapidly. Flex handles up to 8k rules (in your case lexicon entries) by default, and this can be expanded. The generated scanners run in linear time and in practice are very fast.
Internally, the token regular expressions are transformed in a standard "Kleene's theorem" pipeline: First to a NFA, then to a DFA. The DFA is then transformed to its unique minimum form. This is encoded in a HLL table, which is emitted inside a wrapper that implements the scanner by referring to the table. This is what flex does, but other strategies are possible. For example the DFA can be translated to goto code where the DFA state is represented implicitly by the instruction pointer as the code runs.
The reason for the spaces-as-anchors caveat is that scanners created by programs like Flex are normally incapable of identifying overlapping matches: strangers could not match both strangers and range, for example.
Here is a flex scanner that matches the example lexicon you gave:
%option noyywrap
%%
"good" { return 1; }
"bad" { return 2; }
"freed"[[:alpha:]]* { return 3; }
"careless"[[:alpha:]]* { return 4; }
"great"[[:space:]]+"loss" { return 5; }
. { /* match anything else except newline */ }
"\n" { /* match newline */ }
<<EOF>> { return -1; }
%%
int main(int argc, char *argv[])
{
yyin = argc > 1 ? fopen(argv[1], "r") : stdin;
for (;;) {
int found = yylex();
if (found < 0) return 0;
printf("matched pattern %d with '%s'\n", found, yytext);
}
}
And to run it:
$ flex -i foo.l
$ gcc lex.yy.c
$ ./a.out
Good men can only lose freedom to bad
matched pattern 1 with 'Good'
matched pattern 3 with 'freedom'
matched pattern 2 with 'bad'
through carelessness or apathy.
matched pattern 4 with 'carelessness'

This doesn't answer the algorithm question exactly, but check out the re2 library. There are good interfaces in Python, Ruby and assorted other programming languages. In my experience, it's blindly fast, and removed quite similar bottlenecks in my code with little fuss and virtually no extra code on my part.
The only complexity comes with overlapping patterns. If you want the patterns to start at word boundaries, you should be able partition the dictionary into a set of regular expressions r_1, r_2, ..., r_k of the form \b(foobar|baz baz\S*|...) where each group in r_{i+1} has a prefix in r_i. You can then short circuit evaluations since if r_{i+1} matches then r_i must have matched.
Unless you're implementing your algorithm in highly-optimized C, I'd bet that this approach will be faster than any of the (algorithmically superior) answers elsewhere in this thread.

Let me get this straight. You have a large set of queries and one small string and you want to find instances of all those queries in that string.
In that case I suggest you index that small document like crazy so that your search time is as short as possible. Hell. With that document size I would even consider doing small mutations (to match wildcards and so on) and would index them as well.

I had a very similar task.
Here's how I solved it, the performance is unbelievable
http://www.foibg.com/ijita/vol17/ijita17-2-p03.pdf

Related

Better or Faster: String.Contains vs List.Contains

I know this is a stupid question, but I feel like someone might want to know (or inform a <redacted> co-worker about) this. I am not attaching a specific programming language to this since I think it could apply to all of them. Correct me if I am wrong about that.
Main question: Is it faster and/or better to look for entries in a constant String or in a List<String>?
Details: Let's say that I want to see if a given extension is in a list of supported extensions. Which of the following is the best (in regards to programming style) and/or fastest:
const String SUPPORTED = ".exe.bin.sh.png.bmp" /*etc*/;
public static bool IsSupported(String ext){
ext = Normalize(ext);//put the extension in some expected state like lowercase
//String.Contains
return SUPPORTED.Contains(ext);
}
final static List<String> SUPPORTED = MakeAnImmutableList(".exe", ".bin", ".sh",
".png",".bmp" /*etc*/);
public static bool IsSupported(String ext){
ext = Normalize(ext);//put the extension in some expected state like lowercase
//List.Contains
return SUPPORTED.Contains(ext);
}
First, it is important to note that the solutions are not functionally equivalent. The substring search will return true for strings like x and exe.bin while the List<String>.contains() will not. In that sense the List<String> version is likely to be closer to the semantic you want. Any possible performance comparison should keep that in mind.
Now, on to performance.
Theoretical
From an asymptotic and algorithm complexity point of view, the List<String>.contains() approach will be faster than the other alternative as the length of the strings grows. Conceptually, the String.contains version needs to look for a match at each position in the SUPPORTED String, while the List.contains() version only needs to match starting at the start of each substring - as soon as finds a mismatch in the current candidate, it skips to the next. This is related to the above note that the options aren't functionally equivalent: the String.contains options can in theory match a much wider universe of inputs, so it has to do more work before rejecting candidates.
Complexity-wise, this difference could be something like O(N) for List.contains() versus O(N^2) for String.contains() if you take N to be the number of candidates and assume each candidate has a bounded length, and to the String.contains() method in the usual brute-force "look for a match starting at each position" algorithm. As it turns out, the Java String.contains() implementation isn't exactly doing the basic O(N^2) search, but it isn't doing Boyer-Moore either. In general you can expect that once the substrings get long enough, the List.String approach will be faster.
Close(r) to the Metal
At a closer to the metal perspective, both approaches have their advantages. The String.contains() solution avoids the overhead of iterating over the List elements: the entire call will be spent in the intrinsified String.contains implementation, and all the chars making up the entire SUPPORTED Strings are contiguous, so this is memory-friendly. The List.contains() approach will spend a lot of time doing the double-dereferencing needed to go from each List element to the contained String and then to the contained char[] array, and this is likely to dominate if the strings you are comparing against are very short.
On other hand, the List.contains solution ultimately calls into String.equals which is likely implemented in terms of Arrays.equal(char[], char[]) which is heavily optimized with SSE and AVX intrinsics on x86 platforms and likely to be blazing fast, even compared to the optimized version of String.contains(). So if the Strings become long, again expect List.contains() to pull ahead.
All that said, there is a simple, canonical way to do this quickly: a HashSet<String> with all the candidate strings. That's just a simple String.hash() (which is cached and so often "free") and a single lookup in the hash table.
Well, it can vary from implementation to implementation, but if want to look to this problem in a generalized way, lets see.
If you want to look for a specific sub-string inside a string, let say a file extension inside a immutable string containing different extensions, you just need to traverse the string with extensions once.
In the other hand, with a list of immutable strings, still need to traverse each one of the string in that list plus the overhead of iterate over that list.
As a conclusion, in a generalized way, you can see that using a list to store the strings need more processing.
But, you can look to both solutions by its readability, maintainability, etc. For example, if you want to add or remove new extensions or apply more complex operations, may worth the overhead using a list of string.

GoLang PoS Tagger script taking longer than it should with no output in terminal

This script is compling without errors in play.golang.org: http://play.golang.org/p/Hlr-IAc_1f
But when I run in on my machine, much longer than I expect happens with nothing happening in the terminal.
What I am trying to build is a PartOfSpeech Tagger.
I think the longest part is loading lexicon.txt into a map and then comparing each word with every word there to see if it has already been tagged in the lexicon. The lexicon only contains verbs. But doesn't every word need to be checked to see if it is a verb.
The larger problem is that I don't know how to determine if a word is a verb with an easy heuristic like adverbs, adjectives, etc.
(Quoting):
I don't know how to determine if a word is a verb with an easy heuristic like adverbs, adjectives, etc.
I can't speak to any issues in your Go implementation, but I'll address the larger problem of POS tagging in general. It sounds like you're attempting to build a rule-based unigram tagger. To elaborate a bit on those terms:
"unigram" means you're considering each word in the sentence separately. Note that a unigram tagger is inherently limited, in that it cannot disambiguate words which can take on multiple POS tags. E.g., should you tag 'fish' as a noun or a verb? Is 'last' a verb or an adverb?
"rule-based" means exactly what it sounds like: a set of rules to determine the tag for each word. Rule-based tagging is limited in a different way - it requires considerable development effort to assemble a ruleset that will handle a reasonable portion of the ambiguity in common language. This effort might be appropriate if you're working in a language for which we don't have good training resources, but in most common languages, we now have enough tagged text to train high-accuracy tagging models.
State-of-the-art for POS tagging is above 97% accuracy on well-formed newswire text (accuracy on less formal genres is naturally lower). A rule-based tagger will probably perform considerably worse (you'll have to determine the accuracy level needed to meet your requirements). If you want to continue down the rule-based path, I'd recommend reading this tutorial. The code is based on Haskell, but it will help you learn the concepts and issues in rule-based tagging.
That said, I'd strongly recommend you look at other tagging methods. I mentioned the weaknesses of unigram tagging. Related approaches would be 'bigram', meaning that we consider the previous word when tagging word n, 'trigram' (usually the previous 2 words, or the previous word, the current word, and the following word); more generally, 'n-gram' refers to considering a sequence of n words (often, a sliding window around the word we're currently tagging). That context can help us disambiguate 'fish', 'last', 'flies', etc.
E.g., in
We fish
we probably want to tag fish as a verb, whereas in
ate fish
it's certainly a noun.
The NLTK tutorial might be a good reference here. An solid n-gram tagger should get you above 90% accuracy; likely above 95% (again on newswire text).
More sophisticated methods (known as 'structured inference') consider the entire tag sequence as a whole. That is, instead of trying to predict the most probable tag for each word separately, they attempt to predict the most probable sequence of tags for the entire input sequence. Structured inference is of course more difficult to implement and train, but will usually improve accuracy vs. n-gram approaches. If you want to read up on this area, I suggest Sutton and McCallum's excellent introduction.
You've got a large array argument in this function:
func stringInArray(a string, list [214]string) bool{
for _, b := range list{
if b == a{
return true;
}
}
return false
}
The array of stopwords gets copied each time you call this function.
Mostly in Go, you should uses slices rather than arrays most of the time. Change the definition of this to be list []string and define stopWords as a slice rather than an array:
stopWords := []string{
"and", "or", ...
}
Probably an even better approach would be to build a map of the stopWords:
isStopWord := map[string]bool{}
for _, sw := range stopWords {
isStopWord[sw] = true
}
and then you can check if a word is a stopword quickly:
if isStopWord[word] { ... }

How to fuzzily search for a dictionary word?

I have read a lot of threads here discussing edit-distance based fuzzy-searches, which tools like Elasticsearch/Lucene provide out of the box, but my problem is a bit different. Suppose I have a dictionary of words, {'cat', 'cot', 'catalyst'}, and a character similarity relation f(x, y)
f(x, y) = 1, if characters x and y are similar
= 0, otherwise
(These "similarities" can be specified by the programmer)
such that, say,
f('t', 'l') = 1
f('a', 'o') = 1
f('f', 't') = 1
but,
f('a', 'z') = 0
etc.
Now if we have a query 'cofatyst', the algorithm should report the following matches:
('cot', 0)
('cat', 0)
('catalyst', 0)
where the number is the 0-based starting index of the match found. I have tried the Aho-Corasick algorithm, and while it works great for exact matching and in the case when a character has relatively less number of "similar" characters, its performance drops exponentially as we increase the number of similar characters for a character. Can anyone point me to a better way of doing this? Fuzziness is an absolute necessity, and it must take in to account character similarities(i.e., not blindly depend on just edit-distances).
One thing to note is that in the wild, the dictionary is going to be really large.
I might try to use the cosine similarity using the position of each character as a feature and mapping the product between features using a match function based on your character relations.
Not a very specific advise, I know, but I hope it helps you.
edited: Expanded answer.
With the cosine similarity, you will compute how similar two vectors are. In your case the normalisation might not make sense. So, what I would do is something very simple (I might be oversimplifying the problem): First, see the matrix of CxC as a dependency matrix with the probability that two characters are related (e.g., P('t' | 'l') = 1). This will also allow you to have partial dependencies to differentiate between perfect and partial matches. After this I will compute, for each position the probability that the letter from each word is not the same (using the complement of P(t_i, t_j)) and then you can just aggregate the results using a sum.
It will count the number of terms that are different for a specific pair of words, and it allows you to define partial dependencies. Furthermore, the implementation is very simple and should scale well. This is why I am not sure if I misunderstood your question.
I am using Fuse JavaScript Library for a project of mine. It is a javascript file which works on JSON dataset. It is quite fast. Have a look at it.
It has implemented a full Bitap algorithm, leveraging a modified version of the Diff, Match & Patch tool by Google(from his site).
The code is simple to understand the algorithm implementation done.

Similar substring fast search

I need to find a substring SIMILAR to a given pattern in a huge string. The source huge string could be up to 100 Mb in length. The pattern is rather short (10-100 chars). The problem is that I need to find not only exact substrings but also similar substrings that differ from the pattern in several chars (the maximum allowed error count is provided as a parameter).
Is there any idea how to speed up the algorithm?
1)There are many Algorithms related to the string searching. One of them is the famous Knuth–Morris–Pratt Algorithm.
2) You may also want to check Regular Expressions ("Regex") in whatever the language you're using. They will sure help you finding substrings 'similar' to the original ones.
i.e. [Java]
String pat = "Home";
String source = "IgotanewHwme";
for(int i = 0; i < pat.length(); i++){
//split around i .. not including char i itself .. instead, replace it with [a-zA-Z] and match using this new pattern.
String new_pat = "("+pat.substring(0, i)+")"+ "[a-zA-Z]" + "("+pat.substring(i+1, pat.length())+")";
System.out.println(new_pat);
System.out.println(source.matches("[a-zA-Z]*"+new_pat+"[a-zA-Z]*"));
}
and I think it's easy to make it accept any number of error counts.
Sounds like you want Fuzzy/Approximate String Matching. Have a look at the Wikipedia page and see if you can't find an algorithm that suits your needs.
You can have a look at the Levenshtein distance, the Needleman–Wunsch algorithm and the Damerau–Levenshtein distance
They give you metrics evaluating the amount of differences between two strings (ie the numbers of addition, deletion, substitution, etc.). They are often used to measure the variation between DNA.
You will easily find implementations in various languages.

Looking for algorithm that reverses the sprintf() function output

I am working on a project that requires the parsing of log files. I am looking for a fast algorithm that would take groups messages like this:
The temperature at P1 is 35F.
The temperature at P1 is 40F.
The temperature at P3 is 35F.
Logger stopped.
Logger started.
The temperature at P1 is 40F.
and puts out something in the form of a printf():
"The temperature at P%d is %dF.", Int1, Int2"
{(1,35), (1, 40), (3, 35), (1,40)}
The algorithm needs to be generic enough to recognize almost any data load in message groups.
I tried searching for this kind of technology, but I don't even know the correct terms to search for.
I think you might be overlooking and missed fscanf() and sscanf(). Which are the opposite of fprintf() and sprintf().
Overview:
A naïve!! algorithm keeps track of the frequency of words in a per-column manner, where one can assume that each line can be separated into columns with a delimiter.
Example input:
The dog jumped over the moon
The cat jumped over the moon
The moon jumped over the moon
The car jumped over the moon
Frequencies:
Column 1: {The: 4}
Column 2: {car: 1, cat: 1, dog: 1, moon: 1}
Column 3: {jumped: 4}
Column 4: {over: 4}
Column 5: {the: 4}
Column 6: {moon: 4}
We could partition these frequency lists further by grouping based on the total number of fields, but in this simple and convenient example, we are only working with a fixed number of fields (6).
The next step is to iterate through lines which generated these frequency lists, so let's take the first example.
The: meets some hand-wavy criteria and the algorithm decides it must be static.
dog: doesn't appear to be static based on the rest of the frequency list, and thus it must be dynamic as opposed to static text. We loop through a few pre-defined regular expressions and come up with /[a-z]+/i.
over: same deal as #1; it's static, so leave as is.
the: same deal as #1; it's static, so leave as is.
moon: same deal as #1; it's static, so leave as is.
Thus, just from going over the first line we can put together the following regular expression:
/The ([a-z]+?) jumps over the moon/
Considerations:
Obviously one can choose to scan part or the whole document for the first pass, as long as one is confident the frequency lists will be a sufficient sampling of the entire data.
False positives may creep into the results, and it will be up to the filtering algorithm (hand-waving) to provide the best threshold between static and dynamic fields, or some human post-processing.
The overall idea is probably a good one, but the actual implementation will definitely weigh in on the speed and efficiency of this algorithm.
Thanks for all the great suggestions.
Chris, is right. I am looking for a generic solution for normalizing any kind of text. The solution of the problem boils down to dynmamically finding patterns in two or more similar strings.
Almost like predicting the next element in a set, based on the previous two:
1: Everest is 30000 feet high
2: K2 is 28000 feet high
=> What is the pattern?
=> Answer:
[name] is [number] feet high
Now the text file can have millions of lines and thousands of patterns. I would like to parse the files very, very fast, find the patterns and collect the data sets that are associated with each pattern.
I thought about creating some high level semantic hashes to represent the patterns in the message strings.
I would use a tokenizer and give each of the tokens types a specific "weight".
Then I would group the hashes and rate their similarity. Once the grouping is done I would collect the data sets.
I was hoping, that I didn't have to reinvent the wheel and could reuse something that is already out there.
Klaus
It depends on what you are trying to do, if your goal is to quickly generate sprintf() input, this works. If you are trying to parse data, maybe regular expressions would do too..
You're not going to find a tool that can simply take arbitrary input, guess what data you want from it, and produce the output you want. That sounds like strong AI to me.
Producing something like this, even just to recognize numbers, gets really hairy. For example is "123.456" one number or two? How about this "123,456"? Is "35F" a decimal number and an 'F' or is it the hex value 0x35F? You're going to have to build something that will parse in the way you need. You can do this with regular expressions, or you can do it with sscanf, or you can do it some other way, but you're going to have to write something custom.
However, with basic regular expressions, you can do this yourself. It won't be magic, but it's not that much work. Something like this will parse the lines you're interested in and consolidate them (Perl):
my #vals = ();
while (defined(my $line = <>))
{
if ($line =~ /The temperature at P(\d*) is (\d*)F./)
{
push(#vals, "($1,$2)");
}
}
print "The temperature at P%d is %dF. {";
for (my $i = 0; $i < #vals; $i++)
{
print $vals[$i];
if ($i < #vals - 1)
{
print ",";
}
}
print "}\n";
The output from this isL
The temperature at P%d is %dF. {(1,35),(1,40),(3,35),(1,40)}
You could do something similar for each type of line you need to parse. You could even read these regular expressions from a file, instead of custom coding each one.
I don't know of any specific tool to do that. What I did when I had a similar problem to solve was trying to guess regular expressions to match lines.
I then processed the files and displayed only the unmatched lines. If a line is unmatched, it means that the pattern is wrong and should be tweaked or another pattern should be added.
After around an hour of work, I succeeded in finding the ~20 patterns to match 10000+ lines.
In your case, you can first "guess" that one pattern is "The temperature at P[1-3] is [0-9]{2}F.". If you reprocess the file removing any matched line, it leaves "only":
Logger stopped.
Logger started.
Which you can then match with "Logger (.+).".
You can then refine the patterns and find new ones to match your whole log.
#John: I think that the question relates to an algorithm that actually recognises patterns in log files and automatically "guesses" appropriate format strings and data for it. The *scanf family can't do that on its own, it can only be of help once the patterns have been recognised in the first place.
#Derek Park: Well, even a strong AI couldn't be sure it had the right answer.
Perhaps some compression-like mechanism could be used:
Find large, frequent substrings
Find large, frequent substring patterns. (i.e. [pattern:1] [junk] [pattern:2])
Another item to consider might be to group lines by edit-distance. Grouping similar lines should split the problem into one-pattern-per-group chunks.
Actually, if you manage to write this, let the whole world know, I think a lot of us would like this tool!
#Anders
Well, even a strong AI couldn't be sure it had the right answer.
I was thinking that sufficiently strong AI could usually figure out the right answer from the context. e.g. Strong AI could recognize that "35F" in this context is a temperature and not a hex number. There are definitely cases where even strong AI would be unable to answer. Those are the same cases where a human would be unable to answer, though (assuming very strong AI).
Of course, it doesn't really matter, since we don't have strong AI. :)
http://www.logparser.com forwards to an IIS forum which seems fairly active. This is the official site for Gabriele Giuseppini's "Log Parser Toolkit". While I have never actually used this tool, I did pick up a cheap copy of the book from Amazon Marketplace - today a copy is as low as $16. Nothing beats a dead-tree-interface for just flipping through pages.
Glancing at this forum, I had not previously heard about the "New GUI tool for MS Log Parser, Log Parser Lizard" at http://www.lizardl.com/.
The key issue of course is the complexity of your GRAMMAR. To use any kind of log-parser as the term is commonly used, you need to know exactly what you're scanning for, you can write a BNF for it. Many years ago I took a course based on Aho-and-Ullman's "Dragon Book", and the thoroughly understood LALR technology can give you optimal speed, provided of course that you have that CFG.
On the other hand it does seem you're possibly reaching for something AI-like, which is a different order of complexity entirely.

Resources