Scalability of aho corasick - algorithm

I want to search a text document for occurrences of keyphrases from a database of keyphrases (extracted from wikipedia article titles). (ie. given a document i want to find whether any of the phrases have a corresponding wikipedia article) I found out about the Aho-Corasick algorithm. I want to know if building an Aho-Corasick automaton for dictionary of millions of entries is efficient and scalable.

Let's just make a simple calculations :
Assume that you have 1 million patterns (strings, phrases) with average length 10 characters and a value (label, token, pointer etc) of 1 word (4 bytes) length , assigned to each pattern
Then you will need an array of 10+4=14 million bytes (14Mb) just to hold the list of patterns.
From 1 million patterns 10 bytes (letters, chars) each you could build an AC trie with no more than 10 million nodes. How big is this trie in practice depends on the size of each node.
It should at least keep 1 byte for a label (letter) and word (4bytes) for a pointer to a next node in trie (or a pattern for a terminal node) plus 1 bit (boolean) to mark terminal node,
Total about 5 bytes
So, the minimum size of a trie for 1 million patterns 10 chars you will need min 50 million bytes or about 50 Mb of memory.
In practice it might be 3-10 times more , but yet is very-very manageable, as even 500Mb memory is very moderate today. (Compare it with Windows applications like Word or Outlook)
Given that in terms of speed Aho-Corasick (AC) algorithm is almost unbeatable, it still remains the best algorithm for multiple pattern match ever. That's my strong personal educated opinion apart from academic garbage .
All reports of "new" latest and greatest algorithms that might outperform AC are highly exaggerated (except maybe for some special cases with short patterns like DNA)
The only improvement of AC could in practice go along the line of more and faster hardware (multiple cores, faster CPUs, clusters etc)
Don't take my word for it, test it for yourself. But remember that real speed of AC strongly depends on implementation (language and quality of coding)

In theory, it should maintain linear speed subject only to the effects of the memory hierarchy - it will slow down as it gets too big to fit in cache, and when it gets really big, you'll have problems if it starts getting paged out.
OTOH the big win with Aho-Corasick is when searching for decent sized substrings that may occur at any possible location within the string being fed in. If your text document is already cut up into words, and your search phrases are no more than e.g. 6 words long, then you could build a hash table of K-word phrases, and then look up every K-word contiguous section of words from the input text in it, for K = 1..6.
(Answer to comment)
Aho-Corasick needs to live in memory, because you will be following pointers all over the place. If you have to work outside memory, it's probably easiest to go back to old-fashioned sort/merge. Create a file of K-word records from the input data, where K is the maximum number of words in any phrase you are interested in. Sort it, and then merge it against a file of sorted Wikipedia phrases. You can probably do this almost by hand on Unix/Linux, using utilities such as sort and join, and a bit of shell/awk/perl/whatever. See also http://en.wikipedia.org/wiki/Key_Word_in_Context (I'm old enough to have actually used one of these indexes, provided as bound pages of computer printout).

Well there is a workaround. By writing the built AC trie of dictionary into a text file in a xml-like format, making an index file for the first 6 levels of that trie, etc... In my tests I search for all partial matches of a sentence in the dictionary (500'000 entries), and I get ~150ms for ~100 results for a sentence of 150-200 symbols.
For more details, check out this paper : http://212.34.233.26/aram/IJITA17v2A.Avetisyan.doc

There are other ways to get performance:
- condense state transitions: you can get them down to 32 bits.
- ditch the pointers; write the state transitions to a flat vector.
- pack nodes near the tree root together: they will be in cache.
The implementation takes about 3 bytes per char of the original pattern set,
and for 32-bit nodes, can take a pattern space of about 10M chars.
For 64-bit nodes, have yet to hit (or figure) the limit.
Doc: https://docs.google.com/document/d/1e9Qbn22__togYgQ7PNyCz3YzIIVPKvrf8PCrFa74IFM/view
Src: https://github.com/mischasan/aho-corasick

Related

What is the fastest way to intersect two large set of ids

The Problem
On a server, I host ids in a json file. From clients, I need to mandate the server to intersect and sometimes negate these ids (the ids never travel to the client even though the client instructs the server its operations to perform).
I typically have 1000's of ids, often have 100,000's of ids, and have a maximum of 56,000,000 of them, where each value is unique and between -100,000,000 and +100,000,000.
These ids files are stable and do not change (so it is possible to generate a different representation for it that is better adapted for the calculations if needed).
Sample ids
Largest file sizes
I need an algorithm that will intersect ids in the sub-second range for most cases. What would you suggest? I code in java, but do not limit myself to java for the resolution of this problem (I could use JNI to bridge to native language).
Potential solutions to consider
Although you could not limit yourselves to the following list of broad considerations for solutions, here is a list of what I internally debated to resolve the situation.
Neural-Network pre-qualifier: Train a neural-network for each ids list that accepts another list of ids to score its intersection potential (0 means definitely no intersection, 1 means definitely there is an intersection). Since neural networks are good and efficient at pattern recognition, I am thinking of pre-qualifying a more time-consuming algorithm behind it.
Assembly-language: On a Linux server, code an assembly module that does such algorithm. I know that assembly is a mess to maintain and code, but sometimes one need the speed of an highly optimized algorithm without the overhead of a higher-level compiler. Maybe this use-case is simple enough to benefit from an assembly language routine to be executed directly on the Linux server (and then I'd always pay attention to stick with the same processor to avoid having to re-write this too often)? Or, alternately, maybe C would be close enough to assembly to produce clean and optimized assembly code without the overhead to maintain assembly code.
Images and GPU: GPU and image processing could be used and instead of comparing ids, I could BITAND images. That is, I create a B&W image of each ids list. Since each id have unique values between -100,000,000 and +100,000,000 (where a maximum of 56,000,000 of them are used), the image would be mostly black, but the pixel would become white if the corresponding id is set. Then, instead of keeping the list of ids, I'd keep the images, and do a BITAND operation on both images to intersect them. This may be fast indeed, but then to translate the resulting image back to ids may be the bottleneck. Also, each image could be significantly large (maybe too large for this to be a viable solution). An estimate of a 200,000,000 bits sequence is 23MB each, just loading this in memory is quite demanding.
String-matching algorithms: String comparisons have many adapted algorithms that are typically extremely efficient at their task. Create a binary file for each ids set. Each id would be 4 bytes long. The corresponding binary file would have each and every id sequenced as their 4 bytes equivalent into it. The algorithm could then be to process the smallest file to match each 4 bytes sequence as a string into the other file.
Am I missing anything? Any other potential solution? Could any of these approaches be worth diving into them?
I did not yet try anything as I want to secure a strategy before I invest what I believe will be a significant amount of time into this.
EDIT #1:
Could the solution be a map of hashes for each sector in the list? If the information is structured in such a way that each id resides within its corresponding hash key, then, the smaller of the ids set could be sequentially ran and matching the id into the larger ids set first would require hashing the value to match, and then sequentially matching of the corresponding ids into that key match?
This should make the algorithm an O(n) time based one, and since I'd pick the smallest ids set to be the sequentially ran one, n is small. Does that make sense? Is that the solution?
Something like this (where the H entry is the hash):
{
"H780" : [ 45902780, 46062780, -42912780, -19812780, 25323780, 40572780, -30131780, 60266780, -26203780, 46152780, 67216780, 71666780, -67146780, 46162780, 67226780, 67781780, -47021780, 46122780, 19973780, 22113780, 67876780, 42692780, -18473780, 30993780, 67711780, 67791780, -44036780, -45904780, -42142780, 18703780, 60276780, 46182780, 63600780, 63680780, -70486780, -68290780, -18493780, -68210780, 67731780, 46092780, 63450780, 30074780, 24772780, -26483780, 68371780, -18483780, 18723780, -29834780, 46202780, 67821780, 29594780, 46082780, 44632780, -68406780, -68310780, -44056780, 67751780, 45912780, 40842780, 44642780, 18743780, -68220780, -44066780, 46142780, -26193780, 67681780, 46222780, 67761780 ],
"H782" : [ 27343782, 67456782, 18693782, 43322782, -37832782, 46152782, 19113782, -68411782, 18763782, 67466782, -68400782, -68320782, 34031782, 45056782, -26713782, -61776782, 67791782, 44176782, -44096782, 34041782, -39324782, -21873782, 67961782, 18703782, 44186782, -31143782, 67721782, -68340782, 36103782, 19143782, 19223782, 31711782, 66350782, 43362782, 18733782, -29233782, 67811782, -44076782, -19623782, -68290782, 31721782, 19233782, 65726782, 27313782, 43352782, -68280782, 67346782, -44086782, 67741782, -19203782, -19363782, 29583782, 67911782, 67751782, 26663782, -67910782, 19213782, 45992782, -17201782, 43372782, -19992782, -44066782, 46142782, 29993782 ],
"H540" : [...
You can convert each file (list of ids) into a bit-array of length 200_000_001, where bit at index j is set if the list contains value j-100_000_000. It is possible, because the range of id values is fixed and small.
Then you can simply use bitwise and and not operations to intersect and negate lists of ids. Depending on the language and libraries used, it would require operating element-wise: iterating over arrays and applying corresponding operations to each index.
Finally, you should measure your performance and decide whether you need to do some optimizations, such as parallelizing operations (you can work on different parts of arrays on different processors), preloading some of arrays (or all of them) into memory, using GPU, etc.
First, the bitmap approach will produce the required performance, at a huge overhead in memory. You'll need to benchmark it, but I'd expect times of maybe 0.2 seconds, with that almost entirely dominated by the cost of loading data from disk, and then reading the result.
However there is another approach that is worth considering. It will use less memory most of the time. For most of the files that you state, it will perform well.
First let's use Cap'n Proto for a file format. The type can be something like this:
struct Ids {
is_negated #0 :Bool;
ids #1 :List(Int32);
}
The key is that ids are always kept sorted. So list operations are a question of running through them in parallel. And now:
Applying not is just flipping is_negated.
If neither is negated, it is a question of finding IDs in both lists.
If the first is not negated and the second is, you just want to find IDs in the first that are not in the second.
If the first is negated and the second is not, you just want to find IDs in the second that are not in the first.
If both are negated, you just want to find all ids in either list.
If your list has 100k entries, then the file will be about 400k. A not requires copying 400k of data (very fast). And intersecting with another list of the same size involves 200k comparisons. Integer comparisons complete in a clock cycle, and branch mispredictions take something like 10-20 clock cycles. So you should be able to do this operation in the 0-2 millisecond range.
Your worst case 56,000,000 file will take over 200 MB and intersecting 2 of them can take around 200 million operations. This is in the 0-2 second range.
For the 56 million file and a 10k file, your time is almost all spent on numbers in the 56 million file and not in the 10k one. You can speed that up by adding a "galloping" mode where you do a binary search forward in the larger file looking for the next matching number and picking most of them. Do be warned that this code tends to be tricky and involves lots of mispredictions. You'll have to benchmark it to find out how big a size difference is needed.
In general this approach will lose for your very biggest files. But it will be a huge win for most of the sizes of file that you've talked about.

Find the most repeated character out of 4 strings

Let's say I have a document & the document is spread across 4 different machines, I would like to get a character which has the highest repeated count (all 4 machines combined).
One approach I have is to use a hashmap in each machine and calculate the frequency on each machine individually and then pass that hashmap to the main server where hashmaps from all the 4 machines will be merged.
Thus we'll get the character with the highest frequency.
But the cache here is that I want to minimize the data transferred from each machine.
What improvements can be made ?
[EDIT]
Each machine holds a part of the document
If you don't mind it taking longer...
Each computer passes the most frequent character(s). Hopefully, the number of characters with the highest frequency is low. Ideally, it would be almost always only one.
Main server combines them into a set. If the set has a single character done. Otherwise this set is passed along to the computers, likely as an array or list. Assuming only one character from each computer, this list would have only 2-4 characters.
Each computer returns the frequencies of each character in the set.
Main server sums the frequencies, obtaining the most frequent.
I assert that without prior knowledge of the distribution of characters in the document then any approach you take will have to reduce the data from all 4 computers onto one of them. To minimise the data transferred it is necessary to minimise the size of the data structure which holds the character counts on each computer.
Supposing that you are working with an alphabet with N characters your problem is now the design of a data structure which can hold N integers (in some range [0..m], m being the number of characters in the alphabet) and there is any number of such data structures to be found.
Of course, if you have prior knowledge of the distribution of characters, for example if you know that it is pure text written in English, you have a range of possible approaches to data compression.
Given the relatively small values for N and m likely to be found in practice I agree with the general thrust of the commentary, that it is probably not worth devising a complicated structure to minimise the amount of data transferred, sending an array of N integers would be adequate in most conceivable circumstances.

Most frequent words in a terabyte of data

I came across a problem where we have to find say the most 10 frequent words in a terabyte of file or string.
One solution I could think was using a hash table (word, count) along with a max heap. But fitting all the words if the words are unique might cause a problem.
I thought of another solution using Map-Reduce by splitting the chunks on different nodes.
Another solution would be to build a Trie for all the words and update the count of each word as we scan through the file or string.
Which one of the above would be a better solution? I think the first solution is pretty naive.
Split your available memory into two halves. Use one as a 4-bit counting Bloom filter and the other half as a fixed size hash table with counts. The role of the counting Bloom filter is to filter out rarely occuring words with high memory efficiency.
Check your 1 TB of words against the initially empty Bloom filter; if a word is already in and all buckets are set to the maximum value of 15 (this may be partly or wholly a false positive), pass it through. If it is not, add it.
Words that passed through get counted; for a majority of words, this is every time but the first 15 times you see them. A small percentage will start to get counted even sooner, bringing a potential inaccuracy of up to 15 occurrences per word into your results. That's a limitation of Bloom filters.
When the first pass is over, you can correct the inaccuracy with a second pass if desired. Deallocate the Bloom filter, deallocate also all counts that are not within 15 occurrences behind the tenth most frequent word. Go through the input again, this time accurately counting words (using a separate hash table), but ignoring words that have not been retained as approximate winners from the first pass.
Notes
The hash table used in the first pass may theoretically overflow with certain statistical distributions of the input (e.g., each word exactly 16 times) or with extremely limited RAM. It is up to you to calculate or try out whether this can realistically happen to you or not.
Note also that the bucket width (4 bits in the above description) is just a parameter of the construction. A non-counting Bloom filter (bucket width of 1) would filter out most unique words nicely, but do nothing to filter out other very rarely occuring words. A wider bucket size will be more prone to cross-talk between words (because there will be fewer buckets), and it will also reduce guaranteed accuracy level after the first pass (15 occurrences in the case of 4 bits). But these downsides will be quantitatively insignificant until some point, while I'm imagining the more aggresive filtering effect as completely crucial for keeping the hash table in sub-gigabyte sizes with non-repetitive natural language data.
As for the order of magnitude memory needs of the Bloom filter itself; these people are working way below 100 MB, and with a much more challenging application ("full" n-gram statistics, rather than threshold 1-gram statistics).
Sort the terabyte file alphabetically using mergesort. In the initial pass, use quick sort using all available physical RAM to pre-sort long runs of words.
When doing so, represent a continuous sequence of identical words by just one such word and a count. (That is, you are adding the counts during the merges.)
Then resort the file, again using mergesort with quick sort presorting, but this time by the counts rather than alphabetically.
This is slower but simpler to implement than my other answer.
The best I could think of:
Split data to parts you can store in memory.
For each part get N most frequent words, you will get N * partsNumber words.
Read all data again counting words you got before.
It won't always give you correct answer, but it will work in fixed memory and linear time.
And why do you think a building of the Trie structure is not the best decision? Mark all of child nodes by a counter and that's it! Maximum memory complexity will be O(26 * longest_word_length), and time complexity should be O(n), that's not bad, is it?

What are some algorithms or methods for highly efficient search indexes over random data?

I need a "point of departure" to research options for highly efficient search algorithms, methods and techniques for finding random strings within a massive amount of random data. I'm just learning about this stuff, so anyone have experience with this? Here's some conditions I want to optimize for:
The first idea is to minimize file size in terms of search indexes and the like - so the smallest possible index, or even better - search on the fly.
The data to be searched is high amounts of entirely random data - say, random binary 0s and 1s with no perceptable pattern. Gigabytes of the stuff.
Presented with an equally random search string, say 0111010100000101010101 what is the most efficient way to locate that same string within a mountain of random data? What are the tradeoffs in performance, etc?
All instances of that search string need to be located, so that seems like an important condition that limits the types of solutions to be implemented.
Any hints, clues, techniques, wiki articles etc. would be greatly appreciated! I'm just studying this now, and it seems interesting. Thanks.
A simple way to do this is to build an index on all possible N-byte substrings of the searchable data (with N = 4 or 8 or something like that). The index would map from the small chunk to all locations where that chunk occurs.
When you want to lookup a value, take the first N bytes and use them to find all possible locations. You need to verify all locations of course.
A high value for N means more index space usage and faster lookups because less false positives will be found.
Such an index is likely to be a small multiple of the base data in size.
A second way would be to split the searchable data into contiguous, non-overlapping chunks of N bytes (N = 64 or so). Hash each chunk down to a smaller size M (M = 4 or 8 or so).
This saves a lot of index space because you don't need all the overlapping chunks.
When you lookup a value you can locate the candidate matches by looking up all contiguous, overlapping substrings of the string to be found. This assumes that the string to be found is at least N * 2 bytes in size.

Finding sets that have specific subsets

I am a graduate student of physics and I am working on writing some code to sort several hundred gigabytes of data and return slices of that data when I ask for it. Here is the trick, I know of no good method for sorting and searching data of this kind.
My data essentially consists of a large number of sets of numbers. These sets can contain anywhere from 1 to n numbers within them (though in 99.9% of the sets, n is less than 15) and there are approximately 1.5 ~ 2 billion of these sets (unfortunately this size precludes a brute force search).
I need to be able to specify a set with k elements and have every set with k+1 elements or more that contains the specified subset returned to me.
Simple Example:
Suppose I have the following sets for my data:
(1,2,3)
(1,2,3,4,5)
(4,5,6,7)
(1,3,8,9)
(5,8,11)
If I were to give the request (1,3) I would have the sets: (1,2,3),
(1,2,3,4,5), and (1,3,8,9).
The request (11) would return the set: (5,8,11).
The request (1,2,3) would return the sets: (1,2,3) and (1,2,3,4,5)
The request (50) would return no sets:
By now the pattern should be clear. The major difference between this example and my data is that the sets withn my data are larger, the numbers used for each element of the sets run from 0 to 16383 (14 bits), and there are many many many more sets.
If it matters I am writing this program in C++ though I also know java, c, some assembly, some fortran, and some perl.
Does anyone have any clues as to how to pull this off?
edit:
To answer a couple questions and add a few points:
1.) The data does not change. It was all taken in one long set of runs (each broken into 2 gig files).
2.) As for storage space. The raw data takes up approximately 250 gigabytes. I estimate that after processing and stripping off a lot of extraneous metadata that I am not interested in I could knock that down to anywhere from 36 to 48 gigabytes depending on how much metadata I decide to keep (without indices). Additionally if in my initial processing of the data I encounter enough sets that are the same I might be able to comress the data yet further by adding counters for repeat events rather than simply repeating the events over and over again.
3.) Each number within a processed set actually contains at LEAST two numbers 14 bits for the data itself (detected energy) and 7 bits for metadata (detector number). So I will need at LEAST three bytes per number.
4.) My "though in 99.9% of the sets, n is less than 15" comment was misleading. In a preliminary glance through some of the chunks of the data I find that I have sets that contain as many as 22 numbers but the median is 5 numbers per set and the average is 6 numbers per set.
5.) While I like the idea of building an index of pointers into files I am a bit leery because for requests involving more than one number I am left with the semi slow task (at least I think it is slow) of finding the set of all pointers common to the lists, ie finding the greatest common subset for a given number of sets.
6.) In terms of resources available to me, I can muster approximately 300 gigs of space after I have the raw data on the system (The remainder of my quota on that system). The system is a dual processor server with 2 quad core amd opterons and 16 gigabytes of ram.
7.) Yes 0 can occur, it is an artifact of the data acquisition system when it does but it can occur.
Your problem is the same as that faced by search engines. "I have a bajillion documents. I need the ones which contain this set of words." You just have (very conveniently), integers instead of words, and smallish documents. The solution is an inverted index. Introduction to Information Retrieval by Manning et al is (at that link) available free online, is very readable, and will go into a lot of detail about how to do this.
You're going to have to pay a price in disk space, but it can be parallelized, and should be more than fast enough to meet your timing requirements, once the index is constructed.
Assuming a random distribution of 0-16383, with a consistent 15 elements per set, and two billion sets, each element would appear in approximately 1.8M sets. Have you considered (and do you have the capacity for) building a 16384x~1.8M (30B entries, 4 bytes each) lookup table? Given such a table, you could query which sets contain (1) and (17) and (5555) and then find the intersections of those three ~1.8M-element lists.
My guess is as follows.
Assume that each set has a name or ID or address (a 4-byte number will do if there are only 2 billion of them).
Now walk through all the sets once, and create the following output files:
A file which contains the IDs of all the sets which contain '1'
A file which contains the IDs of all the sets which contain '2'
A file which contains the IDs of all the sets which contain '3'
... etc ...
If there are 16 entries per set, then on average each of these 2^16 files will contain the IDs of 2^20 sets; with each ID being 4 bytes, this would require 2^38 bytes (256 GB) of storage.
You'll do the above once, before you process requests.
When you receive requests, use these files as follows:
Look at a couple of numbers in the request
Open up a couple of the corresponding index files
Get the list of all sets which exist in both these files (there's only a million IDs in each file, so this should't be difficult)
See which of these few sets satisfy the remainder of the request
My guess is that if you do the above, creating the indexes will be (very) slow and handling requests will be (very) quick.
I have recently discovered methods that use Space Filling curves to map the multi-dimensional data down to a single dimension. One can then index the data based on its 1D index. Range queries can be easily carried out by finding the segments of the curve that intersect the box that represents the curve and then retrieving those segments.
I believe that this method is far superior to making the insane indexes as suggested because after looking at it, the index would be as large as the data I wished to store, hardly a good thing. A somewhat more detailed explanation of this can be found at:
http://www.ddj.com/184410998
and
http://www.dcs.bbk.ac.uk/~jkl/publications.html
Make 16383 index files, one for each possible search value. For each value in your input set, write the file position of the start of the set into the corresponding index file. It is important that each of the index files contains the same number for the same set. Now each index file will consist of ascending indexes into the master file.
To search, start reading the index files corresponding to each search value. If you read an index that's lower than the index you read from another file, discard it and read another one. When you get the same index from all of the files, that's a match - obtain the set from the master file, and read a new index from each of the index files. Once you reach the end of any of the index files, you're done.
If your values are evenly distributed, each index file will contain 1/16383 of the input sets. If your average search set consists of 6 values, you will be doing a linear pass over 6/16383 of your original input. It's still an O(n) solution, but your n is a bit smaller now.
P.S. Is zero an impossible result value, or do you really have 16384 possibilities?
Just playing devil's advocate for an approach which includes brute force + index lookup :
Create an index with the min , max and no of elements of sets.
Then apply brute force excluding sets where max < max(set being searched) and min > min (set being searched)
In brute force also exclude sets whole element count is less than that of the set being searched.
95% of your searches would really be brute forcing a very smaller subset. Just a thought.

Resources