Efficient most common suffix algorithm?

I have a few GBs' worth of strings, and for every prefix I want to find the 10 most common suffixes. Is there an efficient algorithm for that?
An obvious solution would be:
Store sorted list of <string, count> pairs.
Identify, by binary search, the extent (range of entries) covered by the prefix being queried.
Find 10 highest counts in this extent.
Possibly precompute the results for all short prefixes, so a query never needs to scan a large portion of the data.
I'm not sure whether that would actually be efficient at all. Is there a better way I've overlooked?
Answers must be returned in real time, but preprocessing can take as long as necessary.
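A minimal Python sketch of the sorted-list idea above (the toy words, counts, and helper names are illustrative, not part of the question):

    import bisect
    import heapq

    # Sorted list of (string, count) pairs; the words and counts are made up.
    pairs = sorted([
        ("apple", 50), ("applet", 7), ("application", 120),
        ("apply", 30), ("banana", 80),
    ])
    keys = [s for s, _ in pairs]              # parallel key list for bisect

    def top_suffixes(prefix, k=10):
        # Binary search for the extent of entries sharing the prefix;
        # assumes '\uffff' sorts after every character that occurs in the data.
        lo = bisect.bisect_left(keys, prefix)
        hi = bisect.bisect_left(keys, prefix + "\uffff")
        best = heapq.nlargest(k, pairs[lo:hi], key=lambda p: p[1])
        return [(s[len(prefix):], c) for s, c in best]

    print(top_suffixes("appl", 3))   # [('ication', 120), ('e', 50), ('y', 30)]

The precomputation step mentioned in the question would simply amount to caching the result of top_suffixes for every short prefix.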

Place the words in a tree e.g. trie or radix, placing a "number of occurrences" counter for each full word, so you know which nodes are endings and how common they are.
Find the prefix/postfix combos by iteration.
Both these operations are O(n*k) where k is the length of the longest word; this is the same complexity as a hash-table.
The HAT-trie is a cache-conscious version that promises high performance.
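For comparison, here is a hedged sketch of the trie-with-counters idea from this answer; the node layout and the top-k collection are one possible arrangement, not a prescribed one:

    import heapq

    class TrieNode:
        __slots__ = ("children", "count")
        def __init__(self):
            self.children = {}
            self.count = 0            # > 0 only where a full word ends

    root = TrieNode()

    def insert(word, count=1):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += count

    def top_suffixes(prefix, k=10):
        # Walk down to the prefix node, then collect (suffix, count) from its subtree.
        node = root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results, stack = [], [(node, "")]
        while stack:
            n, suffix = stack.pop()
            if n.count:
                results.append((suffix, n.count))
            for ch, child in n.children.items():
                stack.append((child, suffix + ch))
        return heapq.nlargest(k, results, key=lambda r: r[1])

    for w, c in [("carpet", 5), ("carton", 9), ("car", 2)]:
        insert(w, c)
    print(top_suffixes("car"))   # [('ton', 9), ('pet', 5), ('', 2)]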

Compressing words into one word consisting of them as subwords [duplicate]

I bet somebody has solved this before, but my searches have come up empty.
I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.
Example: doll dollhouse house
These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at 0, and house is five letters at 4.
What I've come up with so far is:
Sort the words longest to shortest: (dollhouse, house, doll)
Scan the buffer to see if the string already exists as a substring, if so note the location.
If it doesn't already exist, add it to the end of the buffer.
Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.
This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.
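A small sketch of the greedy packing described above, assuming the offsets are just tracked in a dict (the real name-table layout is of course different):

    def pack(words):
        # Greedy packing: longest word first, reuse existing substrings where possible.
        buffer = ""
        offsets = {}
        for w in sorted(words, key=len, reverse=True):
            pos = buffer.find(w)
            if pos == -1:              # not already a substring: append it
                pos = len(buffer)
                buffer += w
            offsets[w] = (pos, len(w))
        return buffer, offsets

    buf, offs = pack(["doll", "dollhouse", "house"])
    print(buf)    # dollhouse
    print(offs)   # {'dollhouse': (0, 9), 'house': (4, 5), 'doll': (0, 4)}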
As a side note, this storage scheme is used for the data in the `name' table of a TrueType font, cf. http://www.microsoft.com/typography/otspec/name.htm
This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.
As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.
Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.
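For concreteness, here is a quadratic sketch of the repeated longest-overlap merge; it skips the substring-removal step and the radix-/suffix-tree speedups mentioned above, so it only illustrates the greedy rule:

    def overlap(a, b):
        # Length of the longest suffix of a that is a prefix of b.
        for k in range(min(len(a), len(b)), 0, -1):
            if a.endswith(b[:k]):
                return k
        return 0

    def greedy_superstring(strings):
        # Assumes strings that are pure substrings of others were already removed.
        strings = list(strings)
        while len(strings) > 1:
            best = (-1, None, None)
            for i, a in enumerate(strings):
                for j, b in enumerate(strings):
                    if i != j:
                        k = overlap(a, b)
                        if k > best[0]:
                            best = (k, i, j)
            k, i, j = best
            merged = strings[i] + strings[j][k:]
            strings = [s for n, s in enumerate(strings) if n not in (i, j)] + [merged]
        return strings[0]

    print(greedy_superstring(["ragdoll", "dollhouse"]))   # ragdollhouse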
I think you can use a radix tree. It costs some memory because of the pointers to leaves and parents, but matching strings is easy: O(k), where k is the length of the longest string.
My first thought here is: use a data structure to determine the common prefixes and suffixes of your strings, then sort the words taking those prefixes and suffixes into account. This would produce your desired ragdollhouse.
Looks similar to the knapsack problem, which is NP-complete, so there is no "definitive" algorithm.
I did a lab back in college where we were tasked with implementing a simple compression program.
What we did was sequentially apply these techniques to text:
BWT (Burrows-Wheeler transform): helps reorder letters into runs of identical letters (hint: there are mathematical shortcuts for obtaining the transformed letters without actually building and sorting all the rotations)
MTF (Move to front transform): Rewrites the sequence of letters as a sequence of indices of a dynamic list.
Huffman encoding: A form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols
Here, I found the assignment page.
To get back your original text, you do (1) Huffman decoding, (2) inverse MTF, and then (3) inverse BWT. There are several good resources on all of this on the Interwebs.
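A toy sketch of the first two stages (a naive rotation-sorting BWT and MTF); the Huffman stage and the inverse transforms are omitted, and real implementations avoid materializing the rotations, as the hint above suggests:

    def bwt(s, eos="\0"):
        # Naive Burrows-Wheeler transform: sort all rotations, keep the last column.
        s += eos
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)

    def mtf(s):
        # Move-to-front: emit each symbol's index in a dynamic alphabet list.
        alphabet = sorted(set(s))
        out = []
        for ch in s:
            i = alphabet.index(ch)
            out.append(i)
            alphabet.insert(0, alphabet.pop(i))
        return out

    transformed = bwt("banana")
    print(repr(transformed))   # 'annb\x00aa'
    print(mtf(transformed))    # [1, 3, 0, 3, 3, 3, 0]

On real text the MTF output is dominated by small, repetitive indices, which is what the Huffman stage then exploits.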
Refine step 3.
Look through the current list and see whether any word in the list starts with a suffix of the current word. (You might want to require the suffix to be longer than some length, longer than 1, for example.)
If yes, then prepend the current word's distinct prefix to that existing word, and adjust all existing references appropriately (slow!)
If no, add the word to the end of the list as in the current step 3.
This would give you 'ragdollhouse' as the stored data in your example. It is not clear whether it would always work optimally (if you also had 'barbiedoll' and 'dollar' in the word list, for example).
I would not reinvent this wheel yet another time. An enormous amount of effort has already gone into compression algorithms, so why not use one of the ones already available?
Here are a few good choices:
gzip for fast compression / decompression speed
bzip2 for a bit better compression but much slower decompression
LZMA for very high compression ratio and fast decompression (faster than bzip2 but slower than gzip)
lzop for very fast compression / decompression
If you use Java, gzip is already integrated.
It's not clear what you want to do.
Do you want a data structure that lets you store the strings in a memory-conscious manner while still allowing operations like search in a reasonable amount of time?
Or do you just want an array of words, compressed?
In the first case, you can go for a Patricia trie or a String B-Tree.
For the second case, you can adopt an index-compression technique, like this:
If you have something like:
aaa
aaab
aasd
abaco
abad
You can compress them like this:
0aaa
3b
2sd
1baco
3d
The number is the length of the largest common prefix with the preceding string.
You can tweak that scheme, for example by "restarting" the common prefix every K words, to allow fast reconstruction.
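A small sketch of that front-coding scheme (encode and decode) using the example list above; the restart-every-K-words refinement is left out:

    def common_prefix_len(a, b):
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n

    def front_encode(sorted_words):
        out, prev = [], ""
        for w in sorted_words:
            k = common_prefix_len(prev, w)
            out.append((k, w[k:]))            # (shared prefix length, remaining tail)
            prev = w
        return out

    def front_decode(encoded):
        words, prev = [], ""
        for k, tail in encoded:
            prev = prev[:k] + tail
            words.append(prev)
        return words

    enc = front_encode(["aaa", "aaab", "aasd", "abaco", "abad"])
    print(enc)                  # [(0, 'aaa'), (3, 'b'), (2, 'sd'), (1, 'baco'), (3, 'd')]
    print(front_decode(enc))    # ['aaa', 'aaab', 'aasd', 'abaco', 'abad']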

What's the best way to traverse a large dictionary of words?

Let's say I'm looking for a word that may or may not be in a dictionary of 95k words, and I cannot use word length to narrow the search. My question is about the fastest way to find the word without doing an O(n) lookup.
Here are my two thoughts:
First, store the words in a hash table; lookup of a word is O(1), which seems like the best scenario in my mind. But various websites also suggest using a trie, so my question is whether it's practical to have a trie that holds so many words.
The lookup would be O(k) in this case.
So what is the most optimal way of finding a word in a large dictionary?
Optimality depends on your use case: do you care about lookup time or space? (Also, do you care about inserting new words?)
The best you can do time-wise is a hash table, but for a dictionary it is space-inefficient. A trie compresses the space requirement because it stores shared prefixes only once rather than every word in full, but it takes longer to look up. So, to answer your question, it is more space-efficient to hold a large number of words in a trie than in a hash table.
If you are just searching for a single word, the cost of setting up a hash table or tree structure would exceed a linear search. These structures become (very) efficient when their costs are amortized over (very) many uses.
If the dictionary is sorted (and why wouldn't a dictionary be?), then you can look for a single word in log(n) time with a binary search through the file, no additional structures needed.
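A minimal illustration of that binary-search approach, with the sorted word list simply held in memory rather than searched on disk:

    import bisect

    words = sorted(["apple", "banana", "cherry", "date", "elderberry"])

    def contains(word):
        # O(log n) membership test on the sorted list.
        i = bisect.bisect_left(words, word)
        return i < len(words) and words[i] == word

    print(contains("cherry"))    # True
    print(contains("coconut"))   # False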
I think the best way to find a word in a dictionary is a B+ tree, and let me explain the reason.
Let's say you have a root block of 10 strings. The strings in the block are sorted, and each string is followed by a pointer to another block of 10 strings, and so on. So the only thing you have to do is string-compare your key word against the strings, starting with the first one, until you find one that compares smaller.
If we take it as given that each string has next to it a pointer to a block of words that are smaller in comparison, it will take you 5 steps and 5 such comparisons to reach the final block of data that may or may not contain your key word.
In those 5 comparisons, plus the comparisons within the final block, you are searching a dictionary of 10*10*10*10*10 = 100,000 words.
The algorithm runs in logarithmic time: log of 100,000 with base the number of strings per block. If each block holds 10 words, you need 5 steps.
I should mention that only the root of the tree has to be kept in RAM; all the other blocks can be stored on the hard drive without significant loss in performance, because only a few steps are needed.
Hope I explained it right :D At least I tried! Have fun
A trie is preferable because this data structure can be faster than a hash table. Hash tables are O(1) only in the ideal case; in real-world applications collisions occur. The various trie data structures don't suffer from this.
Another point is compression. Tries are much more compact than hash tables. A hash table requires spare space for efficient insert operations; if the load factor gets close to 100%, inserts take a very long time.
With a hash table you must compare your key with at least one key from the dictionary, and that key comparison takes O(k) where k is the key length. With a trie you are doing essentially the same thing: lookup is O(k).
Tries allow ordered traversal; hash tables don't.
There are many types of tries out there; for example, a ternary search trie is very good in this particular case. Array-mapped tries are also very fast compared to a regular hash table.

Substring search algorithms (very large haystack, small needle)

I know there are already several similar questions here, but I need some recommendations for my case (couldn't find anything similar).
I have to search a very large amount of data for a substring that would be about a billion times smaller (10 bytes in 10 billion bytes). The haystack doesn't change, so I can bear with large precomputation if needed. I just need the searching part to be as fast as possible.
I found algorithms which take O(n+m) time (n = haystack size, m = needle size), whereas the naive search takes O(n*m) in the worst case. As the m in this particular case would be very small, is there any other algorithm I could look into?
Edit:
Thanks everyone for your suggestions!
Some more info -
The data can be considered random bits, so I don't think any kind of indexing/sorting would be possible. The data to be searched can be anything, not English words or anything predictable.
You are looking for the data structure called the Trie or "prefix tree". In short, this data structure encodes all the possible string prefixes which can be found in your corpus.
Here is a paper which searches DNA sequences for a small substring, using a prefix tree. I imagine it might help you, since your case sounds similar.
If you know a definite limit on the length of the input search string, you can limit the growth of your trie so that it does not store any prefixes longer than that maximum. In this way, you may be able to fit a trie representing all 10 GB into less than 10 GB of memory. Especially for highly repetitive data, any sort of trie is a compressed data representation (or should be, if implemented sanely). Limiting the trie depth to the maximum search-string length reduces memory consumption still further.
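A toy illustration of such a depth-limited trie over the haystack, with nested dicts standing in for a compact trie and an assumed 10-byte cap on the needle length; a real 10 GB corpus would need a far more memory-conscious representation:

    MAX_NEEDLE = 10   # assumed upper bound on the needle length

    def build_substring_trie(haystack, max_len=MAX_NEEDLE):
        # Nested-dict trie holding every substring of the haystack up to max_len bytes.
        root = {}
        for i in range(len(haystack)):
            node = root
            for b in haystack[i:i + max_len]:      # b is an int when haystack is bytes
                node = node.setdefault(b, {})
        return root

    def contains(trie, needle):
        node = trie
        for b in needle:
            node = node.get(b)
            if node is None:
                return False
        return True

    trie = build_substring_trie(b"\x01\x02\x03\x02\x03\x04")
    print(contains(trie, b"\x02\x03\x04"))  # True
    print(contains(trie, b"\x04\x01"))      # False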
It's worth looking at suffix arrays and trees. They both require precomputation and significant memory, but they are better than reverse indexes in the sense that you can search for arbitrary substrings in O(m) (for suffix trees) and O(m + log n) (for suffix arrays with least common prefix info).
If you have a lot of time on your hands, you can look into compressed suffix arrays and succinct CSAs that are compressed versions of your data that are also self-indexing (i.e. the data is also the index, there is no external index). This is really the best of all worlds because not only do you have a compressed version of your data (you can throw the original data away), but it's also indexed! The problem is understanding the research papers and translating them into code :)
If you do not need perfect substring matching, but rather general searching capabilities, check out Lucene.
Prefix/suffix trees are generally the standard, best, and most cautious solution for this sort of thing, in my opinion. You can't go wrong with them.
But here is a different idea, which resorts to Bloom filters. You probably know what these are, but just in case (and for other people reading this answer): Bloom filters are very small, very compact bit-vectors which approximate set inclusion. If you have a set S and a Bloom filter for that set B(S), then
x ∈ S ⇒ x ∈ B(S)
but the converse does not hold. This is what is probabilistic about the structure: there is a (quantifiable) probability that the Bloom filter will return a false positive. But approximating inclusion with the Bloom filter is wildly faster than testing it exactly on the set.
(A simple use case: in a lot of applications, the Bloom filter is used, well, as a filter. Checking the cache is expensive, because you have to do a hard drive access, so programs like Squid will first check a small Bloom filter in memory, and if the Bloom filter returns a positive result, then Squid will go check the cache. If it was a false positive, that's OK, because Squid will find out when actually visiting the cache; the advantage is that the Bloom filter spares Squid from checking the cache for a lot of requests where it would have been useless.)
Bloom filters have been used with some success in string search. Here is a sketch (I may remember some of the details wrong) of this application. A text file is a sequence of N lines. You are looking for a word composed of M letters (and no word can be spread across two lines). A preprocessing phase builds ONE Bloom filter for each line, by adding every subsequence of the line to the Bloom filter; for instance, for this line
Touching this dreaded sight, twice seene of vs,
the corresponding Bloom filter would be populated with "T", "To", "Tou", ..., "o", "ou", ..., "vs,", "s", "s,", ",". (I may have this part wrong, or you might want to optimize.)
Then when searching for the subword of size M, simply do one very fast check on each of the Bloom filters, and when there is a hit, examine the line closely with the KMP algorithm, for instance. In practice, if you tune your Bloom filters well, the trade-off is remarkable. Searching is incredibly fast because you eliminate all useless lines.
I believe you could derive a useful scheme for your situation from this concept. Right now, I see two evident adaptations:
either cut your data set in many blocks of size K (each with its Bloom filter, like the lines in the previous example);
or use a sort of dichotomy where you split the set into two subsets, each with its own Bloom filter, then split each subset into two sub-subsets with their own Bloom filters, and so on (if you are going to add all substrings as suggested in the method I described, this second idea would be a bit prohibitive, except that you don't have to add all substrings, only substrings of length 1 to 10).
Both ideas can be combined in inventive ways to create multi-layered schemes.
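Here is a rough adaptation of the per-block Bloom filter idea; the filter size, hash count, block size, and overlap handling are all arbitrary choices, and a plain find() stands in for KMP in the verification step:

    import hashlib

    MAX_NEEDLE = 10       # assumed upper bound on needle length
    BLOCK = 1024          # assumed block size; tune for your data

    class Bloom:
        def __init__(self, m=1 << 17, k=3):
            self.m, self.k, self.bits = m, k, bytearray(m // 8)
        def _positions(self, item):
            for i in range(self.k):
                digest = hashlib.md5(bytes([i]) + item).digest()
                yield int.from_bytes(digest[:4], "big") % self.m
        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)
        def __contains__(self, item):
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    def preprocess(haystack):
        # One Bloom filter per block; blocks overlap by MAX_NEEDLE - 1 bytes so a
        # needle crossing a boundary is still caught by at least one block.
        filters = []
        step = BLOCK - (MAX_NEEDLE - 1)
        for start in range(0, len(haystack), step):
            chunk = haystack[start:start + BLOCK]
            bf = Bloom()
            for i in range(len(chunk)):
                for length in range(1, min(MAX_NEEDLE, len(chunk) - i) + 1):
                    bf.add(chunk[i:i + length])
            filters.append((start, chunk, bf))
        return filters

    def search(filters, needle):
        hits = set()
        for start, chunk, bf in filters:
            if needle in bf:                 # cheap check, may be a false positive
                pos = chunk.find(needle)     # exact verification (first hit per block)
                if pos != -1:
                    hits.add(start + pos)
        return sorted(hits)

    filters = preprocess(b"some large binary blob with a needle inside" * 50)
    print(search(filters, b"needle"))        # verified offsets, first hit per block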
Given Knuth-Morris-Pratt or Boyer-Moore you're not going to do any better; what you should consider is parallelizing your search process.
If you can afford the space (a lot of space!) to create an index, it'd definitely be worth your while indexing small chunks (e.g. four byte blocks) and storing these with their offsets within the haystack - then searches for 10 bytes involve searching for all four byte blocks for the first four bytes of the search term and checking the next six bytes.
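A minimal sketch of that four-byte-block index; in practice the offset lists for 10 GB of data would live on disk rather than in a Python dict:

    from collections import defaultdict

    def build_index(haystack, block=4):
        # Map every 4-byte block to the offsets at which it occurs.
        index = defaultdict(list)
        for i in range(len(haystack) - block + 1):
            index[haystack[i:i + block]].append(i)
        return index

    def search(haystack, index, needle, block=4):
        # Look up the needle's first 4 bytes, then verify the remaining bytes in place.
        candidates = index.get(needle[:block], [])
        return [i for i in candidates if haystack[i:i + len(needle)] == needle]

    data = b"\xde\xad\xbe\xef" * 3 + b"0123456789"
    idx = build_index(data)
    print(search(data, idx, b"0123456789"))   # [12]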

Algorithm to find multiple string matches

I'm looking for suggestions for an efficient algorithm for finding all matches in a large body of text. Terms to search for will be contained in a list and can have 1000+ possibilities. The search terms may be 1 or more words.
Obviously I could make multiple passes through the text comparing against each search term. Not too efficient.
I've thought about ordering the search terms and combining common sub-segments. That way I could eliminate large numbers of terms quickly. Language is C++ and I can use boost.
An example of search terms could be a list of Fortune 500 company names.
Ideas?
Don't reinvent the wheel
This problem has been intensively researched. Curiously, the best algorithms for searching ONE pattern/string do not extrapolate easily to multi-string matching.
The "grep" family implement the multi-string search in a very efficient way. If you can use them as external programs, do it.
In case you really need to implement the algorithm, I think the fastest way is to reproduce what agrep does (agrep excels in multi-string matching!). Here are the source and executable files.
And here you will find a paper describing the algorithms used, the theoretical background, and a lot of information and pointers about string matching.
A note of caution: multiple-string matching has been heavily researched by people like Knuth, Boyer, Moore, Baeza-Yates, and others. If you need a really fast algorithm, don't hesitate to stand on their broad shoulders. Don't reinvent the wheel.
As in the case of single patterns, there are several algorithms for multiple-pattern matching, and you will have to find the one that fits your purpose best. The paper A fast algorithm for multi-pattern searching (archived copy) does a review of most of them, including Aho-Corasick (which is kind of the multi-pattern version of the Knuth-Morris-Pratt algorithm, with linear complexity) and Commentz-Walter (a combination of Boyer-Moore and Aho-Corasick), and introduces a new one, which uses ideas from Boyer-Moore for the task of matching multiple patterns.
An alternative, hash-based algorithm not mentioned in that paper is the Rabin-Karp algorithm, which has a worse worst-case complexity than the other algorithms, but compensates for it by reducing the linear factor via hashing. Which one is better depends ultimately on your use case. You may need to implement several of them and compare them in your application if you want to choose the fastest one.
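As a hedged illustration of the Rabin-Karp approach for multiple patterns, here is a sketch that runs one rolling-hash pass per distinct pattern length; the base, modulus, and toy company names are arbitrary:

    def _poly_hash(s, base, mod):
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    def rabin_karp_multi(text, patterns, base=257, mod=(1 << 61) - 1):
        # One rolling-hash pass over the text per distinct pattern length.
        by_len = {}
        for p in patterns:
            by_len.setdefault(len(p), {}).setdefault(_poly_hash(p, base, mod), []).append(p)
        matches = []
        for m, table in by_len.items():
            if m == 0 or m > len(text):
                continue
            high = pow(base, m - 1, mod)             # weight of the outgoing character
            h = _poly_hash(text[:m], base, mod)
            for i in range(len(text) - m + 1):
                if i > 0:                            # slide the window by one character
                    h = ((h - ord(text[i - 1]) * high) * base + ord(text[i + m - 1])) % mod
                for p in table.get(h, []):
                    if text[i:i + m] == p:           # verify to rule out hash collisions
                        matches.append((i, p))
        return sorted(matches)

    terms = ["Apple", "Exxon Mobil", "Ford"]
    text = "Ford and Apple both trail Exxon Mobil in revenue."
    print(rabin_karp_multi(text, terms))
    # [(0, 'Ford'), (9, 'Apple'), (26, 'Exxon Mobil')]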
Assuming that the large body of text is static English text and that you need to match whole words, you can try the following (you should really clarify what exactly counts as a 'match', what kind of text you are looking at, etc. in your question).
First preprocess the whole document into a Trie or a DAWG.
A trie/DAWG has the following property:
Given a trie/dawg and a search term of length K, you can in O(K) time lookup the data associated with the word (or tell if there is no match).
Using a DAWG could save you more space as compared to a trie. Tries exploit the fact that many words will have a common prefix and DAWGs exploit the common prefix as well as the common suffix property.
In the trie, also maintain the list of positions at which each word occurs. For example, if the text is
That is that and so it is.
The node for the last t in 'that' will have the list {1,3} associated, and the node for the s in 'is' will have the list {2,7}.
Now when you get a single word search term, you can walk the trie and get the list of matches for that word easily.
If you get a multiple word search term, you can do the following.
Walk the trie with the first word in the search term. Get the list of matches and insert into a hashTable H1.
Now walk the trie with the second word in the search term. Get the list of matches. For each match position x, check if x-1 exists in the HashTable H1. If so, add x to new hashtable H2.
Walk the trie with the third word and get its list of matches. For each match position y, check if y-1 exists in H2; if so, add y to a new hashtable H3.
Continue so forth.
At the end you get a list of matches for the search phrase, which give the positions of the last word of the phrase.
You could potentially optimize the phrase-matching step by maintaining a sorted list of positions and doing binary searches: e.g., for each key k in H2, binary search for k+1 in the sorted position list for search term 3 and add k+1 to H3 if you find it, and so on.
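A small sketch of the position-list phrase matching described in this answer, with a plain dictionary standing in for the trie and 1-based word positions as in the example:

    from collections import defaultdict

    def build_index(text):
        # Word -> list of word positions; a dict stands in for the trie here.
        index = defaultdict(list)
        for pos, word in enumerate(text.lower().split(), start=1):
            index[word.strip(".,")].append(pos)        # 1-based, as in the example
        return index

    def find_phrase(index, phrase):
        words = phrase.lower().split()
        current = set(index.get(words[0], []))         # H1: positions of the first word
        for word in words[1:]:
            current = {x for x in index.get(word, []) if x - 1 in current}   # H2, H3, ...
        return sorted(current)                         # positions of the phrase's last word

    idx = build_index("That is that and so it is.")
    print(idx["that"], idx["is"])          # [1, 3] [2, 7]
    print(find_phrase(idx, "that is"))     # [2]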
An optimal solution for this problem is to use a suffix tree (or a suffix array). It's essentially a trie of all suffixes of a string. For a text of length O(N), this can be built in O(N).
Then all k occurrences of a string of length m can be answered optimally in O(m + k).
Suffix trees can also be used to efficiently find e.g. the longest palindrome, the longest common substring, the longest repeated substring, etc.
This is the typical data structure to use when analyzing DNA strings which can be millions/billions of bases long.
See also
Wikipedia/Suffix tree
Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology (Dan Gusfield).
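A simplified suffix-array sketch to make the query side concrete; the naive sort-based construction below is nowhere near the O(N) build this answer refers to, and bisect's key parameter requires Python 3.10+:

    import bisect

    def build_suffix_array(text):
        # Naive construction: sort suffix start positions by the suffixes themselves.
        # (Linear-time constructions exist; this only shows how queries work.)
        return sorted(range(len(text)), key=lambda i: text[i:])

    def find_all(text, sa, pattern):
        # Binary search for the block of suffixes that start with the pattern.
        key = lambda i: text[i:i + len(pattern)]
        lo = bisect.bisect_left(sa, pattern, key=key)
        hi = bisect.bisect_right(sa, pattern, key=key)
        return sorted(sa[lo:hi])

    text = "banana"
    sa = build_suffix_array(text)
    print(sa)                         # [5, 3, 1, 0, 4, 2]
    print(find_all(text, sa, "ana"))  # [1, 3]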
So you have lots of search terms and want to see if any of them are in the document?
Purely algorithmically, you could sort all your possibilities in alphabetical order, join them with pipes, and use them as a regular expression, provided the regex engine, looking at /ant|ape/, properly short-circuits on the shared 'a' if it didn't find "ant". If not, you could "precompile" the regex and "squish" the results down to their minimum overlap, i.e. /a(nt|pe)/ in the above case, and so on recursively for each letter.
However, doing the above is pretty much like putting all your search strings in a 26-ary tree (26 characters, more if you also allow digits). Push your strings onto the tree, using one level of depth per character of length.
If you have a large number of search terms, you can use this tree as a hyper-fast "does this word match anything in my list of search terms?" check.
You could theoretically do the reverse as well -- pack your document into the tree and then use the search terms on it -- if your document is static and the search terms change a lot.
Depends on how much optimization you need...
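A toy sketch of the "squish the alternation" idea from this answer: push the terms into a character tree and emit a regex with shared prefixes factored out (word boundaries and case handling are ignored here):

    import re

    def build_trie(words):
        root = {}
        for w in words:
            node = root
            for ch in w:
                node = node.setdefault(ch, {})
            node[""] = {}                       # empty key marks the end of a word
        return root

    def trie_to_regex(node):
        # Collapse a trie of search terms into a regex like a(?:nt|pe).
        if not node:
            return ""
        parts = []
        for ch, child in sorted(node.items()):
            parts.append(re.escape(ch) + trie_to_regex(child))
        return parts[0] if len(parts) == 1 else "(?:" + "|".join(parts) + ")"

    pattern = re.compile(trie_to_regex(build_trie(["ant", "ape", "apex"])))
    print(pattern.pattern)                            # a(?:nt|pe(?:|x))
    print(bool(pattern.search("the apex predator")))  # True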
Are the search terms single words that you are looking for, or can they be full sentences too?
If it's only words, then I would suggest building a Red-Black tree from all the words, and then searching for each word in the tree.
If it could be sentences, then it could get a lot more complex... (?)

Algorithm to find common substring across N strings

I'm familiar with LCS algorithms for 2 strings. Looking for suggestions for finding common substrings in 2..N strings. There may be multiple common substrings in each pair. There can be different common substrings in subsets of the strings.
strings: (ABCDEFGHIJKL) (DEF) (ABCDEF) (BIJKL) (FGH)
common strings:
1/2 (DEF)
1/3 (ABCDEF)
1/4 (IJKL)
1/5 (FGH)
2/3 (DEF)
longest common strings:
1/3 (ABCDEF)
most common strings:
1/2/3 (DEF)
This sort of thing is done all the time in DNA sequence analysis. You can find a variety of algorithms for it. One reasonable collection is listed here.
There's also the brute-force approach of making tables of every substring (if you're interested only in short ones): form an N-ary tree (N=26 for letters, 256 for ASCII) with N-way branching at each level, and store histograms of the counts at every node. If you prune off little-used nodes (to keep the memory requirements reasonable), you end up with an algorithm that finds all substrings of length up to M in something like N*M^2*log(M) time for input of length N. If you instead split this up into K separate strings, you can build the tree structure and just read off the answer(s) in a single pass through the tree.
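A brute-force sketch along those lines, with a plain dictionary of substring-to-owners standing in for the N-ary tree and its histograms, run on the example strings from the question:

    from collections import defaultdict

    def common_substrings(strings, max_len=6):
        # For every substring up to max_len, record which input strings contain it.
        found = defaultdict(set)
        for idx, s in enumerate(strings, start=1):
            for i in range(len(s)):
                for length in range(1, min(max_len, len(s) - i) + 1):
                    found[s[i:i + length]].add(idx)
        return found

    strings = ["ABCDEFGHIJKL", "DEF", "ABCDEF", "BIJKL", "FGH"]
    found = common_substrings(strings)
    shared = {sub: owners for sub, owners in found.items() if len(owners) > 1}
    longest = max(shared, key=len)
    print(longest, sorted(shared[longest]))   # ABCDEF [1, 3]
    print(sorted(found["DEF"]))               # [1, 2, 3]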
Suffix trees are the answer, unless you have really large strings where memory becomes a problem. Expect 10-30 bytes of memory usage per character in the string for a good implementation. There are a couple of open-source implementations too, which make your job easier.
There are other, more succinct algorithms too, but they are harder to implement (look for "compressed suffix trees").
