Data Structure used for efficient text matching - data-structures

I am a regular user of the Eclipse IDE. I have found that it is really fast at finding the occurrences of a given variable name, even in a very large section of code. I would be interested in how to go about building such a mechanism: what are the fastest data structures and algorithms used to do this? Of course, Eclipse is just an example. Thank you in advance.

As usual, the answer is, "it depends".
There's the case of efficiently searching for a chunk of matching text in a longer string.
The most common algorithm for this is the Boyer-Moore string search algorithm, and I believe that most implementations use a simple array of characters.
However, in the case of finding a variable name in the Eclipse editor, that is probably not what's happening. More likely, Eclipse is creating an Abstract Syntax Tree (AST) from your source code, and searching the tree. See for instance, http://www.eclipse.org/articles/Article-JavaCodeManipulation_AST/
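To make the string-search case concrete, here is a minimal sketch of the Boyer-Moore-Horspool variant (the simplified bad-character form of Boyer-Moore) in Java. Class and method names are illustrative, and real editors use far more heavily tuned implementations.

    public class HorspoolSearch {
        // Returns the index of the first occurrence of pattern in text, or -1.
        static int indexOf(String text, String pattern) {
            int n = text.length(), m = pattern.length();
            if (m == 0) return 0;
            // Bad-character table: how far we may shift when the character
            // under the pattern's last position is c.
            int[] shift = new int[Character.MAX_VALUE + 1];
            java.util.Arrays.fill(shift, m);
            for (int i = 0; i < m - 1; i++) {
                shift[pattern.charAt(i)] = m - 1 - i;
            }
            int pos = 0;
            while (pos <= n - m) {
                int j = m - 1;
                while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) j--;
                if (j < 0) return pos;                  // full match
                pos += shift[text.charAt(pos + m - 1)]; // skip ahead
            }
            return -1;
        }

        public static void main(String[] args) {
            System.out.println(indexOf("find the needle in the haystack", "needle")); // 9
        }
    }

The bad-character table is what lets the search skip up to a full pattern length on a mismatch, which is why it beats the naive scan on long patterns.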

Related

What data structure is the fastest to find the best matching prefix?

Context: I'm working on an analyzer for user agent strings (Yauaa), and as part of this analysis I want to make an educated guess about which brand of device should be reported. I have an implementation that I need to rewrite to be a lot more efficient.
Because I do not want to maintain a complete list of all devices, I want to do the detection based on the prefix of the model name.
So I have a dataset with prefixes and their associated brands:
"GT-" --> "Samsung"
"LLD-" --> "Huawei"
And then I want to do a .get("GT-1234124") which should result in "Samsung" because that is the "longest matching prefix".
I had a look at the Trie structure, but that seems to be for the opposite situation: as I understand it, you start with a set of values and can efficiently get all the values that start with a provided prefix.
If I were to implement this from scratch, I would use a tree similar to the Trie but walk it differently. What I'm looking for is a data structure that does what I need as fast as possible.
What data structure do you recommend for this use case?
Is there an existing (proven) implementation I can use?
I did some digging into data structures and found that essentially the Trie structure is what I need, just walked in a different way.
Since this structure is really simple, I created my own implementation that works very well.
See:
https://github.com/nielsbasjes/yauaa/blob/master/analyzer/src/main/java/nl/basjes/parse/useragent/utils/PrefixLookup.java
Updates:
I wrote an article about this https://techlab.bol.com/finding-the-longest-matching-string-prefix-fast/
I put my implementation into a separate library, which I open-sourced and which is already available via Maven Central. See https://github.com/nielsbasjes/prefixmap
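For illustration, here is a minimal sketch of the structure described above: a trie walked along the key, remembering the deepest value passed. This is not the actual PrefixLookup/PrefixMap code, just the core idea; the names are mine.

    import java.util.HashMap;
    import java.util.Map;

    public class PrefixTree<V> {
        private final Map<Character, PrefixTree<V>> children = new HashMap<>();
        private V value; // non-null only at the end of a stored prefix

        public void put(String prefix, V value) {
            PrefixTree<V> node = this;
            for (char c : prefix.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new PrefixTree<>());
            }
            node.value = value;
        }

        // Returns the value of the longest stored prefix of the key, or null.
        public V getLongestMatch(String key) {
            V best = null;
            PrefixTree<V> node = this;
            for (char c : key.toCharArray()) {
                node = node.children.get(c);
                if (node == null) break;                    // no deeper prefix stored
                if (node.value != null) best = node.value;  // remember deepest hit
            }
            return best;
        }

        public static void main(String[] args) {
            PrefixTree<String> brands = new PrefixTree<>();
            brands.put("GT-", "Samsung");
            brands.put("LLD-", "Huawei");
            System.out.println(brands.getLongestMatch("GT-1234124")); // Samsung
        }
    }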

What are the data structures behind letter-by-letter search?

Recently I've been reading about various data structures and their use in practice. I am especially interested in those used in searching, for example search suggestions from Google, or search in Windows.
If the text is fully typed, something like a hash table should find it in O(1), assuming the entries are already in the table. However, what happens when we type letter by letter, and the search runs on letters [1], [1-2], [1-3], and so on? Is some kind of suffix array or trie used in the process?
I think this page describing "string searching" or "string matching" algorithms is what you are looking for:
https://en.wikipedia.org/wiki/String_searching_algorithm
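For the letter-by-letter case, a trie is the classic answer, but any sorted structure gives the same behavior, since all completions of a prefix sit in one contiguous range. A minimal illustrative sketch in Java (class and method names are mine), using a sorted set as the simplest stand-in for a trie:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeSet;

    public class SuggestionIndex {
        private final TreeSet<String> words = new TreeSet<>();

        public void add(String word) {
            words.add(word.toLowerCase());
        }

        // All stored words starting with the prefix typed so far, up to a limit.
        public List<String> suggest(String prefix, int limit) {
            String p = prefix.toLowerCase();
            List<String> out = new ArrayList<>();
            for (String w : words.tailSet(p)) { // first word >= p, in order
                if (!w.startsWith(p)) break;    // left the prefix range
                out.add(w);
                if (out.size() == limit) break;
            }
            return out;
        }

        public static void main(String[] args) {
            SuggestionIndex index = new SuggestionIndex();
            for (String w : new String[] {"window", "windows", "wine", "word"}) index.add(w);
            System.out.println(index.suggest("win", 10)); // [window, windows, wine]
        }
    }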

How do I approximate "Did you mean?" without using Google?

I am aware of the duplicates of this question:
How does the Google “Did you mean?” Algorithm work?
How do you implement a “Did you mean”?
... and many others.
These questions ask how the algorithm actually works. My question is more like: let's assume Google did not exist, or maybe this feature did not exist and we don't have user input. How does one go about implementing an approximate version of this algorithm?
Why is this interesting?
Ok. Try typing "qualfy" into Google and it tells you:
Did you mean: qualify
Fair enough. It uses Statistical Machine Learning on data collected from billions of users to do this. But now try typing this: "Trytoreconnectyou" into Google and it tells you:
Did you mean: Try To Reconnect You
Now this is the more interesting part. How does Google determine this? Have a dictionary handy and guess the most probable words again using user input? And how does it differentiate between a misspelled word and a sentence?
Now considering that most programmers do not have access to input from billions of users, I am looking for the best approximate way to implement this algorithm and what resources are available (datasets, libraries etc.). Any suggestions?
Assuming you have a dictionary of words (all the words that appear in the dictionary in the worst case, all the phrases that appear in the data in your system in the best case) and that you know the relative frequency of the various words, you should be able to reasonably guess at what the user meant via some combination of the similarity of the word and the number of hits for the similar word. The weights obviously require a bit of trial and error, but generally the user will be more interested in a popular result that is a bit linguistically further away from the string they entered than in a valid word that is linguistically closer but only has one or two hits in your system.
The second case should be a bit more straightforward. You find all the valid words that begin the string ("T" is invalid, "Tr" is invalid, "Try" is a word, "Tryt" is not a word, etc.) and for each valid word, you repeat the algorithm for the remaining string. This should be pretty quick assuming your dictionary is indexed. If you find a result where you are able to decompose the long string into a set of valid words with no remaining characters, that's what you recommend. Of course, if you're Google, you probably modify the algorithm to look for substrings that are reasonably close typos to actual words and you have some logic to handle cases where a string can be read multiple ways with a loose enough spellcheck (possibly using the number of results to break the tie).
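A minimal Java sketch of the decomposition described above; the tiny dictionary is purely illustrative, and a real implementation would memoize on the remaining string to avoid exponential blow-up:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class WordSegmenter {
        private final Set<String> dictionary;

        WordSegmenter(Set<String> dictionary) {
            this.dictionary = dictionary;
        }

        // Returns one decomposition into dictionary words, or null if none exists.
        List<String> segment(String s) {
            if (s.isEmpty()) return new ArrayList<>();
            for (int end = s.length(); end >= 1; end--) { // prefer longer head words
                String head = s.substring(0, end);
                if (dictionary.contains(head)) {
                    List<String> rest = segment(s.substring(end));
                    if (rest != null) {
                        rest.add(0, head);
                        return rest;
                    }
                }
            }
            return null; // no split into valid words from here
        }

        public static void main(String[] args) {
            Set<String> dict = new HashSet<>(Arrays.asList("try", "to", "reconnect", "you"));
            System.out.println(new WordSegmenter(dict).segment("trytoreconnectyou"));
            // [try, to, reconnect, you]
        }
    }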
From the horse's mouth: How to Write a Spelling Corrector
The interesting thing here is how you don't need a bunch of query logs to approximate the algorithm. You can use a corpus of mostly-correct text (like a bunch of books from Project Gutenberg).
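A simplified sketch of the idea from that essay, condensed into Java (Norvig's original is Python and also considers edit distance 2; the word counts here are toy stand-ins for counts extracted from a corpus):

    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class SpellingCorrector {
        private final Map<String, Integer> wordCounts; // word -> corpus frequency

        SpellingCorrector(Map<String, Integer> wordCounts) {
            this.wordCounts = wordCounts;
        }

        // Every string one delete, transpose, replace, or insert away from w.
        private Set<String> edits1(String w) {
            Set<String> edits = new HashSet<>();
            String letters = "abcdefghijklmnopqrstuvwxyz";
            for (int i = 0; i <= w.length(); i++) {
                String left = w.substring(0, i), right = w.substring(i);
                if (!right.isEmpty()) edits.add(left + right.substring(1)); // delete
                if (right.length() > 1)                                     // transpose
                    edits.add(left + right.charAt(1) + right.charAt(0) + right.substring(2));
                for (char c : letters.toCharArray()) {
                    if (!right.isEmpty()) edits.add(left + c + right.substring(1)); // replace
                    edits.add(left + c + right);                                    // insert
                }
            }
            return edits;
        }

        String correct(String word) {
            if (wordCounts.containsKey(word)) return word; // already a known word
            return edits1(word).stream()
                    .filter(wordCounts::containsKey)
                    .max(Comparator.comparingInt(wordCounts::get))
                    .orElse(word); // no candidate: return the input unchanged
        }

        public static void main(String[] args) {
            Map<String, Integer> counts = new HashMap<>();
            counts.put("qualify", 120);
            counts.put("quality", 300);
            System.out.println(new SpellingCorrector(counts).correct("qualfy")); // qualify
        }
    }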
I think this can be done using a spellchecker along with N-grams.
For Trytoreconnectyou, we first check against all 1-grams (all dictionary words) and find that the closest match is pretty terrible. So we try 2-grams (which can be built by removing spaces from phrases of length 2), then 3-grams, and so on. When we try a 4-gram, we find a phrase that is at distance 0 from our search term. Since we can't do better than that, we return that answer as the suggestion.
I know this is very inefficient, but Peter Norvig's post here suggests clearly that Google uses spell correctors to generate its suggestions. Since Google has massive parallelization capabilities, it can accomplish this task very quickly.
An impressive tutorial on how this works can be found here: http://alias-i.com/lingpipe-3.9.3/demos/tutorial/querySpellChecker/read-me.html
In a few words, it is a trade-off between query modification (at the character or word level) and increased coverage in the searched documents. For example, "aple" leads to 2 million documents, but "apple" leads to 60 million, and the modification is only one character, so it is fairly obvious that you meant "apple".
Datasets/tools that might be useful:
WordNet
Corpora such as the ukWaC corpus
You can use WordNet as a simple dictionary of terms, and you can boost that with frequent terms extracted from a corpus.
You can use the Peter Norvig link mentioned before as a first attempt, but with a large dictionary, this won't be a good solution.
Instead, I suggest you use something like locality sensitive hashing (LSH). This is commonly used to detect duplicate documents, but it will work just as well for spelling correction. You will need a list of terms and strings of terms extracted from your data that you think people may search for - you'll have to choose a cut-off length for the strings. Alternatively if you have some data of what people actually search for, you could use that. For each string of terms you generate a vector (probably character bigrams or trigrams would do the trick) and store it in LSH.
Given any query, you can use an approximate nearest neighbour search on the LSH described by Charikar to find the closest neighbour out of your set of possible matches.
Note: links removed as I'm a new user - sorry.
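The scheme Charikar described is often called SimHash (random-hyperplane hashing). A minimal, untuned sketch over character trigrams in Java - illustrative only, with the hash mixing constant chosen arbitrarily:

    import java.util.HashSet;
    import java.util.Set;

    public class SimHashSketch {
        // Character trigrams, padded so word boundaries contribute features too.
        static Set<String> trigrams(String s) {
            Set<String> grams = new HashSet<>();
            String padded = "  " + s.toLowerCase() + "  ";
            for (int i = 0; i + 3 <= padded.length(); i++) {
                grams.add(padded.substring(i, i + 3));
            }
            return grams;
        }

        // 64-bit SimHash: every feature votes on every bit via its own hash.
        static long simHash(String s) {
            int[] votes = new int[64];
            for (String gram : trigrams(s)) {
                long h = gram.hashCode() * 0x9E3779B97F4A7C15L; // spread the bits
                for (int bit = 0; bit < 64; bit++) {
                    votes[bit] += ((h >>> bit) & 1) == 1 ? 1 : -1;
                }
            }
            long signature = 0;
            for (int bit = 0; bit < 64; bit++) {
                if (votes[bit] > 0) signature |= 1L << bit;
            }
            return signature;
        }

        // Similar strings land at a small Hamming distance from each other.
        static int hammingDistance(long a, long b) {
            return Long.bitCount(a ^ b);
        }

        public static void main(String[] args) {
            long q = simHash("trytoreconnectyou");
            long c = simHash("try to reconnect you");
            // Expect a much smaller distance than for two unrelated strings.
            System.out.println(hammingDistance(q, c));
        }
    }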
@Legend - Consider using one of the variations of the Soundex algorithm. It has some known flaws, but it works decently well in most applications that need to approximate misspelled words.
Edit (2011-03-16):
I suddenly remembered another Soundex-like algorithm that I had run across a couple of years ago. In this Dr. Dobb's article, Lawrence Philips discusses improvements to his Metaphone algorithm, dubbed Double Metaphone.
You can find a Python implementation of this algorithm here, and more implementations on the same site here.
Again, these algorithms won't be the same as what Google uses, but for English-language words they should get you very close. You can also check out the Wikipedia page on phonetic algorithms for a list of other similar algorithms.
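To make the phonetic-coding idea concrete, here is a minimal Java sketch of classic American Soundex (Double Metaphone is considerably more involved); names are illustrative:

    public class Soundex {
        // Digit codes for 'a'..'z'; '0' marks vowels and h/w, which emit nothing.
        private static final String CODES = "01230120022455012623010202";

        static String encode(String word) {
            String w = word.toLowerCase().replaceAll("[^a-z]", "");
            if (w.isEmpty()) return "";
            StringBuilder out = new StringBuilder();
            out.append(Character.toUpperCase(w.charAt(0)));
            char prev = CODES.charAt(w.charAt(0) - 'a');
            for (int i = 1; i < w.length() && out.length() < 4; i++) {
                char c = w.charAt(i);
                char code = CODES.charAt(c - 'a');
                if (code != '0' && code != prev) out.append(code);
                // h and w are "transparent": they do not reset the previous
                // code, so consonants around them still merge. Vowels do reset.
                if (c != 'h' && c != 'w') prev = code;
            }
            while (out.length() < 4) out.append('0'); // pad to letter + 3 digits
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(encode("Robert")); // R163
            System.out.println(encode("Rupert")); // R163, same code
        }
    }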
Take a look at this: How does the Google "Did you mean?" Algorithm work?

How can I build an incremental directed acyclic word graph to store and search strings?

I am trying to store a large list of strings in a concise manner so that they can be very quickly analyzed/searched through.
A directed acyclic word graph (DAWG) suits this purpose wonderfully. However, I do not have a list of the strings to include in the first place, so it must be incrementally buildable. Additionally, when I search through it for a string, I need to bring back data associated with the result (not just a boolean saying if it was present).
I have found information on a modification of the DAWG for string data tracking here: http://www.pathcom.com/~vadco/adtdawg.html It looks extremely, extremely complex and I am not sure I am capable of writing it.
I have also found a few research papers describing incremental building algorithms, though I've found that research papers in general are not very helpful.
I don't think I am advanced enough to be able to combine both of these algorithms myself. Is there documentation of an algorithm already that features these, or an alternative algorithm with good memory use & speed?
I wrote the ADTDAWG web page. Adding words after construction is not an option. The structure is nothing more than 4 arrays of unsigned integer types. It was designed to be immutable for total CPU cache inclusion, and minimal multi-thread access complexity.
The structure is an automaton that forms a minimal and perfect hash function. It was built for speed while traversing recursively using an explicit stack.
As published, it supports up to 18 characters. Including all 26 English chars will require further augmentation.
My advice is to use a standard Trie, with an array index stored in each node (sketched below). Yes, it is going to seem infantile, but each END_OF_WORD node represents only one word, whereas the ADTDAWG is a solution to the problem that each END_OF_WORD node in a traditional DAWG represents many, many words.
Minimal and perfect hash tables are not the sort of thing that you can just put together on the fly.
I am looking for something else to work on, or a job, so contact me, and I'll do what I can. For now, all I can say is that it is unrealistic to use heavy optimization on a structure that is subject to being changed frequently.
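A hedged sketch of that trie-with-index advice in Java: a plain trie whose END_OF_WORD nodes hold an index into a side array of per-word data, so it is incrementally buildable and lookups return the associated record rather than just a boolean. Names are illustrative.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class DataTrie<V> {
        private static class Node {
            final Map<Character, Node> children = new HashMap<>();
            int dataIndex = -1; // >= 0 only at END_OF_WORD nodes
        }

        private final Node root = new Node();
        private final List<V> data = new ArrayList<>();

        public void put(String word, V value) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            if (node.dataIndex < 0) {            // new word: append its record
                node.dataIndex = data.size();
                data.add(value);
            } else {
                data.set(node.dataIndex, value); // existing word: overwrite
            }
        }

        public V get(String word) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.get(c);
                if (node == null) return null;
            }
            return node.dataIndex >= 0 ? data.get(node.dataIndex) : null;
        }

        public static void main(String[] args) {
            DataTrie<String> trie = new DataTrie<>();
            trie.put("cat", "feline");
            trie.put("car", "vehicle");
            System.out.println(trie.get("car")); // vehicle
            System.out.println(trie.get("ca"));  // null: prefix, not a stored word
        }
    }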
Java
For graph problems which require persistence, I'd take a look at the Neo4j graph DB project. Neo4j is designed to store large graphs and allow incremental building and modification of the data, which seems to meet the criteria you describe.
They have some good examples to get you going quickly and there's usually example code to get you started with most problems.
They have a DAG example with a link at the bottom to the full source code.
C++
If you're using C++, a common solution to graph building/analysis is to use the Boost graph library. To persist your graph you could maintain a file based version of the graph in GraphML (for example) and read and write to that file as your graph changes.
You may also want to look at a trie structure for this (potentially building a radix-tree). It seems like a decent 'simple' alternative structure.
I'm suggesting this for a few reasons:
I really don't have a full understanding of your result.
Definitely incremental to build.
Leaf nodes can contain any data you wish.
Subjectively, a simple algorithm.

Fast text editor find

Does anyone know how text editors/programmers' editors are able to do such fast searches on very large text files?
Are they indexing on load, at the start of the find, or using some other clever technique?
I desperately need a faster implementation of what I have, which is a desperately slow walk from the top to the bottom of the text.
Any ideas are really appreciated.
This is for a C# implementation, but it's the technique I'm interested in more than the actual code.
Begin with the Boyer-Moore search algorithm. It requires some preprocessing (which is fast) and does searching pretty well - especially when searching for long substrings.
I wouldn't be surprised if most just use the basic, naive search technique (scan for a match on the 1st char, then test if the hit pans out).
The cost of trying too hard: String searching
Eric Lippert's comment in the above blog post
grep
Although not a text editor in itself, grep is often called by many text editors. Have you tried reading grep's source code? It has always seemed blazingly fast to me, even when searching large files.
One method I know of that has not yet been mentioned is Knuth-Morris-Pratt search (KMP). It isn't so good for natural-language text, because its advantage comes from repeated prefixes within the pattern, which ordinary words rarely have, but for things like DNA matching it is very, very good.
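For reference, a compact KMP sketch in Java (the question mentions C#, but the algorithm translates directly); names are illustrative:

    public class KmpSearch {
        static int indexOf(String text, String pattern) {
            int m = pattern.length();
            if (m == 0) return 0;
            // fail[i]: length of the longest proper prefix of pattern[0..i]
            // that is also a suffix of it.
            int[] fail = new int[m];
            for (int i = 1, k = 0; i < m; i++) {
                while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) k = fail[k - 1];
                if (pattern.charAt(i) == pattern.charAt(k)) k++;
                fail[i] = k;
            }
            // Scan the text; never move backwards in it.
            for (int i = 0, k = 0; i < text.length(); i++) {
                while (k > 0 && text.charAt(i) != pattern.charAt(k)) k = fail[k - 1];
                if (text.charAt(i) == pattern.charAt(k)) k++;
                if (k == m) return i - m + 1; // match ends at i
            }
            return -1;
        }

        public static void main(String[] args) {
            System.out.println(indexOf("abcabcabd", "abcabd")); // 3
        }
    }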
Another one is hash-based search, commonly known as the Rabin-Karp algorithm. First, you compute a hash value of your pattern, then you slide a window (the size of your pattern) over your text and check whether the window's hash matches the pattern's. The idea is to choose the hash so that you don't have to recompute it for the complete window: you update the hash with just the next character, while the oldest character drops out of the computation. This algorithm performs very well when you have multiple strings to search for (because you can compute all their hashes beforehand).
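A minimal Java sketch of that rolling-hash idea for a single pattern (the multi-pattern version keeps a set of pattern hashes instead). Equal hashes can be collisions, so matches are verified before being reported; the base and modulus are typical but arbitrary choices.

    public class RabinKarp {
        private static final long BASE = 256;
        private static final long MOD = 1_000_000_007L;

        static int indexOf(String text, String pattern) {
            int n = text.length(), m = pattern.length();
            if (m == 0) return 0;
            if (m > n) return -1;

            long patHash = 0, winHash = 0, pow = 1; // pow = BASE^(m-1) mod MOD
            for (int i = 0; i < m; i++) {
                patHash = (patHash * BASE + pattern.charAt(i)) % MOD;
                winHash = (winHash * BASE + text.charAt(i)) % MOD;
                if (i < m - 1) pow = (pow * BASE) % MOD;
            }
            for (int pos = 0; ; pos++) {
                // Verify on hash match: equal hashes can still be a collision.
                if (patHash == winHash && text.regionMatches(pos, pattern, 0, m)) return pos;
                if (pos == n - m) return -1;
                // Slide the window: drop the oldest char, pull in the next one.
                winHash = (winHash - text.charAt(pos) * pow % MOD + MOD) % MOD;
                winHash = (winHash * BASE + text.charAt(pos + m)) % MOD;
            }
        }

        public static void main(String[] args) {
            System.out.println(indexOf("GGATCGATTACAGG", "GATTACA")); // 5
        }
    }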
