Fast text editor find - algorithm

Does anyone know how text editors/programmers' editors are able to do such fast searches on very large text files?
Are they indexing on load, at the start of the find or some other clever technique?
I desperately need a faster implementation of what I have, which is a painfully slow walk from the top to the bottom of the text.
Any ideas are really appreciated.
This is for a C# implementation, but it's the technique I'm interested in more than the actual code.

Begin with the Boyer-Moore search algorithm. It requires some preprocessing (which is fast) and does searching very well, especially when searching for long substrings.
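A minimal sketch of the closely related Boyer-Moore-Horspool variant (bad-character shift only), in Python for illustration; the same technique translates directly to C#:

```python
def horspool_search(text, pattern):
    """Boyer-Moore-Horspool: a simplified Boyer-Moore that uses only the
    bad-character shift table. Returns the index of the first match, or -1."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return 0 if m == 0 else -1
    # Preprocessing: how far the window may jump when the text character
    # aligned with the pattern's last position is c.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    i = 0
    while i <= n - m:
        j = m - 1                                # compare right to left
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1
        if j < 0:
            return i                             # full match at position i
        i += shift.get(text[i + m - 1], m)       # characters not in the pattern jump by m
    return -1
```
The longer the pattern, the bigger the average jump, which is why it shines on long substrings.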

I wouldn't be surprised if most just use the basic, naive search technique (scan for a match on the 1st char, then test if the hit pans out).
The cost of trying too hard: String searching
Eric Lippert's comment in the above blog post
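For comparison, a minimal sketch of that naive scan:

```python
def naive_search(text, pattern):
    """Scan for the pattern's first character, then check whether the
    rest of the hit pans out. Returns the first match index, or -1."""
    if not pattern:
        return 0
    first = pattern[0]
    for i in range(len(text) - len(pattern) + 1):
        if text[i] == first and text[i:i + len(pattern)] == pattern:
            return i
    return -1
```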

grep
Although not a text editor itself, grep is often called by many text editors. Have you tried looking at grep's source code? It has always seemed blazingly fast to me, even when searching large files.

One method not yet mentioned is Knuth-Morris-Pratt (KMP) search. It isn't so good for natural-language text (because its speed-up comes from repeated prefixes within the pattern, which natural-language words rarely have), but for things like DNA matching it is very good.
Another one is a hash search (I don't know whether there is an official name, but it is essentially the Rabin-Karp algorithm). First, you compute a hash of your pattern, then you slide a window the size of your pattern over the text and check whether the hashes match. The idea is to choose the hash so that you don't have to recompute it for the whole window: you update the hash with the incoming character while the outgoing character drops out of the computation. This performs very well when you have multiple strings to search for, because you just precompute the hashes of all your patterns beforehand.
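A minimal sketch of that rolling-hash search (essentially Rabin-Karp); the base and modulus here are arbitrary choices:

```python
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    """Slide a window of len(pattern) over text, updating the hash in O(1)
    per step; verify real equality only when the hashes match."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return 0 if m == 0 else -1
    high = pow(base, m - 1, mod)      # weight of the character leaving the window
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod
    for i in range(n - m + 1):
        if w_hash == p_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:
            # Drop text[i] from the hash and bring in text[i + m].
            w_hash = ((w_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return -1
```
For multiple patterns of the same length you would precompute one hash per pattern and compare the window hash against that set.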

Related

Is there a hashing algorithm that is tolerant of minor differences?

I'm doing some web crawling type stuff where I'm looking for certain terms in webpages and finding their location on the page, and then caching it for later use. I'd like to be able to check the page periodically for any major changes. Something like md5 can be foiled by simply putting the current date and time on the page.
Are there any hashing algorithms that work for something like this?
A common way to do document similarity is shingling, which is somewhat more involved than hashing. Also look into content defined chunking for a way to split up the document.
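A minimal sketch of word-level shingling, using Jaccard similarity of the shingle sets as the comparison (the shingle size of 4 words is an arbitrary choice):

```python
def shingles(text, k=4):
    """The set of k-word shingles (overlapping word windows) of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# A one-word change (e.g. a date stamp) overlaps far more than an unrelated page would.
old = shingles("page content here updated on 2011-03-15 more content follows below")
new = shingles("page content here updated on 2011-03-16 more content follows below")
print(jaccard(old, new))
```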
I read a paper a few years back about using Bloom filters for similarity detection. Using Bloom Filters to Refine Web Search Results. It's an interesting idea, but I never got around to experimenting with it.
This might be a good place to use the Levenshtein distance metric, which quantifies the amount of editing required to transform one sequence into another.
The drawback of this approach is that you'd need to keep the full text of each page so that you could compare them later. With a hash-based approach, on the other hand, you simply store some sort of small computed value and don't require the previous full text for comparison.
You also might try some sort of hybrid approach--let a hashing algorithm tell you that any change has been made, and use it as a trigger to retrieve an archival copy of the document for more rigorous (Levenshtein) comparison.
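For reference, the standard dynamic-programming computation of Levenshtein distance:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (O(len(a) * len(b)))."""
    prev = list(range(len(b) + 1))              # distance from "" to b[:j]
    for i, ca in enumerate(a, 1):
        curr = [i]                              # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```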
http://www.phash.org/ did something like this for images. The gist: take an image, blur it, convert it to greyscale, do a discrete cosine transform, and look at just the upper left quadrant of the result (where the important information is). Then record a 0 for each value less than the average and a 1 for each value more than the average. The result is pretty robust to small changes.
Min-Hashing is another possibility. Find features in your text and record them as a value. Concatenate all those values to make a hash string.
For both of the above, use a vantage point tree so that you can search for near-hits.
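A minimal sketch of MinHash over word shingles; the 64-hash signature length and the use of salted BLAKE2 as the hash family are assumptions, not a canonical choice:

```python
import hashlib

def minhash_signature(text, num_hashes=64, k=3):
    """For each of num_hashes seeded hash functions, keep the minimum hash
    value over the document's k-word shingles."""
    words = text.lower().split()
    shingle_set = {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(hashlib.blake2b(s.encode(), salt=salt, digest_size=8).digest(), "big")
            for s in shingle_set))
    return signature

def estimated_similarity(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```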
I am sorry to say, but hash algorithms are designed to be exact: there is none that is tolerant of minor differences. You should take another approach.

OCR error correction algorithms

I'm working on digitizing a large collection of scanned documents, working with Tesseract 3 as my OCR engine. The quality of its output is mediocre, as it often produces both garbage characters before and after the actual text, and misspellings within the text.
For the former problem, it seems like there must be strategies for determining which text is actually text and which text isn't (much of this text is things like people's names, so I'm looking for solutions other than looking up words in a dictionary).
For the typo problem, most of the errors stem from a few misclassifications of letters (substituting l, 1, and I for one another, for instance), and it seems like there should be methods for guessing which words are misspelled (since not too many words in English have a "1" in the middle of them), and guessing what the appropriate correction is.
What are the best practices in this space? Are there free/open-source implementations of algorithms that do this sort of thing? Google has yielded lots of papers, but not much concrete. If there aren't implementations available, which of the many papers would be a good starting place?
For "determining which text is actually text and which text isn't" you might want to look at rmgarbage from same department that developed Tesseract (the ISRI). I've written a Perl implementation and there's also a Ruby implementation. For the 1 vs. l problem I'm experimenting with ocrspell (again from the same department), for which their original source is available.
I can only post two links, so the missing ones are:
ocrspell: enter "10.1007/PL00013558" at dx.doi.org
rmgarbage: search for "Automatic Removal of Garbage Strings in OCR Text: An Implementation"
ruby implementation: search for "docsplit textcleaner"
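For the l/1/I class of errors the question mentions, a rough sketch of generating candidate corrections by substituting commonly confused characters and keeping the ones found in a wordlist; the confusion map and the `dictionary` set here are assumptions, not part of ocrspell:

```python
from itertools import product

# Commonly confused single characters (an assumed, incomplete map).
CONFUSIONS = {
    "1": "1lI", "l": "l1I", "I": "Il1",
    "0": "0O", "O": "O0",
    "5": "5S", "S": "S5",
}

def correction_candidates(token, dictionary, max_variants=1000):
    """Expand each confusable character into its alternatives and keep the
    variants that appear in the dictionary (a plain set of lowercase words)."""
    choices = [CONFUSIONS.get(c, c) for c in token]
    candidates = set()
    for n, combo in enumerate(product(*choices)):
        if n >= max_variants:                    # guard against combinatorial blow-up
            break
        word = "".join(combo)
        if word.lower() in dictionary:
            candidates.add(word)
    return candidates

print(correction_candidates("he1lo", {"hello", "world"}))   # {'hello'}
```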
Something that could be useful for you is to try this free online OCR and compare its results with yours to see if by playing with the image (e.g. scaling up/down) you could improve the results.
I was using it as an "upper bound" of the results I should get when using tesseract myself (after using OpenCV to modify the images).

How do I approximate "Did you mean?" without using Google?

I am aware of the duplicates of this question:
How does the Google “Did you mean?” Algorithm work?
How do you implement a “Did you mean”?
... and many others.
Those questions ask how the algorithm actually works. My question is more like: let's assume Google did not exist, or maybe this feature did not exist, and we don't have user input. How does one go about implementing an approximate version of this algorithm?
Why is this interesting?
Ok. Try typing "qualfy" into Google and it tells you:
Did you mean: qualify
Fair enough. It uses Statistical Machine Learning on data collected from billions of users to do this. But now try typing this: "Trytoreconnectyou" into Google and it tells you:
Did you mean: Try To Reconnect You
Now this is the more interesting part. How does Google determine this? Have a dictionary handy and guess the most probable words, again using user input? And how does it differentiate between a misspelled word and a sentence?
Now considering that most programmers do not have access to input from billions of users, I am looking for the best approximate way to implement this algorithm and what resources are available (datasets, libraries etc.). Any suggestions?
Assuming you have a dictionary of words (all the words that appear in a dictionary in the worst case, all the phrases that appear in the data in your system in the best case) and that you know the relative frequency of the various words, you should be able to make a reasonable guess at what the user meant via some combination of the similarity of the word and the number of hits for the similar word. The weights obviously require a bit of trial and error, but generally the user will be more interested in a popular result that is linguistically a bit further from the string they entered than in a valid word that is linguistically closer but has only one or two hits in your system.
The second case should be a bit more straightforward. You find all the valid words that begin the string ("T" is invalid, "Tr" is invalid, "Try" is a word, "Tryt" is not a word, etc.) and for each valid word, you repeat the algorithm for the remaining string. This should be pretty quick assuming your dictionary is indexed. If you find a result where you are able to decompose the long string into a set of valid words with no remaining characters, that's what you recommend. Of course, if you're Google, you probably modify the algorithm to look for substrings that are reasonably close typos to actual words and you have some logic to handle cases where a string can be read multiple ways with a loose enough spellcheck (possibly using the number of results to break the tie).
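A minimal sketch of that decomposition, assuming a plain set of dictionary words (memoised so repeated suffixes are not re-explored):

```python
from functools import lru_cache

def segment(text, dictionary):
    """Split a run-together string into dictionary words, or return None
    if no complete decomposition exists."""
    words = frozenset(w.lower() for w in dictionary)
    text = text.lower()

    @lru_cache(maxsize=None)
    def split(i):
        if i == len(text):
            return []
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in words:
                rest = split(j)
                if rest is not None:
                    return [text[i:j]] + rest
        return None                  # no valid decomposition starts at position i

    return split(0)

print(segment("Trytoreconnectyou", {"try", "to", "reconnect", "you"}))
# ['try', 'to', 'reconnect', 'you']
```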
From the horse's mouth: How to Write a Spelling Corrector
The interesting thing here is how you don't need a bunch of query logs to approximate the algorithm. You can use a corpus of mostly-correct text (like a bunch of books from Project Gutenberg).
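A minimal sketch in the spirit of that corrector: one-edit candidates ranked by corpus frequency. The `corpus.txt` file name is a placeholder for whatever text you train on, and Norvig's full version also considers two-edit candidates and a proper error model:

```python
import re
from collections import Counter

# Word frequencies from any large, mostly-correct text (placeholder file name).
WORDS = Counter(re.findall(r"[a-z]+", open("corpus.txt", encoding="utf-8").read().lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Prefer the word itself if known, else the most frequent known one-edit candidate."""
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=WORDS.get)

# correct("qualfy") -> "qualify", provided "qualify" occurs in the corpus
```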
I think this can be done using a spellchecker along with N-grams.
For Trytoreconnectyou, we first check with all 1-grams (all dictionary words) and find a closest match that's pretty terrible. So we try 2-grams (which can be built by removing spaces from phrases of length 2), and then 3-grams and so on. When we try a 4-gram, we find that there is a phrase that is at 0 distance from our search term. Since we can't do better than that, we return that answer as the suggestion.
I know this is very inefficient, but Peter Norvig's post here clearly suggests that Google uses spell correctors to generate its suggestions. Since Google has massive parallelization capabilities, it can accomplish this task very quickly.
An impressive tutorial on how it works can be found here: http://alias-i.com/lingpipe-3.9.3/demos/tutorial/querySpellChecker/read-me.html
In a few words, it is a trade-off between query modification (at the character or word level) and increased coverage in the searched documents. For example, "aple" leads to 2 million documents, but "apple" leads to 60 million, and the modification is only one character, so it is fairly obvious that you meant "apple".
Datasets/tools that might be useful:
WordNet
Corpora such as the ukWaC corpus
You can use WordNet as a simple dictionary of terms, and you can boost that with frequent terms extracted from a corpus.
You can use the Peter Norvig link mentioned before as a first attempt, but with a large dictionary, this won't be a good solution.
Instead, I suggest you use something like locality sensitive hashing (LSH). This is commonly used to detect duplicate documents, but it will work just as well for spelling correction. You will need a list of terms and strings of terms extracted from your data that you think people may search for - you'll have to choose a cut-off length for the strings. Alternatively if you have some data of what people actually search for, you could use that. For each string of terms you generate a vector (probably character bigrams or trigrams would do the trick) and store it in LSH.
Given any query, you can use an approximate nearest neighbour search on the LSH described by Charikar to find the closest neighbour out of your set of possible matches.
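A rough sketch of the signature step, using sign random projections in the spirit of Charikar's scheme over character-bigram vectors; the bigram features and the 64-bit length are arbitrary choices, and a real index would additionally band the bits into buckets so you only compare candidates that share a band:

```python
import hashlib
from collections import Counter

def bigram_vector(s):
    """Character-bigram counts of a lowercased string."""
    s = s.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def signature(s, num_bits=64):
    """Each bit is the sign of a random projection; the +/-1 weight for a
    (bit, feature) pair is derived deterministically from the feature's hash."""
    totals = [0.0] * num_bits
    for feature, count in bigram_vector(s).items():
        h = int.from_bytes(hashlib.blake2b(feature.encode(), digest_size=8).digest(), "big")
        for bit in range(num_bits):
            sign = 1 if (h >> bit) & 1 else -1
            totals[bit] += sign * count
    return tuple(t > 0 for t in totals)

def hamming(sig_a, sig_b):
    """Smaller Hamming distance means the underlying bigram vectors are closer."""
    return sum(a != b for a, b in zip(sig_a, sig_b))

print(hamming(signature("recieve the package"), signature("receive the package")))  # small
print(hamming(signature("recieve the package"), signature("unrelated query text")))  # larger
```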
#Legend - Consider using one of the variations of the Soundex algorithm. It has some known flaws, but it works decently well in most applications that need to approximate misspelled words.
Edit (2011-03-16):
I suddenly remembered another Soundex-like algorithm that I had run across a couple of years ago. In this Dr. Dobb's article, Lawrence Philips discusses improvements to his Metaphone algorithm, dubbed Double Metaphone.
You can find a Python implementation of this algorithm here, and more implementations on the same site here.
Again, these algorithms won't be the same as what Google uses, but for English language words they should get you very close. You can also check out the wikipedia page for Phonetic Algorithms for a list of other similar algorithms.
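For reference, a slightly simplified version of classic American Soundex (the full rules have a few more edge cases around h/w and vowel separators):

```python
def soundex(word):
    """First letter plus three digits encoding the remaining consonants by sound class."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    if not word:
        return ""
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for c in word[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            result += code
        if c not in "hw":            # h and w do not reset the previous code
            prev = code
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))   # both R163
```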
Take a look at this: How does the Google "Did you mean?" Algorithm work?

Extract small relevant bits text (as Google does) from the full text search results

I have implemented a full text search in a discussion forum database and I want to display the search results the way Google does. Even for a very long HTML page, only two or three lines of text are shown in the search result list, and usually these are the lines that contain the search terms.
What would be a good algorithm for extracting those few lines of text, given the text itself and the search terms? I could think of something as simple as showing one line of text before the search term occurrence and one line after, but that seems too simple to work.
Would like to get a few directions, ideas and insights.
Thank you.
If you are looking for something fancier than the 'line before/after' approach, a summarizer might do the trick.
Here's a Naive Bayes based system: http://classifier4j.sourceforge.net/
Bayes is the statistical system used by many spam filters - I researched Bayes summarizers a few years back, and found that they do a pretty good job of summarizing text, as long as there is a decent amount of text to process. I haven't actually tried the above library, though, so your mileage may vary.
Have you tried the "line before/after search term occurrence" approach in code, to see whether for that simple coding investment the results are good enough for what you want? It might already be enough.
Otherwise, you could go for pieces of sentences: split not just on newlines, but also on full stops, commas, spaced-out hyphens, etc. Then show the pieces that contain the search terms. You could separate each matching sentence piece with "..." or something similar.
If you get a lot of these pieces, you could try to prioritize the pieces, sort on descending priority and only show the first n of them. And/or cut down the pieces to just the search term and a couple of words around the search term.
Just a couple of informal ideas that might get you started?
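A rough sketch of that piece-based idea: split on sentence-ish delimiters, score each piece by how many search-term occurrences it contains, and join the best pieces in document order with "..." between them:

```python
import re

def snippet(text, terms, max_pieces=3):
    """Return up to max_pieces sentence fragments that mention the search terms."""
    pieces = [p.strip() for p in re.split(r"[.\n;!?]+", text) if p.strip()]
    terms = [t.lower() for t in terms]

    scored = []
    for i, piece in enumerate(pieces):
        low = piece.lower()
        score = sum(low.count(t) for t in terms)
        if score > 0:
            scored.append((score, i, piece))

    # Keep the highest-scoring pieces, then restore document order.
    best = sorted(sorted(scored, reverse=True)[:max_pieces], key=lambda x: x[1])
    return " ... ".join(piece for _, _, piece in best)

print(snippet("The quick brown fox. It jumped over the lazy dog. "
              "Nothing to see here. Foxes are quick.", ["fox", "quick"]))
# The quick brown fox ... Foxes are quick
```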
Concentrate on the beginning of the content. Think of where you would look when you visit a blog: the opening paragraph tells you whether the article is heading in the right direction, so it makes sense for your algorithm to reflect this.
Check for occurrences of the search term in headings (H1,H2 etc) and give more priority to them.
This should get you started.

Need a highly efficient algorithm to check if a string contains English text

I have got many strings. All of them contain only letters, and the words are not separated from each other by spaces. Some of the characters form English words and the others are just bufflegab. The strings may not contain a whole sentence.
I need to find out which of them are written in valid English. What I mean by that is that the string could be built by concatenating well-formed English words. I know I could do something with a wordlist, but since the words are not separated from each other, it could be very time-consuming to test every possible word combination.
I am searching for a high-performance algorithm or method that checks whether the strings are built of English words. Maybe there is something that at least gives me the probability that a string contains English.
Do you know a method or algorithm that helps me?
Does something like Sphinx help me?
This is called the segmentation problem.
There is no trivial way to solve this. What I can suggest to you based on my guess of your knowledge level, is to build a trie out of your dictionary, and at the first chance you detect a possible word, try assuming that it is the word.
If later on, you find out that the last part of the word is gibberish, then you backtrack to the last time you decided a sequence of letter was a word, and ignore that word.
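A minimal sketch of that trie-plus-backtracking idea, with a plain dict-of-dicts trie and memoisation of positions that are already known to fail:

```python
END = object()   # marker meaning "a word ends at this node"

def build_trie(words):
    """Nested-dict trie: each node maps a character to its child node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def is_english(text, trie):
    """True if text can be split entirely into dictionary words."""
    n = len(text)
    failed = set()                   # positions already known to lead nowhere

    def walk(i):
        if i == n:
            return True
        if i in failed:
            return False
        node, j = trie, i
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node and walk(j):   # a word ends here; try to continue
                return True
        failed.add(i)                     # backtrack: no split works from i
        return False

    return walk(0)

trie = build_trie(["cat", "cats", "and", "sand", "dog"])
print(is_english("catsanddog", trie))   # True
print(is_english("catsandog", trie))    # False
```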
If your strings are long enough or your bufflegab strange enough, letter frequencies - possibly also bigram frequencies, trigram frequencies, etc. - might be sufficient (instead of the more general N-grams). For example, some browsers use that to guess the code page.
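A rough sketch of the letter-bigram scoring idea; the sample-text file name and the unseen-bigram penalty are placeholders to tune against your own data:

```python
import math
from collections import Counter

def train_bigrams(sample_text):
    """Log-probabilities of letter bigrams estimated from an English sample."""
    letters = [c for c in sample_text.lower() if c.isalpha()]
    counts = Counter(zip(letters, letters[1:]))
    total = sum(counts.values())
    return {bg: math.log(c / total) for bg, c in counts.items()}

def avg_bigram_logprob(s, model, unseen_penalty=-12.0):
    """Average per-bigram log-probability; gibberish scores much lower than English."""
    letters = [c for c in s.lower() if c.isalpha()]
    bigrams = list(zip(letters, letters[1:]))
    if not bigrams:
        return unseen_penalty
    return sum(model.get(bg, unseen_penalty) for bg in bigrams) / len(bigrams)

model = train_bigrams(open("english_sample.txt", encoding="utf-8").read())  # placeholder file
print(avg_bigram_logprob("thequickbrownfoxjumps", model))   # relatively high
print(avg_bigram_logprob("xqzjwvkxqpztqjx", model))         # much lower
```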
Check N-gram language model.
See http://en.wikipedia.org/wiki/N-gram
Sphinx probably won't help you. Try Rabin-Karp algorithm. It is awful for standard search but should work well for this particular problem. Basically, you'll want to have a dictionary of English words and will want to search with it. Overly large dictionaries will still be pretty slow, but if you use a small dictionary for common words and switch to a big one only when you hit common words, you probably still won't get too many false negatives.
Why not store your wordlist in a trie? Then you iterate through the input looking for matching words in the trie; this can be done very efficiently. If you find one, advance to the end of the word and continue.
It depends on what accuracy you want, how efficient you need it to be, and what kind of text you are processing.
