English word capture - ruby

I have a big text file which contains many English words. However it contains German and French words as well. I need to capture all English words in it.
I reckon, firstly I read all file from the disk and convert it into an array, second I match the all words against unix English word dictionary like here, yet it is not a good solution because of the size of each file. If I do in that way, complexity will be high, and I don't want that.
Do you have any idea how I can do it with Ruby in a simple way?

First thing you could do is to put the english dictionary to a set (instead of array). This way, the lookup is O(1) and overall complexity is O(N) instead of O(NxM).

Related

Algorithm to search for a list of words in a text

I have a list of words, fairly small about 1000 or so. I want to check if any of the words in that list occur in an input text. If so I would like know which ones occur. The input text is a few hundred words each and these are text paragraphs from the web - meaning there a lot of them from different sites. I am trying to find the best algorithm for it.
I can see two obvious ways to do this --
A brute force way of searching for each word from the list in the text.
Create a hash table of words from the input text and then search for each word from the list in the hash table. This is fast.
Is there a better solution?
I am using python though I am not sure if that changes the algorithm anyway.
Also as an optimization to the solution 2 above, I would like to store the hash table generated to persistent storage (DB) so that if the list of words changes I can re-use the hash table without having to create it again. Of course if the input text changes I have to generate the hash table. Is it possible to save a hash table to a DB? Any recommendations? I am currently using MongoDB for my project and I can only store json documents in it. I am a new to MongoDB and have only just started working with it and still do not fully understand the full potential of it.
I have searched SO and see two questions along similar lines and one of them suggests a hash table but I would like to get any pointers towards the optimization I have in mind.
Here are the previously asked questions on SO -
Is there an efficient algorithm to perform inverted full text search?
Searching a large list of words in another large list
EDIT: I just found another question on SO which is about the same problem.
Algorithm for multiple word matching in text
I guess there is no better solution than a hash table. But I would really like to optimize it so that changes to the word list can let me run the algorithm on all the text I have stored up quickly. Should I change the tags added to the question to also include some database technologies?
There is a better solution than a hash table. If you have a fixed set of words that you want to search for over a large body of text, the way you do it is with the Aho-Corasick string matching algorithm.
The algorithm builds a state machine from the words you want to search, and then runs the input text through that state machine, outputting matches as they're found. Because it takes some amount of time to build the state machine, the algorithm is best suited for searching very large bodies of text.
You can do something similar with regular expressions. For example, you might want to find the words "dog", "cat", "horse", and "skunk" in some text. You can build a regular expression:
"dog|cat|horse|skunk"
And then run a regular expression match on the text. How you get all matches will depend on your particular regular expression library, but it does work. For very large lists of words, you'll want to write code that reads the words and generates the regex, but it's not terribly difficult to do and it works quite well.
There is a difference, though, in the results from a regex and the results from the Aho-Corasick algorithm. For example if you're searching for the words "dog" and "dogma" in the string "My karma ate your dogma." The regex library search will report finding "dogma". The Aho-Corasick implementation will report finding "dog" and "dogma" at the same position.
If you want the Aho-Corasick algorithm to report whole words only, you have to modify the algorithm slightly.
Regex, too, will report matches on partial words. That is, if you're searching for "dog", it will find it in "dogma". But you can modify the regex to only give whole words. Typically, that's done with the \b, as in:
"\b(cat|dog|horse|skunk)\b"
The algorithm you choose depends a lot on how large the input text is. If the input text isn't too large, you can create a hash table of the words you're looking for. Then go through the input text, breaking it into words, and checking the hash table to see if the word is in the table. In pseudo code:
hashTable = Build hash table from target words
for each word in input text
if word in hashTable then
output word
Or, if you want a list of matching words that are in the input text:
hashTable = Build hash table from target words
foundWords = empty hash table
for each word in input text
if word in hashTable then
add word to foundWords

Algorithm to reform a sentence from sentence whose spaces are removed and alphabets of words are reordered?

I was looking around some puzzles online to improve my knowledge on algorithms...
I came upon below question:
"You have a sentence with several words with spaces remove and words having their character order shuffled. You have a dictionary. Write an algorithm to produce the sentence back with spaces and words with normal character order."
I do not know what is good way to solve this.
I am new to algorithms but just looking at problem I think I would make program do what an intellectual mind would do.
Here is something I can think of:
-First find out manually common short english words from dictionary like "is" "the" "if" etc and put in dataset-1.
-Then find out permutation of words in dataset1 (eg "si", "eht" or "eth" or "fi") and put in dataset-2
-then find out from input sentense what character sequence matches the words of dataset2 and put them in dataset-3 and insert space in input sentence instead of those found.
-for rest of the words i would perform permutations to find out word from dictionary.
I am newbie to algorithms...is it a bad solution?
this seems like a perfectly fine solution,
In general there are 2 parameters for judging an algorithm.
correctness - does the algorithm provide the correct answer.
resources - the time or storage size needed to provide an answer.
usually there is a tradeoff between these two parameters.
so for example the size of your dictionary dictates what scrambled sentences you may
reconstruct, giving you a correct answer for more inputs,
however the whole searching process would take longer and would require more storage.
The hard part of the problem you presented is the fact that you need to compute permutations, and there are a LOT of them.
so checking them all is expensive, a good approach would be to do what you suggested, create a small subset of commonly used words and check them first, that way the average case is better.
note: just saying that you check the permutation/search is ok, but in the end you would need to specify the exact way of doing that.
currently what you wrote is an idea for an algorithm but it would not allow you to take a given input and mechanically work out the output.
Actually, might be wise to start by partitioning the dictionary by word length.
Then try to find the largest words that can be made using the letters avaliable, instead of finding the smallest ones. Short words are more common and thus will be harder to narrow down. IE: is it really "If" or "fig".
Then for each word length w, you can proceed w characters at a time.
There are still a lot of possible combinations though, simply because you found a valid word, doesn't mean it's the right word. Once you've gone through all the substrings, of which there should be something like O(c^4*d) where d is the number of words in the dictionary and c is the number of characters in the sentence. Practically speaking if the dictionary is sorted by word length, it'll be a fair bit less than that. Then you have to take the valid words, and figure out an ordering that works, so that all characters are used. There might be multiple solutions.

Creating a suggested words algorithm

I'm designing a cool spell checker (I know I know, modern browsers already have this), anyway, I am wondering what kind of effort would it take to develop a fairly simple but decent suggest-word algorithm.
My idea is that I would first look through the misspelled word's characters and count the amount of characters it matches in each word in the dictionary (sounds resources intensive), and then pick the top 5 matches (so if the misspelled word matches the most characters with 7 words from the dictionary, it will randomly display 5 of those words as suggested spelling).
Obviously to get more advanced, we would look at "common words" and have a dictionary file that is numbered with 'frequency of that word used in English language' ranking. I think that's taking it a bit overboard maybe.
What do you think? Anyone have ideas for this?
First of all you will have to consider the complexity in finding the "nearer" words to the misspelled word. I see that you are using a dictionary, a hash table perhaps. But this may not be enough. The best and cooler solution here is to go for a TRIE datastructure. The complexity of finding these so called nearer words will take linear order timing and it is very easy to exhaust the tree.
A small example
Take the word "njce". This is a level 1 example where one word is clearly misspelled. The obvious suggestion expected would be nice. The first step is very obvious to see whether this word is present in the dictionary. Using the search function of a TRIE, this could be done O(1) time, similar to a dictionary. The cooler part is finding the suggestions. You would obviously have to exhaust all the words that start with 'a' to 'z' that has words like ajce bjce cjce upto zjce. Now to find the occurences of this type is again linear depending on the character count. You should not carried away by multiplying this number with 26 the length of words. Since TRIE immediately diminishes as the length grows. Coming back to the problem. Once that search is done for which no result was found, you go the next character. Now you would be searching for nace nbce ncce upto nzce. In fact you wont have explore all the combinations as the TRIE data structure by itself will not be having the intermediate characters. Perhaps it will have na ni ne no nu characters and the search space becomes insanely simple. So are the further occurrences. You could develop on this concept further based on second and third order matches. Hope this helped.
I'm not sure how much of the wheel you're trying to reinvent, so you may want to check out Lucene.
Apache Lucene Coreā„¢ (formerly named Lucene Java), our flagship sub-project, provides a Java-based indexing and search implementation, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.

Algorithm wanted: Find all words of a dictionary that are similar to words in a free text

We have a list of about 150,000 words, and when the user enters a free text, the system should present a list of words from the dictionary, that are very close to words in the free text.
For instance, the user enters: "I would like to buy legoe toys in Walmart". If the dictionary contains "Lego", "Car" and "Walmart", the system should present "Lego" and "Walmart" in the list. "Walmart" is obvious because it is identical to a word in the sentence, but "Lego" is similar enough to "Legoe" to be mentioned, too. However, nothing is similar to "Car", so that word is not shown.
Showing the list should be realtime, meaning that when the user has entered the sentence, the list of words must be present on the screen. Does anybody know a good algorithm for this?
The dictionary actually contains concepts which may include a space. For instance, "Lego spaceship". The perfect solution recognizes these multi-word concepts, too.
Any suggestions are appreciated.
Take a look at http://norvig.com/spell-correct.html for a simple algorithm. The article uses Python, but there are links to implementations in other languages at the end.
You will be doing quite a few lookups of words against a fixed dictionary. Therefore you need to prepare your dictionary. Logically, you can quickly eliminate candidates that are "just too different".
For instance, the words car and dissimilar may share a suffix, but they're obviously not misspellings of each other. Now why is that so obvious to us humans? For starters, the length is entirely different. That's an immediate disqualification (but with one exception - below). So, your dictionary should be sorted by word length. Match your input word with words of similar length. For short words that means +/- 1 character; longer words should have a higher margin (exactly how well can your demographic spell?)
Once you've restricted yourself to candidate words of similar length, you'd want to strip out words that are entirely dissimilar. With this I mean that they use entirely different letters. This is easiest to compare if you sort the letters in a word alphabetically. E.g. car becomes "acr"; rack becomes "ackr". You'll do this in preprocessing for your dictionary, and for each input word. The reason is that it's cheap to determine the (size of an) difference of two sorted sets. (Add a comment if you need explanation). car and rack have an difference of size 1, car and hat have a difference of size 2. This narrows down your set of candidates even further. Note that for longer words, you can bail out early when you've found too many differences. E.g. dissimilar and biography have a total difference of 13, but considering the length (8/9) you can probably bail out once you've found 5 differences.
This leaves you with a set of candidate words that use almost the same letters, and also are almost the same length. At this point you can start using more refined algorithms; you don't need to run 150.000 comparisons per input word anymore.
Now, for the length exception mentioned before: The problem is in "words" like greencar. It doesn't really match a word of length 8, and yet for humans it's quite obvious what was meant. In this case, you can't really break the input word at any random boundary and run an additional N-1 inexact matches against both halves. However, it is feasible to check for just a missing space. Just do a lookup for all possible prefixes. This is efficient because you'll be using the same part of the dictionary over and over, e.g. g gr, gre, gree, etc. For every prefix that you've found, check if the remaining suffixis also in the dictionery, e.g. reencar, eencar. If both halves of the input word are in the dictionary, but the word itself isn't, you can assume a missing space.
You would likely want to use an algorithm which calculates the Levenshtein distance.
However, since your data set is quite large, and you'll be comparing lots of words against it, a direct implementation of typical algorithms that do this won't be practical.
In order to find words in a reasonable amount of time, you will have to index your set of words in some way that facilitates fuzzy string matching.
One of these indexing methods would be to use a suffix tree. Another approach would be to use n-grams.
I would lean towards using a suffix tree since I find it easier to wrap my head around it and I find it more suited to the problem.
It might be of interest to look at a some algorithms such as the Levenshtein distance, which can calculate the amount of difference between 2 strings.
I'm not sure what language you are thinking of using but PHP has a function called levenshtein that performs this calculation and returns the distance. There's also a function called similar_text that does a similar thing. There's a code example here for the levenshtein function that checks a word against a dictionary of possible words and returns the closest words.
I hope this gives you a bit of insight into how a solution could work!

Algorithm that Generates Unique Serial Number for Each English Word

For an application I need to generate unique serial numbers for each English word.
What would be the best approach?
One constraint is serial number generation algorithm should be very effective in an ordinary desktop computer.
Thanks
Do you have a list of all possible words? If yes, start from 0 at the first word and increment the serial by 1 for each word.
If not then a simple way to guarantee they are unique is to use the word itself as the serial. For example, ABC = 0x41 0x42 0x43 = 4276803.
As suggested in the comments there are other ways (that however require more work), such as compressing the words first with, for example, Huffman.
This of course gets awkward with long words: The serial of Pneumonoultramicroscopicsilicovolcanoconiosis would require around 100 digits, for example.
Otherwise you can use a hash, but there is no guarantee it will be unique for all English words.
You appear to be asking about a perfect hashing function. If so, take a look at this Wikipedia article, and at the gperf utility.
Here is an algorithm (in python) that allows you to code and decode any combination of lowercase letters:
def encode(s):
r = 1
for i in len(s):
r = r * 26 + (ord(s[i]) - ord('a'))
return r
Using 64 bits you can code up to 12 letter words. You can use the remaining unused serials as in index to a table containing low-frequency very long words.
Just use a 64-bit hash function, like Fowler-Noll-Vo. You're not likely to get collisions using a 64-bit integer, as this gives you 2^64 possible values, and there are certainly way less than that many words in the English language. You'd need to normalize each word, of course, (convert to lower-case, etc.)
Do you really need it to be 'serial'? if not - did you try to use the various hash algorithms? Several of them are built into .NET (MD5 and SHA1 if I remember correctly). I am not sure which one will be good enough especially with short strings
Are you looking for every word, or every word in the English dictionary? Are you using standard words - i.e. from the Oxford English Dictionary or are slang words included too? I guess what I'm getting at is: "How big is your dictionary"? You could use an MD5 hash which has a theoretical possibility of collisions - albeit 1 in billions of hashes that may collide - although, I can't say I'd understand the purpose of using a hash over using the actual word. Unless perhaps you're wanting to calculate the serial client side so that it's referencing a correct dictionary item on the server side without having to parse the dictionary looking for its serial. Of course - the word obviously has to be sufficiently unique in order for us to understand it as humans, and we're way more efficient at parsing the meaning of words than a computer is at doing the same.
Are you looking to separate words that look the same but are pronounced differently? Words that look and sound the same but have different meanings? If so, then you're going to come unstuck with a hash, as the same spelling with a different semantic will produce the same hash, so it won't work for this scenario. In this case you'd need some kind of incremental system. If you add words after the fact to the dictionary, will they be added at the end and just given the next serial number in sequence? What if that word is spelled the same as another word but sounds different or sounds the same but has a different semantic? What then?
I guess it depends on the purpose of the serialization as to what would be the most suitable output for your serial number and hence what would be the most efficient algorithm.
The most efficient algorithm would probably be to split your dictionary into the same number of chunks as you have processors and have a thread on each processor serialize the words in its chunk recombining the output from each thread at the end. This (in theory) would work at a speed slightly slower than O(n/number of processors) in real world performance, however I think for mathematical correctness that's still O(n) because you still have to parse the whole dictionary once to serialize each word.
I think the safest way to go is:
Worry about what you've got now
Order them in the most logical sequence (alphabetically?)
Number them in sequence
Add new words (whether spelled the same or not and having different semantics) at the end; give them the next number in the sequence, regardless of their rightful place in the dictionary alphabetically.
This way you don't have to worry about leaving spaces in the serial numbers to account for insertions between words, you don't have to worry about reindexing any dependent data to account for changes in indexes when words are inserted, you just carry on as normal. You don't have to worry about collisions, and you still get the most efficient indexing mechanism for storage purposes meaning you're not storing MD5 hashes that are potentially longer than the original word - which makes no sense for real world use.
If you need to access the dictionary alphabetically, just sort by the word, otherwise, don't.
I still think I'm at a loss as to the necessity of serializing the word - except for storage purposes where you can store your dictionary and link tables by the word's key.
I wonder if an answer is even possible.
Are color and colour the same word? Do they get one serial number or two?
Are polish and Polish the same word?
Are watch (noun) and watch (verb) the same word?
Are multiply (verb) and multiply (adverb) the same word?
Analysis (singular noun) and analyses (plural noun) are not the same word. Are analyse (plural verb) and analyze (plural verb) the same word? Are analyses (singular verb) and analyzes (singular verb) the same word? Are analyses (singular verb) and analyses (plural noun) the same word?
Are wont and won't the same word?
Are Beijing and Peking the same word? Or maybe they aren't English, since Londres and Frankreich aren't English, but then what is the English word for the capital of the Middle Country?
About about MD5 hash algorithm. Do something like this:
serialNumber = MD5( ToLower ( english word ) )

Resources