Trouble printing hashtable values in Ruby - ruby

I have made a hash table of English words and their definitions from a text file by parsing the first word of each line as the word to be defined and all words but the first as the definition, using this code:
words = Hash.new
File.open("path/dictionary.txt") do |file|
  file.each do |line|
    n = line.split.size
    definition = line.strip[/(?<=\s).*/]
    words[line.strip.split[0...1]] = definition
  end
end
However, when I try to print a value using code such as this:
p words["Definition"]
It prints 'nil'. I am still able to print the whole hash table using 'p words', so I don't understand why. Any ideas? Thanks
EDIT: Here is the beginning of dictionary.txt to give you an idea of what I'm doing:
A- prefix (also an- before a vowel sound) not, without (amoral). [greek]
Aa abbr. 1 automobile association. 2 alcoholics anonymous. 3 anti-aircraft.
Aardvark n. Mammal with a tubular snout and a long tongue, feeding on termites. [afrikaans]
Ab- prefix off, away, from (abduct). [latin]
Aback adv. take aback surprise, disconcert. [old english: related to *a2]
Abacus n. (pl. -cuses) 1 frame with wires along which beads are slid for calculating. 2 archit. Flat slab on top of a capital. [latin from greek from hebrew]
Abaft naut. —adv. In the stern half of a ship. —prep. Nearer the stern than. [from *a2, -baft: see *aft]
Abandon —v. 1 give up. 2 forsake, desert. 3 (often foll. By to; often refl.) Yield to a passion, another's control, etc. —n. Freedom from inhibitions. abandonment n. [french: related to *ad-, *ban]
Abandoned adj. 1 deserted, forsaken. 2 unrestrained, profligate.
Abase v. (-sing) (also refl.) Humiliate, degrade. abasement n. [french: related to *ad-, *base2]
Abashed predic. Adj. Embarrassed, disconcerted. [french es- *ex-1, baïr astound]
Abate v. (-ting) make or become less strong etc.; diminish. abatement n. [french abatre from latin batt(u)o beat]
Abattoir n. Slaughterhouse. [french abatre fell, as *abate]
Abbacy n. (pl. -ies) office or jurisdiction of an abbot or abbess. [latin: related to *abbot]
Abbé n. (in france) abbot or priest. [french from latin: related to *abbot]
Abbess n. Head of a community of nuns.
Abbey n. (pl. -s) 1 building(s) occupied by a community of monks or nuns. 2 the community itself. 3 building that was once an abbey.

Looks like the keys in your hash are arrays with one element.
Change to
words[line.strip.split[0...1][0]]=definition
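To make the difference concrete, here is a minimal sketch using one line from the dictionary excerpt in the question. The variable names are only illustrative. Indexing an array with a range returns another array, while indexing with an integer returns the element itself:

```ruby
# Illustrative line taken from the dictionary excerpt in the question.
line = "Aardvark n. Mammal with a tubular snout"

array_key  = line.strip.split[0...1]  # a range index returns an Array
string_key = line.strip.split[0]      # an integer index returns the element

words = {}
words[array_key] = "n. Mammal with a tubular snout"

p words["Aardvark"]    # nil, because the stored key is ["Aardvark"]
p words[["Aardvark"]]  # the definition, found under the array key
```

Note also that the keys keep the capitalization used in the file, so a lookup has to use the same spelling.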

Since you do not post the text, it is hard to tell what you want, but I guess this is what you want:
words = {}
File.foreach("path/dictionary.txt") do |line|
  words.store(*line.strip.split(/\s+/, 2))
end
If you have empty lines in the text,
words = {}
File.foreach("path/dictionary.txt") do |line|
  words.store(*line.strip.split(/\s+/, 2)) if line =~ /\S/
end
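As a small illustration of what the split with a limit of 2 produces, here is a sketch with a hypothetical helper, parse_entry (not part of the answer above). It also guards against a line that holds a headword but no definition, in which case the splat would pass only one argument to store:

```ruby
# Hypothetical helper: returns [word, definition], or nil when the line
# is blank or contains a headword without any definition.
def parse_entry(line)
  word, definition = line.strip.split(/\s+/, 2)
  return nil if word.nil? || definition.nil?
  [word, definition]
end

words = {}
["Aback adv. take aback surprise, disconcert.\n", "   \n", "Lonely\n"].each do |line|
  entry = parse_entry(line)
  words.store(*entry) if entry
end

p words["Aback"]  # the rest of the line after the first run of whitespace
```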

Related

How do I convert a spreadsheet "letternamed" column coordinate to an integer?

In spreadsheets I have cells named like "F14", "BE5" or "ALL1". I have the first part, the column coordinate, in a variable and I want to convert it to a 0-based integer column index.
How do I do it, preferably in an elegant way, in Ruby?
I can do it using a brute-force method: I can imagine looping through all the letters, converting them to ASCII and adding them to a result, but I feel there should be something more elegant/straightforward.
Edit: Example: To simplify I do only speak about the column coordinate (letters). Therefore in the first case (F14) I have "F" as the input and I expect the result to be 5. In the second case I have "BE" as input and I expect getting 56, for "ALL" I want to get 999.
Not sure if this is any clearer than the code you already have, but it does have the advantage of handling an arbitrary number of letters:
class String
  def upcase_letters
    self.upcase.split(//)
  end
end
module Enumerable
  # Pair each element with its position counted from the end,
  # so the last letter gets digit position 0.
  def reverse_with_index
    self.to_a.reverse.each_with_index.to_a
  end
  def sum
    self.reduce(0, :+)
  end
end
def indexFromColumnName(column_str)
  start = 'A'.ord - 1
  column_str.upcase_letters.map do |c|
    c.ord - start
  end.reverse_with_index.map do |value, digit_position|
    value * (26 ** digit_position)
  end.sum - 1
end
I've added some methods to String and Enumerable because I thought it made the code more readable, but you could inline these or define them elsewhere if you don't like that sort of thing.
We can use modulo and the length of the input. The last character will
be used to calculate the exact "position", and the remainders to count
how many "laps" we did in the alphabet, e.g.
def column_to_integer(column_name)
  letters = /[A-Z]+/.match(column_name).to_s.split("")
  laps = (letters.length - 1) * 26
  position = ((letters.last.ord - 'A'.ord) % 26)
  laps + position
end
Using the character code (ord) and arithmetic tricks seems a neat
solution at first, but it has some pain points regarding the
implementation. We have magic numbers, 26, and constants 'A'.ord all
over.
One solution is to give our code better knowledge about our domain, i.e.
the alphabet. In that case, we can replace the modulo with the position of
the last character in the alphabet (because it's already sorted in a zero-based array), e.g.
ALPHABET = ('A'..'Z').to_a
def column_to_integer(column_name)
  letters = /[A-Z]+/.match(column_name).to_s.split("")
  laps = (letters.length - 1) * ALPHABET.size
  position = ALPHABET.index(letters.last)
  laps + position
end
The final result:
> column_to_integer('F5')
=> 5
> column_to_integer('AK14')
=> 36
HTH. Best!
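Note that (letters.length - 1) * 26 treats every leading letter as if it were 'A', so 'BE' would come out as 30 rather than the 56 the question expects. A minimal sketch that gives each letter its own weight is shown here. The method name column_index is hypothetical, and the sketch assumes the input contains only the letters A to Z:

```ruby
# Treat the column name as a number written with digits A=1 .. Z=26
# (bijective base 26), then subtract 1 to get a 0-based index.
def column_index(letters)
  one_based = letters.upcase.each_char.reduce(0) do |acc, ch|
    acc * 26 + (ch.ord - "A".ord + 1)
  end
  one_based - 1
end

p column_index("F")    # 5
p column_index("BE")   # 56
p column_index("ALL")  # 999
```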
I have found a particularly neat way to do this conversion:
def index_from_column_name(colname)
  s=colname.size
  (colname.to_i(36)-(36**s-1).div(3.5)).to_s(36).to_i(26)+(26**s-1)/25-1
end
Explanation of why it works
(warning: spoiler ahead ;)). Basically we are doing this
(colname.to_i(36)-('A'*colname.size).to_i(36)).to_s(36).to_i(26)+('1'*colname.size).to_i(26)-1
which in plain English means that we are interpreting colname as a base-26 number. Before we can do that, we need to interpret all A's as 1, B's as 2, etc. If only this were needed, it would be even simpler, namely
(colname.to_i(36) - ('9'*colname.size).to_i(36)).to_s(36).to_i(26)-1
Unfortunately, there are Z characters present, which would need to be interpreted as 10 (base 26), so we need a little trick. We shift every digit by 1 more than needed and then add it back at the end (to every digit in the original colname).

Algorithm to create unique random concatenation of items

I'm thinking about an algorithm that will create X most unique concatenations of Y parts, where each part can be one of several items. For example 3 parts:
part #1: 0,1,2
part #2: a,b,c
part #3: x,y,z
And the (random, one case of some possibilities) result of 5 concatenations:
0ax
1by
2cz
0bz (note that '0by' would be "less unique" than '0bz' because 'by' has already appeared)
2ay (note that 'a' has not yet followed '2', and 'y' has not yet followed 'a')
A simple BAD result for the next concatenation:
1cy ('c' has not followed '1', 'y' has not followed 'c', BUT '1'-'y' has already appeared as a first-last pair)
A simple GOOD next result would be:
0cy ('c' has not followed '0', 'y' has not followed 'c', and '0'-'y' has not appeared as a first-last pair)
1az
1cx
I know that this limits the possible results, but when all fully unique possibilities are exhausted, the algorithm should continue and try to keep the most available uniqueness (repeating as little as possible).
Consider a real example:
Boy/Girl/Martin
bought/stole/get
bottle/milk/water
And I want results like:
Boy get milk
Martin stole bottle
Girl bought water
Boy bought bottle (not water, because of 'bought+water' and not milk, because of 'Boy+milk')
Maybe start with a tree of all combinations, but how to select most unique trees first?
Edit: According to this sample data, we can see that creating fully unique results for 4 words * 3 possibilities gives us only 3 results:
Martin stole a bottle
Boy bought an milk
He get hard water
But more results can be requested. So the 4th result should have the most available uniqueness, like "Martin bought hard milk", not "Martin stole a water".
Edit: A possible start for a solution?
Imagine each part as a barrel which can be rotated: the last item becomes the first when rotating down, and the first becomes the last when rotating up. Now, set up the barrels like this:
Martin|stole |a   |bottle
Boy   |bought|an  |milk
He    |get   |hard|water
Now write down the sentences as we see them, then rotate the first barrel UP once, the second twice, the third three times, and so on. We get these sentences (note that the third barrel did one full rotation):
Boy   |get   |a   |milk
He    |stole |an  |water
Martin|bought|hard|bottle
And we get the next solutions. We can repeat the process one more time to get more solutions:
He    |bought|a   |water
Martin|get   |an  |bottle
Boy   |stole |hard|milk
The problem is that the first barrel will stay connected with the last, because they rotate in parallel.
I'm wondering whether it would be more unique if I rotated the last barrel one more time in the last solution (but then I create other connections like an-water, though this would be repeated only 2 times, not 3 times like now). I don't know whether "barrels" are a good way of thinking here.
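The barrel idea can be sketched in a few lines of Ruby. This is only an illustration of the rotation scheme described above, with a hypothetical method name (rotate_barrels). Barrel number i (counting from 1) is rotated up by i positions per round, and the rows are then read across:

```ruby
# barrels: one array per sentence position; round: how many times the
# rotation step has been applied (0 gives the starting sentences).
def rotate_barrels(barrels, round)
  rotated = barrels.each_with_index.map do |barrel, i|
    barrel.rotate((i + 1) * round)  # Array#rotate moves leading items to the end ("up")
  end
  rotated.transpose  # read the barrels row by row
end

barrels = [
  %w[Martin Boy He],
  %w[stole bought get],
  %w[a an hard],
  %w[bottle milk water]
]

rotate_barrels(barrels, 1).each { |row| puts row.join(" ") }
```

With three items per barrel, the first and fourth barrels always move by the same amount modulo 3, which is the "connected barrels" problem mentioned above.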
I think that we should first find a definition for uniqueness.
For example, what causes uniqueness to drop? Using a word that was already used? Is repeating 2 words close to each other less unique than repeating a word with a gap of other words in between? So, this problem can be subjective.
But I think that across many sequences, each word should be used a similar number of times (like selecting a word randomly and removing it from a set, and, once all words have been taken, refreshing the options so they can be drawn again) - this is easy to do.
But even if each word is used a similar number of times, we should do something to avoid repeating connections between words. I think that repeating words far from each other is more unique than repeating them next to each other.
Anytime you need a new concatenation, just generate a completely random one, calculate its fitness, and then either accept that concatenation or reject it (probabilistically, that is).
const C = 1.0
function CreateGoodConcatenation()
{
    for (rejectionCount = 0; ; rejectionCount++)
    {
        candidate = CreateRandomConcatenation()
        fitness = CalculateFitness(candidate) // returns 0 < fitness <= 1
        r = GetRand(zero to one)
        adjusted_r = Math.pow(r, C * rejectionCount + 1) // bias toward acceptability as rejectionCount increases
        if (adjusted_r < fitness)
        {
            return candidate
        }
    }
}
CalculateFitness should never return zero. If it does, you might find yourself in an infinite loop.
As you increase C, less ideal concatenations are accepted more readily.
As you decrease C, you face increased iterations for each call to CreateGoodConcatenation (plus less entropy in the result)
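A minimal Ruby sketch of this accept/reject loop follows. The fitness function is an assumption made for illustration only: it scores a candidate as 1 / (1 + number of item pairs already used together), so it never returns zero. The names pairs_of, fitness and create_good_concatenation are hypothetical.

```ruby
require "set"

# Every pair of (item, position) entries in a candidate, including first-last.
def pairs_of(candidate)
  candidate.each_with_index.to_a.combination(2).to_a
end

# Assumed fitness: 1.0 when no pair has been seen before, lower otherwise.
# Always in (0, 1], so the loop below cannot spin forever.
def fitness(candidate, seen_pairs)
  repeats = pairs_of(candidate).count { |pair| seen_pairs.include?(pair) }
  1.0 / (1 + repeats)
end

def create_good_concatenation(parts, seen_pairs, c: 1.0, rng: Random.new)
  rejection_count = 0
  loop do
    candidate = parts.map { |items| items.sample(random: rng) }
    # Raising r (in [0, 1)) to a growing power biases toward acceptance.
    adjusted_r = rng.rand**(c * rejection_count + 1)
    if adjusted_r < fitness(candidate, seen_pairs)
      pairs_of(candidate).each { |pair| seen_pairs << pair }
      return candidate
    end
    rejection_count += 1
  end
end

parts = [%w[Boy Girl Martin], %w[bought stole get], %w[bottle milk water]]
seen = Set.new
5.times { puts create_good_concatenation(parts, seen).join(" ") }
```

The seen_pairs set carries the history between calls, so later calls are steered away from pairs that earlier sentences already used.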

Given a word, how do I get the list of all words, that differ by one letter?

Let's say I have the word "CAT". These words differ from "CAT" by one letter (not the full list)
CUT
CAP
PAT
FAT
COT
etc.
Is there an elegant way to generate this? Obviously, one way to do it is through brute force.
Pseudocode:
while (0 to length of word)
    while (A to Z)
        replace one letter at a time, and check if the resulting word is a valid word
If I had a 10 letter word, the loop would run 26 * 10 = 260 times.
Is there a better, elegant way to do this?
Given a list of words, for example with
words = set(line.strip().lower() for line in open('/usr/share/dict/words'))
you can build an index of "wildcarded" words, where you replace each character of the word with a wildcard (say "?"), so that for example "gat" and "fat" both get indexed to "?at":
def wildcard(s, idx):
    return s[:idx] + '?' + s[idx+1:]
def wildcarded(s):
    for idx in range(len(s)):
        yield wildcard(s, idx)
# list(wildcarded('cat')) returns ['?at', 'c?t', 'ca?']
from collections import defaultdict
index = defaultdict(list)
for word in words:
    for w in wildcarded(word):
        index[w].append(word)
Now if you want to look for all the words that differ by one letter from "cat", just look for "?at", "c?t" and "ca?" and concatenate the results:
def near_words(word):
    ret = []
    for w in wildcarded(word):
        ret += index[w]
    return ret
print(near_words('cat'))
# outputs ['cat', 'bat', 'zat', 'jat', 'kat', 'rat', 'sat', 'pat', 'hat', 'oat', 'gat', 'vat', 'nat', 'fat', 'lat', 'wat', 'eat', 'yat', 'mat', 'tat', 'cat', 'cut', 'cot', 'cit', 'cay', 'car', 'cap', 'caw', 'cat', 'can', 'cam', 'cal', 'cad', 'cab', 'cag']
print(near_words('stack'))
# outputs ['stack', 'stack', 'smack', 'spack', 'slack', 'snack', 'shack', 'swack', 'stuck', 'stack', 'stick', 'stock', 'stank', 'stack', 'stark', 'stauk', 'stalk', 'stack']
If the maximum word length is L and the number of words is N, the index is made of O(NL) pointers, while the lookup algorithm runs in time O(L + number of results).
If you want to look for all the words that differ by K letters instead of 1 this approach doesn't generalize well, but it is a very hard problem in full generality (it is the problem of finding neighbors in Hamming spaces).
Work out what your performance requirements really are.
Implement it exactly as you described it above.
Time it, and see if you meet those requirements already.
Optimise only if required (and I am willing to bet it isn't required, because 260 look-ups in a hash table of words that fit in RAM isn't that slow.)
The size of a dictionary for a human language and the word length are tiny (~10**5 and ~100), therefore a brute-force approach will do unless measurements show otherwise in your case:
#!/usr/bin/env python
import string
ALL_WORDS = set(open('/usr/share/dict/words').read().lower().split())
ALPHABET = string.ascii_lowercase
def known(words): return set(w for w in words if w in ALL_WORDS)
def one_letter(word):
    # http://norvig.com/spell-correct.html
    splits = ((word[:i], word[i:]) for i in range(len(word) + 1))
    replaces = (a + c + b[1:] for a, b in splits for c in ALPHABET if b)
    return set(replaces)
from pprint import pprint
pprint(known(one_letter("cat")))
Output
set(['bat',
'cab',
'cad',
'cal',
'cam',
'can',
'cap',
'car',
'cat',
'caw',
'cot',
'cut',
'eat',
'fat',
'hat',
'mat',
'nat',
'oat',
'pat',
'rat',
'sat',
'tat',
'vat'])
You'll need a dictionary of valid words to check against, otherwise the problem is going to generate "strings" rather than "words". There are many available for free online, or if you're on Linux most distros ship with dictionary files in /usr/share/dict/.
There are two approaches to take:
For each letter in the word, replace it with all other 25 characters and check if it's in the dictionary. Use a hashtable to store the dictionary words for efficient querying. You only need to populate the hashtable with words of the same length as your search word. This will be O(MN + 25N) = O(MN), where M is the number of words of length N in your dictionary and N is the length of your word.
For each dictionary word that is the same length as your search word, check how many characters differ. This will be O(MN).
Although both fall into the same complexity class, the latter drops the O(25N) term and overhead associated with a hashtable.
For: l = word length, w = number of words in wordlist:
Your algorithm is O(l.(l log w)) for a tree wordlist, plus the cost of constructing the wordlist in the first place (which is O(w log (w))) (I assume a tree here, you can redo this with a hash if you like).
This is O(l.w)
As another answer already suggests, you don't care that the word has an a, b or z in place of the character you want to change, you just care that it's not the letter that you started with. So test the one combination you don't want, rather than all of the combinations that would do.
So:
for(each candidate word from the wordlist) {
    difference = 0
    for(each letter in your original word) {
        does it match? if not, difference++
    }
    if difference = 1, store the candidate word as a solution
}
Now, you're going to argue that you're looking at 78 comparisons versus thousands, but that's not accurate: in order to make use of a wordlist to see if a candidate is available, your method involves creating a content-addressed structure (a tree or hash) before you even start, plus the lookups into the hash once you're running. The solution above also allows you to read the wordlist file once per word under test (without having to hold it in memory for rescanning). Your solution is probably faster for doing this on many words at once, but the above is better for a single word lookup, and more memory efficient in every case.
Credit to other answers for the 'count the difference' method of spotting word differences...
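The 'count the difference' loop can be sketched in Ruby as follows. The method names differs_by_one? and near_words are hypothetical, and the word list is assumed to be any enumerable of strings (for example the lines of a dictionary file):

```ruby
# True when a and b have the same length and differ in exactly one position.
def differs_by_one?(a, b)
  return false unless a.length == b.length
  a.chars.zip(b.chars).count { |x, y| x != y } == 1
end

# Scan the word list once and keep the candidates one letter away from word.
def near_words(word, wordlist)
  wordlist.select { |candidate| differs_by_one?(word, candidate) }
end

p near_words("cat", %w[cat cut cap pat dog cart])
```

Because it only needs one pass over the candidates, this works on a list streamed from a file just as well as on one held in memory.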
Either way, you'll need to iterate through all the letters to check it. But another approach would be to check the dictionary for words that correspond to the masks ?AT, C?T, CA? (where ? can be any symbol).
If the strings always match in length one way would be to remove one letter at a time and compare result on both strings, 10 chars would be 10 loops.
regards,
/t
Iterate over the word list and, for each word, count the differing letters. If the count becomes greater than 1, go to the next word.
Faster solution, if the dictionary is static and there are plenty of words to check: create a matrix of letters. Rows are the first letter of the word, columns are the second letter. Each cell is the list of words that begin with the given first and second letters. When you want to find words similar to a given word, iterate through just one row and just one column. Outside the intersecting cell, all other letters of every iterated word must match. In the intersecting cell, exactly one letter must be different.
If you really want to optimise the run-time (and I still say you probably don't need to in any reasonable performance situation) then go through the dictionary once, and run your algorithm on every word.
Create a map from the corrupted words to a list of each of the corresponding correctly spelled words.
I estimate that 20,000 words, processed at a rate of at least 30 per second, will take about 11 minutes.
Store the resulting hash table on disk, and load it into memory when required. Then perform the processing by simply looking up the input words in the hash table and find the corresponding list of correctly-spelled words.
Memory intensive, but super-fast - and if you are worried about the performance of 260 look-ups, you must be dealing with tens of thousands of words, and a solution like this is probably the best you'll get.

Ruby, Count syllables

I am using Ruby to calculate the Gunning Fog Index of some content that I have. I can successfully implement the algorithm described here:
Gunning Fog Index
I am using the below method to count the number of syllables in each word:
Tokenizer = /([aeiouy]{1,3})/
def count_syllables(word)
  len = 0
  if word[-3..-1] == 'ing' then
    len += 1
    word = word[0...-3]
  end
  got = word.scan(Tokenizer)
  len += got.size()
  if got.size() > 1 and got[-1] == ['e'] and
      word[-1].chr() == 'e' and
      word[-2].chr() != 'l' then
    len -= 1
  end
  return len
end
It sometimes picks up words with only 2 syllables as having 3 syllables. Can anyone give any advice or is aware of a better method?
text = "The word logorrhoea is often used pejoratively to describe prose that is highly abstract and contains little concrete language. Since abstract writing is hard to visualize, it often seems as though it makes no sense and all the words are excessive. Writers in academic fields that concern themselves mostly with the abstract, such as philosophy and especially postmodernism, often fail to include extensive concrete examples of their ideas, and so a superficial examination of their work might lead one to believe that it is all nonsense."
# used to get rid of any punctuation
# (gsub rather than gsub!, which returns nil when nothing is replaced)
text = text.gsub(/\W+/, ' ')
word_array = text.split(' ')
word_array.each do |word|
  puts word if count_syllables(word) > 2
end
"themselves" is being counted as 3 but it's only 2
The function I gave you before is based upon these simple rules outlined here:
Each vowel (a, e, i, o, u, y) in a word counts as one syllable subject to the following sub-rules:
Ignore final -ES, -ED, -E (except for -LE)
Words of three letters or less count as one syllable
Consecutive vowels count as one syllable.
Here's the code:
def new_count(word)
  word.downcase!
  return 1 if word.length <= 3
  word.sub!(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '')
  word.sub!(/^y/, '')
  word.scan(/[aeiouy]{1,2}/).size
end
Obviously, this isn't perfect either, but all you'll ever get with something like this is a heuristic.
EDIT:
I changed the code slightly to handle a leading 'y' and fixed the regex to handle 'les' endings better (such as in "candles").
Here's a comparison using the text in the question:
# used to get rid of any punctuation
# (gsub rather than gsub!, which returns nil when nothing is replaced)
text = text.gsub(/\W+/, ' ')
words = text.split(' ')
words.each do |word|
  old = count_syllables(word.dup)
  new = new_count(word.dup)
  puts "#{word}: \t#{old}\t#{new}" if old != new
end
The output is:
logorrhoea: 3 4
used: 2 1
makes: 2 1
themselves: 3 2
So it appears to be an improvement.
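Since new_count modifies its argument in place (hence the word.dup calls above), a non-mutating variant may be more convenient. This is a sketch of the same heuristic under a hypothetical name, count_syllables_heuristic, and it leaves the caller's string untouched:

```ruby
# Same rules as new_count: short words count as one syllable, a final
# -es/-ed/-e is ignored (except after "l"), and runs of one or two
# vowels count as one syllable each.
def count_syllables_heuristic(word)
  w = word.downcase
  return 1 if w.length <= 3
  w = w.sub(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, "").sub(/^y/, "")
  w.scan(/[aeiouy]{1,2}/).size
end

%w[themselves makes used candles].each do |word|
  puts "#{word}: #{count_syllables_heuristic(word)}"
end
```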
One thing you ought to do is teach your algorithm about diphthongs. If I'm reading your code correctly, it would incorrectly flag "aid" as having two syllables.
You can also add "es" and the like to your special-case endings (you already have "ing") and just not count it as a syllable, but that might still result in some miscounts.
Finally, for best accuracy, you should convert your input to a spelling scheme or alphabet that has a definite relationship to the word's pronunciation. With your "themselves" example, the algorithm has no reliable way to know that the "e" in "ves" is silent. However, if you respelled it as "themselvz", or taught the algorithm the IPA and fed it [ðəmsɛlvz], it becomes very clear that the word is only pronounced with two syllables. That, of course, assumes you have control over the input, and is probably more work than just counting the syllables yourself.
To begin with it seems you should decrement len for the suffixes that should be excluded.
len -= 1 if /(?:ing|es|ed)$/.match(word)
You could also check out Lingua::EN::Readability.
It can also calculate several readability measures, such as a Fog Index and a Flesch-Kincaid level.
PS. I think I know where you got the function from. DS.
There is also a rubygem called Odyssey that calculates Gunning Fog, along with some of the other popular ones (Flesch-Kincaid, SMOG, etc.)

Good algorithm and data structure for looking up words with missing letters?

I need to write an efficient algorithm for looking up words with missing letters in a dictionary and I want the set of possible words.
For example, if I have th??e, I might get back "these", "those", "theme", "there", etc.
There will be up to TWO question marks and when two question marks do occur, they will occur in sequence.
I was wondering if anyone can suggest some data structures or algorithm I should use.
A Trie is too space-inefficient and would make it too slow. Any other ideas or modifications?
Currently I am using 3 hash tables for when it is an exact match, 1 question mark, and 2 question marks.
Given a dictionary, I hash all the possible words. For example, if I have the word WORD, I hash WORD, ?ORD, W?RD, WO?D, WOR?, ??RD, W??D, and WO?? into the dictionary. Then I use a linked list to chain the collisions together. So say hash(W?RD) = hash(STR?NG) = 17. hashtab(17) will point to WORD, and WORD points to STRING because it is a linked list.
The timing on average lookup of one word is about 2e-6s. I am looking to do better, preferably on the order of 1e-9. It took 0.5 seconds for 3m entries insertions and it took 4 seconds for 3m entries lookup.
I believe in this case it is best to just use a flat file where each word stands in one line. With this you can conveniently use the power of a regular expression search, which is highly optimized and will probably beat any data structure you can devise yourself for this problem.
Solution #1: Using Regex
This is working Ruby code for this problem:
def query(str, data)
  r = Regexp.new("^#{str.gsub("?", ".")}$")
  idx = 0
  begin
    idx = data.index(r, idx)
    if idx
      yield data[idx, str.size]
      idx += str.size + 1
    end
  end while idx
end
start_time = Time.now
query("?r?te", File.read("wordlist.txt")) do |w|
  puts w
end
puts Time.now - start_time
The file wordlist.txt contains 45425 words (downloadable here). The program's output for query ?r?te is:
brute
crate
Crete
grate
irate
prate
write
wrote
0.013689
So it takes under 14 milliseconds to both read the whole file and find all matches in it. And it scales very well for all kinds of query patterns, even where a Trie is very slow:
query ????????????????e
counterproductive
indistinguishable
microarchitecture
microprogrammable
0.018681
query ?h?a?r?c?l?
theatricals
0.013608
This looks fast enough for me.
Solution #2: Regex with Prepared Data
If you want to go even faster, you can split the wordlist into strings that contain words of equal lengths and just search the correct one based on your query length. Replace the last 5 lines with this code:
def query_split(str, data)
  query(str, data[str.length]) do |w|
    yield w
  end
end
# prepare data
data = Hash.new("")
File.read("wordlist.txt").each_line do |w|
  data[w.length-1] += w
end
# use prepared data for query
start_time = Time.now
query_split("?r?te", data) do |w|
  puts w
end
puts Time.now - start_time
Building the data structure now takes about 0.4 seconds, but all queries are about 10 times faster (depending on the number of words with that length):
?r?te 0.001112 sec
?h?a?r?c?l? 0.000852 sec
????????????????e 0.000169 sec
Solution #3: One Big Hashtable (Updated Requirements)
Since you have changed your requirements, you can easily expand on your idea to use just one big hashtable that contains all precalculated results. But instead of working around collisions yourself you could rely on the performance of a properly implemented hashtable.
Here I create one big hashtable, where each possible query maps to a list of its results:
def create_big_hash(data)
  h = Hash.new do |h,k|
    h[k] = Array.new
  end
  data.each_line do |l|
    w = l.strip
    # add all words with one ?
    w.length.times do |i|
      q = String.new(w)
      q[i] = "?"
      h[q].push w
    end
    # add all words with two ??
    (w.length-1).times do |i|
      q = String.new(w)
      q[i, 2] = "??"
      h[q].push w
    end
  end
  h
end
# prepare data
t = Time.new
h = create_big_hash(File.read("wordlist.txt"))
puts "#{Time.new - t} sec preparing data\n#{h.size} entries in big hash"
# use prepared data for query
t = Time.new
h["?ood"].each do |w|
  puts w
end
puts (Time.new - t)
Output is
4.960255 sec preparing data
616745 entries in big hash
food
good
hood
mood
wood
2.0e-05
The query performance is O(1), it is just a lookup in the hashtable. The time 2.0e-05 is probably below the timer's precision. When running it 1000 times, I get an average of 1.958e-6 seconds per query. To get it faster, I would switch to C++ and use the Google Sparse Hash which is extremely memory efficient, and fast.
Solution #4: Get Really Serious
All above solutions work and should be good enough for many use cases. If you really want to get serious and have lots of spare time on your hands, read some good papers:
Tries for Approximate String Matching - If well implemented, tries can have very compact memory requirements (50% less space than the dictionary itself), and are very fast.
Agrep - A Fast Approximate Pattern-Matching Tool - Agrep is based on a new efficient and flexible algorithm for approximate string matching.
Google Scholar search for approximate string matching - More than enough to read on this topic.
Given the current limitations:
There will be up to 2 question marks
When there are 2 question marks, they appear together
There are ~100,000 words in the dictionary, average word length is 6.
I have two viable solutions for you:
The fast solution: HASH
You can use a hash whose keys are your words with up to two '?', and whose values are lists of fitting words. This hash will have around 100,000 + 100,000*6 + 100,000*5 = 1,200,000 entries (if you have 2 question marks, you just need to find the place of the first one...). Each entry can save a list of words, or a list of pointers to the existing words. If you save a list of pointers, and we assume that there are on average fewer than 20 words matching each word with two '?', then the additional memory is less than 20 * 1,200,000 = 24,000,000 pointers.
If each pointer is 4 bytes, then the memory requirement here is (24,000,000+1,200,000)*4 bytes = 100,800,000 bytes ~= 96 megabytes.
To sum up this solution:
Memory Consumption: ~96 MB
Time for each search: calculating a hash function, and following a pointer. O(1)
Note: if you want to use a hash of a smaller size, you can, but then it is better to save a balanced search tree in each entry instead of a linked list, for better performance.
The space savvy, but still very fast solution: TRIE variation
This solution uses the following observation:
If the '?' signs were at the end of the word, trie would be an excellent solution.
The search in the trie would search at the length of the word, and for the last couple of letters, a DFS traversal would bring all of the endings.
Very fast, and very memory-savvy solution.
So let's use this observation to build something that works exactly like this.
You can think about every word you have in the dictionary, as a word ending with # (or any other symbol that does not exist in your dictionary).
So the word 'space' would be 'space#'.
Now, if you rotate each of the words, with the '#' sign, you get the following:
space#, pace#s, ace#sp, *ce#spa*, e#spac
(no # as first letter).
If you insert all of these variations into a TRIE, you can easily find the word you are seeking at the length of the word, by 'rotating' your word.
Example:
You want to find all words that fit 's??ce' (one of them is space, another is slice).
You build the word: s??ce#, and rotate it so that the ? sign is in the end. i.e. 'ce#s??'
All of the rotation variations exist inside the trie, and specifically 'ce#spa' (marked with * above). After the beginning is found, you need to go over all of the continuations of the appropriate length and save them. Then you need to rotate them again so that the # is the last letter, and voilà - you have all of the words you were looking for!
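The rotation trick can be sketched in Ruby as follows. To stay short, this sketch swaps the trie for a plain linear scan over the rotated strings, so it illustrates the rotation logic only, not the trie's speed. It also keeps the rotation that starts with '#' (unlike the list above) so that patterns ending in '?' work. All method names are hypothetical.

```ruby
# All rotations of word + "#", e.g. "space#", "pace#s", ..., "#space".
def rotations(word)
  marked = word + "#"
  (0...marked.length).map { |i| marked[i..-1] + marked[0...i] }
end

# Rotate pattern + "#" so that the run of "?" ends up last.
def rotate_query(pattern)
  marked = pattern + "#"
  last_q = marked.rindex("?")
  return marked if last_q.nil?
  marked[(last_q + 1)..-1] + marked[0..last_q]
end

def find_matches(pattern, words)
  rotated = rotate_query(pattern)
  prefix = rotated.delete("?")  # the part a trie would be walked with
  words.flat_map { |w| rotations(w) }
       .select { |r| r.length == rotated.length && r.start_with?(prefix) }
       .map { |r| i = r.index("#"); r[(i + 1)..-1] + r[0...i] }  # rotate back
       .uniq
end

p rotate_query("s??ce")
p find_matches("s??ce", %w[space slice stack happy])
```

In a real implementation the rotations would be inserted into a trie once, and each query would walk the prefix and then enumerate the subtree instead of scanning every string.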
To sum up this solution:
Memory Consumption:
For each word, all of its rotations appear in the trie. On average, *6 of the memory size is saved in the trie. The trie size is around *3 (just guessing...) of the space saved inside it. So the total space necessary for this trie is 6*3*100,000 = 1,800,000 words ~= 6.8 mega bytes.
Time for each search:
rotating the word: O(word length)
seeking the beginning in the trie: O(word length)
going over all of the endings: O(number of matches)
rotating the endings: O(total length of answers)
To sum up, it is very very fast, and depends on the word length * small constant.
To sum up...
The second choice has a great time/space complexity, and would be the best option for you to use. There are a few problems with the second solution (in which case you might want to use the first solution):
More complex to implement. I'm not sure whether there are programming languages with tries built-in out of the box. If there isn't - it means that you'll need to implement it yourself...
Does not scale well. If tomorrow you decide that you need your question marks spread all over the word, and not necessarily joined together, you'll need to think hard of how to fit the second solution to it. In the case of the first solution - it is quite easy to generalize.
To me this problem sounds like a good fit for a Trie data structure. Enter the entire dictionary into your trie, and then look up the word. For a missing letter you would have to try all sub-tries, which should be relatively easy to do with a recursive approach.
EDIT: I wrote a simple implementation of this in Ruby just now: http://gist.github.com/262667.
A Directed Acyclic Word Graph (DAWG) would be a perfect data structure for this problem. It combines the efficiency of a trie (a trie can be seen as a special case of a DAWG) with much better space efficiency. A typical DAWG takes a fraction of the size that a plain text file with the same words would take.
Enumerating words that meet specific conditions is simple and the same as in trie - you have to traverse graph in depth-first fashion.
Anna's second solution is the inspiration for this one.
First, load all the words into memory and divide the dictionary into sections based on word length.
For each length, make n copies of an array of pointers to the words. Sort each array so that the strings appear in order when rotated by a certain number of letters. For example, suppose the original list of 5-letter words is [plane, apple, space, train, happy, stack, hacks]. Then your five arrays of pointers will be:
rotated by 0 letters: [apple, hacks, happy, plane, space, stack, train]
rotated by 1 letter: [hacks, happy, plane, space, apple, train, stack]
rotated by 2 letters: [space, stack, train, plane, hacks, apple, happy]
rotated by 3 letters: [space, stack, train, hacks, apple, plane, happy]
rotated by 4 letters: [apple, plane, space, stack, train, hacks, happy]
(Instead of pointers, you can use integers identifying the words, if that saves space on your platform.)
To search, just ask how much you would have to rotate the pattern so that the question marks appear at the end. Then you can binary search in the appropriate list.
If you need to find matches for ??ppy, you would have to rotate that by 2 to make ppy??. So look in the array that is in order when rotated by 2 letters. A quick binary search finds that "happy" is the only match.
If you need to find matches for th??g, you would have to rotate that by 4 to make gth??. So look in array 4, where a binary search finds that there are no matches.
This works no matter how many question marks there are, as long as they all appear together.
Space required in addition to the dictionary itself: For words of length N, this requires space for (N times the number of words of length N) pointers or integers.
Time per lookup: O(log n) where n is the number of words of the appropriate length.
Implementation in Python:
import bisect

class Matcher:
    def __init__(self, words):
        # Sort the words into bins by length.
        bins = []
        for w in words:
            while len(bins) <= len(w):
                bins.append([])
            bins[len(w)].append(w)
        # Make n copies of each list, sorted by rotations.
        for n in range(len(bins)):
            bins[n] = [sorted(bins[n], key=lambda w: w[i:]+w[:i]) for i in range(n)]
        self.bins = bins

    def find(self, pattern):
        bins = self.bins
        if len(pattern) >= len(bins):
            return []
        # Figure out which array to search.
        r = (pattern.rindex('?') + 1) % len(pattern)
        rpat = (pattern[r:] + pattern[:r]).rstrip('?')
        if '?' in rpat:
            raise ValueError("non-adjacent wildcards in pattern: " + repr(pattern))
        a = bins[len(pattern)][r]
        if not rpat:
            # Pattern is all wildcards: every word of this length matches.
            return list(a)
        # Binary-search the array, viewing each word in rotated form.
        class RotatedArray:
            def __len__(self):
                return len(a)
            def __getitem__(self, i):
                word = a[i]
                return word[r:] + word[:r]
        ra = RotatedArray()
        start = bisect.bisect(ra, rpat)
        stop = bisect.bisect(ra, rpat[:-1] + chr(ord(rpat[-1]) + 1))
        # Return the matches.
        return a[start:stop]

words = open('/usr/share/dict/words', 'r').read().split()
print("Building matcher...")
m = Matcher(words)  # takes 1-2 seconds, for me
print("Done.")
print(m.find("st??k"))
print(m.find("ov???low"))
On my computer, the system dictionary is 909KB and this program uses about 3.2MB of memory in addition to what it takes just to store the words (pointers are 4 bytes). For this dictionary, you could cut that in half by using 2-byte integers instead of pointers, because there are fewer than 2^16 words of each length.
Measurements: On my machine, m.find("st??k") runs in 0.000032 seconds, m.find("ov???low") in 0.000034 seconds, and m.find("????????????????e") in 0.000023 seconds.
By writing out the binary search instead of using class RotatedArray and the bisect library, I got those first two numbers down to 0.000016 seconds: twice as fast. Implementing this in C++ would make it faster still.
First we need a way to compare the query string with a given entry. Let's assume a function using regexes: matches(query,trialstr).
An O(n) algorithm would be to simply run through every list item (your dictionary would be represented as a list in the program), comparing each to your query string.
With a bit of pre-calculation, you could improve on this for large numbers of queries by building an additional list of words for each letter, so your dictionary might look like:
wordsbyletter = { 'a' : ['aardvark', 'abacus', ... ],
'b' : ['bat', 'bar', ...],
.... }
However, this would be of limited use, particularly if your query string starts with an unknown character. So we can do even better by noting where in a given word a particular letter lies, generating:
wordsmap = { 'a': { 0: ['aardvark', 'abacus'],
                    1: ['bat', 'bar'],
                    2: ['abacus'] },
             'b': { 0: ['bat', 'bar'],
                    1: ['abacus'] },
             ....
           }
As you can see, without using indices, you will end up hugely increasing the amount of required storage space - specifically a dictionary of n words and average length m will require nm^2 of storage. However, you could very quickly now do your look up to get all the words from each set that can match.
The final optimisation (which you could use off the bat on the naive approach) is to also separate all the words of the same length into separate stores, since you always know how long the word is.
This version would be O(kx) where k is the number of known letters in the query word, and x=x(n) is the time to look up a single item in a dictionary of length n in your implementation (usually log(n)).
So with a final dictionary like:
allmap = {
    3 : {
        'a' : {
            0 : ['ant','all'],
            1 : ['bar','pat']
        },
        'b' : {
            0 : ['bar','boy'],
            ...
        }
    },
    4 : {
        'a' : {
            0 : ['ante'],
            ....
Then our algorithm is just:
possiblewords = set()
firsttime = True
wordlen = len(query)
for idx, letter in enumerate(query):
    if letter != '?':
        matchesthisletter = set(allmap[wordlen][letter][idx])
        if firsttime:
            possiblewords = matchesthisletter
            firsttime = False
        else:
            possiblewords &= matchesthisletter
At the end, the set possiblewords will contain all the matching words.
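The answer shows the shape of the index but not how to build it. A minimal sketch, assuming zero-based letter positions and sets rather than lists at the leaves; `build_allmap` and `lookup` are hypothetical names, and the all-wildcards fallback is my addition.

```python
from collections import defaultdict

def build_allmap(words):
    """allmap[length][letter][position] -> set of words."""
    allmap = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for word in words:
        for idx, letter in enumerate(word):
            allmap[len(word)][letter][idx].add(word)
    return allmap

def lookup(allmap, query):
    """Intersect the word sets for every known letter of the query."""
    by_letter = allmap.get(len(query), {})
    possible = None
    for idx, letter in enumerate(query):
        if letter == "?":
            continue
        matches = by_letter.get(letter, {}).get(idx, set())
        possible = set(matches) if possible is None else possible & matches
    if possible is None:
        # No known letters at all: every word of this length matches.
        possible = set()
        for positions in by_letter.values():
            possible |= positions.get(0, set())
    return possible
```

Using `.get` for reads avoids creating empty entries in the nested defaultdicts on every failed lookup.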
If you generate all the possible words that match the pattern (arate, arbte, arcte ... zryte, zrzte) and then look them up in a binary tree representation of the dictionary, that will have average performance of O(26^N1 * log(N2)), where N1 is the number of question marks and N2 is the size of the dictionary. Seems good enough for me but I'm sure it's possible to figure out a better algorithm.
EDIT: If you will have more than say, three question marks, have a look at Phil H's answer and his letter indexing approach.
Assuming you have enough memory, you could build a giant hashmap to provide the answer in constant time. Here is a quick example in Python:
all_words = open("english-words").read().split()
big_map = {}

def populate_map(word):
    # Try every subset of letter positions to replace with '?'.
    for i in range(pow(2, len(word))):
        mask = _bin(i, len(word))
        candidate = list(word)
        for j in range(len(word)):
            if mask[j] == "1":
                candidate[j] = "?"
        key = ''.join(candidate)
        if key in big_map:
            big_map[key].add(word)
        else:
            big_map[key] = set([word])

def _bin(x, width):
    # Binary digits of x, zero-padded to `width` characters.
    return ''.join(str((x >> i) & 1) for i in range(width - 1, -1, -1))

def run():
    for word in all_words:
        populate_map(word)

run()
>>> big_map["y??r"]
set(['your', 'year'])
>>> big_map["yo?r"]
set(['your'])
>>> big_map["?o?r"]
set(['four', 'poor', 'door', 'your', 'hour'])
You can take a look at how its done in aspell. It prompts suggestions of correct word for misspelled words.
Build a hash set of all the words. To find matches, replace the question marks in the pattern with each possible combination of letters. If there are two question marks, a query consists of 26^2 = 676 quick, constant-expected-time hash table lookups.
import itertools

words = set(open("/usr/share/dict/words").read().split())

def query(pattern):
    # Assumes the question marks are adjacent.
    i = pattern.index('?')
    j = pattern.rindex('?') + 1
    for combo in itertools.product('abcdefghijklmnopqrstuvwxyz', repeat=j-i):
        attempt = pattern[:i] + ''.join(combo) + pattern[j:]
        if attempt in words:
            print(attempt)
This uses less memory than my other answer, but it gets exponentially slower as you add more question marks.
If 80-90% accuracy is acceptable, you could manage with Peter Norvig's spell checker. The implementation is small and elegant.
A regex-based solution will consider every possible value in your dictionary. If performance is your largest constraint, an index could be built to speed it up considerably.
You could start with an index per word length that maps each position=character pair to the set of matching words. For length 5 words, for example, 2=r : {write, wrote, drate, arete, arite, group}, 3=o : {wrote, float, group}, etc. To get the possible matches for the original query, say '?ro??', the word sets would be intersected, resulting in {wrote, group} in this case.
This is assuming that the only wildcard will be a single character and that the word length is known up front. If these are not valid assumptions, I can recommend n-gram based text matching, such as discussed in this paper.
The data structure you want is called a trie - see the wikipedia article for a short summary.
A trie is a tree structure where the paths through the tree form the set of all the words you wish to encode - each node can have up to 26 children, one for each possible letter at the next character position. See the diagram in the wikipedia article to see what I mean.
Have you considered using a Ternary Search Tree?
The lookup speed is comparable to a trie, but it is more space-efficient.
I have implemented this data structure several times, and it is a quite straightforward task in most languages.
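As a rough illustration of how a ternary search tree handles the wildcard, here is a small Python sketch. It assumes non-empty words and patterns, and '?' as the single-letter wildcard; `Node`, `insert` and `search` are names I picked, not taken from any library.

```python
class Node:
    """One character, with lower / equal / higher children."""
    __slots__ = ("ch", "lo", "eq", "hi", "is_word")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_word = False

def insert(node, word, i=0):
    """Insert word[i:] below node and return the (possibly new) node."""
    ch = word[i]
    if node is None:
        node = Node(ch)
    if ch < node.ch:
        node.lo = insert(node.lo, word, i)
    elif ch > node.ch:
        node.hi = insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = insert(node.eq, word, i + 1)
    else:
        node.is_word = True
    return node

def search(node, pattern, i=0, prefix=""):
    """Yield stored words matching pattern; '?' matches any one letter."""
    if node is None:
        return
    ch = pattern[i]
    # A wildcard has to explore all three branches.
    if ch == "?" or ch < node.ch:
        yield from search(node.lo, pattern, i, prefix)
    if ch == "?" or ch == node.ch:
        if i + 1 == len(pattern):
            if node.is_word:
                yield prefix + node.ch
        else:
            yield from search(node.eq, pattern, i + 1, prefix + node.ch)
    if ch == "?" or ch > node.ch:
        yield from search(node.hi, pattern, i, prefix)
```

Each node stores one character and three links instead of a 26-way child table, which is where the space saving over a plain trie comes from.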
My first post had an error that Jason found: it did not work well when ?? was at the beginning. I have now borrowed the cyclic shifts from Anna.
My solution:
Introduce an end-of-word character (#) and store all cyclically shifted words in sorted arrays! Use one sorted array for each word length. When looking for "th??e#", shift the string to move the ?-marks to the end (obtaining e#th??), pick the array containing words of length 5, and binary search for the first word occurring after the string "e#th". The following entries that start with "e#th" all match, i.e., we will find e#thes (these), e#thos (those), etc.
The solution has time complexity O(log N), where N is the size of the dictionary, and it expands the size of the data by a factor of 6 or so (the average word length).
Here's how I'd do it:
1. Concatenate the words of the dictionary into one long String separated by a non-word character.
2. Put all words into a TreeMap, where the key is the word and the value is the offset of the start of the word in the big String.
3. Find the base of the search string; i.e. the largest leading substring that doesn't include a '?'.
4. Use TreeMap.higherKey(base) and TreeMap.lowerKey(next(base)) to find the range within the String between which matches will be found. (The next method needs to calculate the next larger word to the base string with the same number or fewer characters; e.g. next("aa") is "ab", next("az") is "b".)
5. Create a regex for the search string and use Matcher.find() to search the substring corresponding to the range.
Steps 1 and 2 are done beforehand giving a data structure using O(NlogN) space where N is the number of words.
This approach degenerates to a brute-force regex search of the entire dictionary when the '?' appears in the first position, but the further to the right it is, the less matching needs to be done.
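The same idea can be sketched in Python. Note the substitutions: a sorted list with `bisect` stands in for the TreeMap, and the regex is applied word by word instead of to a slice of one concatenated String. `next_key` plays the role of the next method described above and assumes lowercase a-z words; all names here are hypothetical.

```python
import bisect
import re

def next_key(base):
    """Smallest string greater than every string that starts with base.

    Assumes lowercase a-z. Returns None when there is no upper bound.
    """
    base = base.rstrip("z")
    if not base:
        return None
    return base[:-1] + chr(ord(base[-1]) + 1)

def lookup(sorted_words, pattern):
    """Return words matching pattern, scanning only the range for its base."""
    base = pattern.split("?", 1)[0]      # leading part before the first '?'
    lo = bisect.bisect_left(sorted_words, base)
    upper = next_key(base)
    if upper is None:
        hi = len(sorted_words)
    else:
        hi = bisect.bisect_left(sorted_words, upper)
    regex = re.compile(pattern.replace("?", "."))
    return [w for w in sorted_words[lo:hi] if regex.fullmatch(w)]
```

When the pattern starts with '?', the base is empty and the whole list is scanned, which is the brute-force degeneration mentioned above.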
EDIT:
To improve the performance in the case where '?' is the first character, create a secondary lookup table that records the start/end offsets of runs of words whose second character is 'a', 'b', and so on. This can be used in the case where the first non-'?' is the second character. You can use a similar approach for cases where the first non-'?' is the third character, fourth character and so on, but you end up with larger and larger numbers of smaller and smaller runs, and eventually this "optimization" becomes ineffective.
An alternative approach which requires significantly more space, but which is faster in most cases, is to prepare the dictionary data structure as above for all rotations of the words in the dictionary. For instance, the first rotation would consist of all words 2 characters or more with the first character of the word moved to the end of the word. The second rotation would be words of 3 characters or more with the first two characters moved to the end, and so on. Then to do the search, look for the longest sequence of non-'?' characters in the search string. If the index of the first character of this substring is N, use the Nth rotation to find the ranges, and search in the Nth rotation word list.
A lazy solution is to let SQLite or another DBMS do the job for you.
Just create an in-memory database, load your words and run a select using the LIKE operator.
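A quick sketch with Python's standard sqlite3 module. The table name `words` and the helper names are my own choices; the pattern is assumed to contain only letters and '?', which is translated to the single-character LIKE wildcard `_`. SQLite's LIKE is case-insensitive for ASCII letters by default.

```python
import sqlite3

def build_db(words):
    """Load the words into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")
    conn.executemany("INSERT OR IGNORE INTO words VALUES (?)",
                     ((w,) for w in words))
    conn.commit()
    return conn

def lookup(conn, pattern):
    """Return the words matching pattern, where '?' is any single character."""
    like = pattern.replace("?", "_")
    rows = conn.execute(
        "SELECT word FROM words WHERE word LIKE ? ORDER BY word", (like,))
    return [row[0] for row in rows]
```

Whether the primary-key index helps a given LIKE query depends on the pattern and on SQLite's settings, so a pattern that starts with a wildcard may still be a full scan.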
Summary: Use two compact binary-searched indexes, one of the words, and one of the reversed words. The space cost is 2N pointers for the indexes; almost all lookups go very fast; the worst case, "??e", is still decent. If you make separate tables for each word length, that'd make even the worst case very fast.
Details: Stephen C. posted a good idea: search an ordered dictionary to find the range where the pattern can appear. This doesn't help, though, when the pattern starts with a wildcard. You might also index by word-length, but here's another idea: add an ordered index on the reversed dictionary words; then a pattern always yields a small range in either the forward index or the reversed-word index (since we're told there are no patterns like ?ABCD?). The words themselves need be stored only once, with the entries of both structures pointing to the same words, and the lookup procedure viewing them either forwards or in reverse; but to use Python's built-in binary-search function I've made two separate string arrays instead, wasting some space. (I'm using a sorted array instead of a tree as others have suggested, as it saves space and goes at least as fast.)
Code:
import bisect, re

def forward(string): return string
def reverse(string): return string[::-1]

index_forward = sorted(line.rstrip('\n')
                       for line in open('/usr/share/dict/words'))
index_reverse = sorted(map(reverse, index_forward))

def lookup(pattern):
    "Return a list of the dictionary words that match pattern."
    # Use whichever index has the longer known prefix.
    if reverse(pattern).find('?') <= pattern.find('?'):
        key, index, fixup = pattern, index_forward, forward
    else:
        key, index, fixup = reverse(pattern), index_reverse, reverse
    assert all(c.isalpha() or c == '?' for c in pattern)
    lo = bisect.bisect_left(index, key.replace('?', 'A'))
    hi = bisect.bisect_right(index, key.replace('?', 'z'))
    r = re.compile(pattern.replace('?', '.') + '$')
    return list(filter(r.match, (fixup(index[i]) for i in range(lo, hi))))
Tests: (The code also works for patterns like ?AB?D?, though without the speed guarantee.)
>>> lookup('hello')
['hello']
>>> lookup('??llo')
['callo', 'cello', 'hello', 'uhllo', 'Rollo', 'hollo', 'nullo']
>>> lookup('hel??')
['helio', 'helix', 'hello', 'helly', 'heloe', 'helve']
>>> lookup('he?l')
['heal', 'heel', 'hell', 'heml', 'herl']
>>> lookup('hx?l')
[]
Efficiency: This needs 2N pointers plus the space needed to store the dictionary-word text (in the tuned version). The worst-case time comes on the pattern '??e' which looks at 44062 candidates in my 235k-word /usr/share/dict/words; but almost all queries are much faster, like 'h??lo' looking at 190, and indexing first on word-length would reduce '??e' almost to nothing if we need to. Each candidate-check goes faster than the hashtable lookups others have suggested.
This resembles the rotations-index solution, which avoids all false match candidates at the cost of needing about 10N pointers instead of 2N (supposing an average word-length of about 10, as in my /usr/share/dict/words).
You could do a single binary search per lookup, instead of two, using a custom search function that searches for both low-bound and high-bound together (so the shared part of the search isn't repeated).
If you only have ? wildcards, no * wildcards that match a variable number of characters, you could try this: For each character index, build a dictionary from characters to sets of words. i.e. if the words are write, wrote, drate, arete, arite, your dictionary structure would look like this:
Character Index 0:
'a' -> {"arete", "arite"}
'd' -> {"drate"}
'w' -> {"write", "wrote"}
Character Index 1:
'r' -> {"write", "wrote", "drate", "arete", "arite"}
Character Index 2:
'a' -> {"drate"}
'e' -> {"arete"}
'i' -> {"write", "arite"}
'o' -> {"wrote"}
...
If you want to look up a?i?? you would take the set that corresponds to character index 0 => 'a' => {"arete", "arite"} and the set that corresponds to character index 2 => 'i' => {"write", "arite"} and take the set intersection, which gives {"arite"}.
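The worked example above can be reproduced in a few lines of Python. This is only a sketch: it keys the index by (position, character) tuples, which is my own choice, and it assumes all words have the same length, so a real dictionary would need one such index per word length.

```python
words = ["write", "wrote", "drate", "arete", "arite"]

# index[(position, character)] -> set of words with that character there
index = {}
for word in words:
    for i, ch in enumerate(word):
        index.setdefault((i, ch), set()).add(word)

def lookup(pattern):
    """Intersect the sets for every known character of the pattern."""
    sets = [index.get((i, ch), set())
            for i, ch in enumerate(pattern) if ch != "?"]
    if not sets:
        return set(words)  # only wildcards: everything matches
    return set.intersection(*sets)
```

Calling `lookup("a?i??")` intersects the sets for (0, 'a') and (2, 'i').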
If you seriously want something on the order of a billion searches per second (though I can't dream why anyone outside of someone making the next grand-master scrabble AI or something for a huge web service would want that fast), I recommend utilizing threading to spawn [number of cores on your machine] threads + a master thread that delegates work to all of those threads. Then apply the best solution you have found so far and hope you don't run out of memory.
An idea I had is that you can speed up some cases by preparing sliced-down dictionaries by letter. Then, if you know the first letter of the selection, you can resort to looking in a much smaller haystack.
Another thought I had was that you were trying to brute-force something -- perhaps build a DB or list or something for scrabble?

Resources