I call it the "AI way" because I want to make an application that plays the hangman game without any human interaction.
The scenario is like this:
There is a word list containing hundreds of thousands of English words.
The application picks a certain number of words, e.g. 20, from the list.
The application plays Hangman against each word until it either wins or fails.
The restriction is the maximum number of wrong guesses.
26 obviously does not make sense, so let's say the maximum is 6 wrong guesses.
I tried the strategy mentioned on the wiki page but it does not work well.
Basically, the success rate is about 30%.
Any suggestions or comments regarding the strategy, and which field should I dig into in order to find a fairly good strategy?
Thanks a lot.
-Simon
PS: A JavaScript implementation which seems to work fairly well.
(https://github.com/freizl/play-hangman-game)
Updated Idea
Download a dictionary of words and put it into some database or structure of your choice
When presented with a word, narrow your guesses to words of the same length and perform a letter frequency distribution (you can use a dictionary and/or list collection for fast distribution analysis and sorting)
Pick the most common letter from this list
If the letter is found, create a regex pattern based on the known letter(s) and the word length and repeat from step 2
You should be able to quickly narrow down to a single word resulting from your pattern search (a short sketch of this loop follows below)
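Here is a minimal Python sketch of the loop above, assuming a plain list of lowercase words and a known secret word to play against; the regex is rebuilt from the revealed letters after every correct guess.

import re
from collections import Counter

def play(secret, dictionary, max_misses=6):
    # Steps 1-2: keep only words of the right length and analyse letter frequency on them.
    candidates = [w for w in dictionary if len(w) == len(secret)]
    revealed = ["_"] * len(secret)
    guessed, misses = set(), 0
    while misses < max_misses and "_" in revealed:
        freq = Counter(c for w in candidates for c in set(w) if c not in guessed)
        if not freq:
            break
        letter = freq.most_common(1)[0][0]          # step 3: the most common letter
        guessed.add(letter)
        if letter in secret:
            revealed = [c if c == letter else r for c, r in zip(secret, revealed)]
            # step 4: regex from the known letters and word length, e.g. "_a___a_" -> ".a...a."
            pattern = re.compile("".join(c if c != "_" else "." for c in revealed) + "$")
            candidates = [w for w in candidates
                          if pattern.match(w) and w.count(letter) == revealed.count(letter)]
        else:
            misses += 1
    return "_" not in revealed                       # True if the word was fully revealed in time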
For posterity:
Take a look at this wiki page. It includes a table of frequencies of the first letters of words which may help you tune your algorithm.
You could also take into account the fact that if you find a vowel or two in a word, the likelihood of finding other vowels decreases significantly, and you should then try more common consonants instead. The example from the wiki page you listed starts with E, then T, and then tries three vowels in a row: A, O and I. The first two letters are missed, but once the third letter is found twice, the process should switch to common consonants and skip trying for more vowels, since there will likely be fewer.
Any useful strategy will certainly employ frequency distribution charts on letters, and possibly on words: some words are very common while others are rarely used, so performing a letter frequency distribution on a set of more common words might help. The guess is that some words may appear more frequently than others, but that depends on your word selection algorithm, which might not take "common" usage into account.
You could also build specialized letter frequency tables, possibly even on the fly. For example, given the wikipedia hangman example: you find the letter A twice in a word, in two locations, the 2nd and the 6th. You know that the word has seven letters, and with a fairly simple regex you could isolate the words from a dictionary that match this pattern:
_ a _ _ _ a _
Then perform a letter frequency on that set of words that matches this pattern and use that set for your next guess. Rinse and repeat. I think doing some of those things I mentioned but especially the last will really increase your odds of success.
The strategies in the linked page seem to be "order guesses by letter frequency" and "guess the vowels, then order guesses by letter frequency"
A couple of observations about hangman:
1) Since guessing a letter that isn't in the word hurts us, we should guess letters by word frequency (percentage of words that contain letter X), not letter frequency (number of times that X appears in all words). This should minimise our chances of making a bad guess.
2) Once we've guessed some letters correctly, we know more about the word we're trying to guess.
Here are two strategies that should beat the letter frequency strategy. I'm going to assume we have a dictionary of words that might come up.
If we expect the word to be in our dictionary:
1) We know the length of the target word, n. Remove all words in the dictionary that aren't of length n
2) Calculate the word frequency of all letters in the dictionary
3) Guess the most frequent letter that we haven't already guessed.
4) If we guessed correctly, remove all words from the dictionary that don't match the revealed letters.
5) If we guessed incorrectly, remove all words that contain the incorrectly guessed letter
6) Go to step 2
For maximum effect, instead of calculating word frequencies of all letters in step 2, calculate the word frequencies of all letters in positions that are still blank in the target word.
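A minimal sketch of the core of steps 2-5 in Python; `candidates` is the current list of same-length words, `guessed` is the set of letters already tried, and `revealed_positions` is the set of positions where the last guess turned out to appear (empty if it missed). These names are just for illustration.

from collections import Counter

def next_guess(candidates, guessed):
    # Steps 2-3: score each unguessed letter by how many candidate words contain it at least once
    # (word frequency), then pick the highest-scoring one.
    word_freq = Counter(c for w in candidates for c in set(w) if c not in guessed)
    return word_freq.most_common(1)[0][0]

def prune(candidates, letter, revealed_positions):
    if revealed_positions:
        # Step 4: keep only words with the letter exactly at the revealed positions.
        return [w for w in candidates
                if all(w[i] == letter for i in revealed_positions)
                and all(w[i] != letter for i in range(len(w)) if i not in revealed_positions)]
    # Step 5: a wrong guess eliminates every word containing that letter.
    return [w for w in candidates if letter not in w]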
If we don't expect the word to be in our dictionary:
1) From the dictionary, build up a table of n-grams for some value of n (say 2). If you haven't come across n-grams before, they are groups of consecutive letters inside the word. For example, if the word is "word", the 2-grams are {^w,wo,or,rd,d$}, where ^ and $ mark the start and the end of the word. Count the word frequency of these 2-grams.
2) Start by guessing single letters by word frequency as above
3) Once we've had some hits, we can use the table of word frequency of n-grams to determine either letters to eliminate from our guesses, or letters that we're likely to be able to guess. There are a lot of ways you could achieve this:
For example, you could use 2-grams to determine that the blank in w_rd is probably not z. Or, you could determine that the character at the end of the word ___e_ might (say) be d or s.
Alternatively you could use the n-grams to generate the list of possible characters (though this might be expensive for long words). Remember that you can always cross off all n-grams that contain letters you've guessed that aren't in the target word.
Remember that at each step you're trying not to make a wrong guess, since that keeps us alive. If the n-grams tell you that one position is likely to only be (say) a,b or c, and your word frequency table tells you that a appears in 30% of words, but b and c only appear in 10%, then guess a.
For maximum benefit, you could combine the two strategies.
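A small sketch of step 1 of the second strategy: counting the word frequency of 2-grams, with ^ and $ as boundary markers, over a plain list of lowercase words.

from collections import Counter

def bigram_word_freq(dictionary):
    counts = Counter()
    for word in dictionary:
        padded = "^" + word + "$"
        # count each 2-gram at most once per word, so these are word frequencies
        counts.update(set(padded[i:i + 2] for i in range(len(padded) - 1)))
    return counts

# bigram_word_freq(["word"]) counts {^w, wo, or, rd, d$} once each, as in the example above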
The strategy discussed is one suitable for humans to implement. Since you're writing an AI you can throw computing power at it to get a better result.
Take your word list and filter it down to only those words which match what information you have about the target word. (At the start that will only be the word length.) For each letter A through Z, note how many words contain it at least once (this is different from the total count of the letter). Pick the letter with the highest score.
You MIGHT even be able to run multiple cycles of this in computing a guess but that might prove too much even for modern CPUs.
Clarification: I'm saying that you might be able to run a look-ahead. If we choose "A" at this level what options does that present for the next level? This is an O(x^n) algorithm, obviously you can't go too far down that path.
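One possible reading of that one-level look-ahead, as a Python sketch: for each unguessed letter, partition the surviving candidates by the positions the letter would reveal, and prefer the letter whose expected number of remaining candidates is smallest. This is only an illustration; it ignores the cost of losing a life on a miss.

from collections import defaultdict
from string import ascii_lowercase

def lookahead_guess(candidates, guessed):
    best_letter, best_expected = None, float("inf")
    for letter in set(ascii_lowercase) - guessed:
        buckets = defaultdict(int)
        for w in candidates:
            # the "answer" the game would give us: where the letter appears (possibly nowhere)
            buckets[tuple(i for i, c in enumerate(w) if c == letter)] += 1
        # expected number of candidates left after hearing that answer
        expected = sum(n * n for n in buckets.values()) / len(candidates)
        if expected < best_expected:
            best_letter, best_expected = letter, expected
    return best_letter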
Related
Given a list of lowercase random words, each word of the same length, and many patterns, each with letters specified at some positions while the other letters are unknown, find all words that match each pattern.
For example, words list is:
["ixlwnb","ivknmt","vvqnbl","qvhntl"]
And patterns are:
i-----
-v---l
-v-n-l
With a naive algorithm, one can do an O(NL) traversal for each pattern, where N is the word count and L is the word length.
But since there may be a lot of patterns run against the same word list, is there any good data structure to preprocess and store the word list, and then give efficient matching for all patterns?
One simple idea is to use an inverted index. First, number your words -- you'll refer to them using these indices rather than the words themselves for speed and space efficiency. Probably the index fits in a 32-bit int.
Now your inverted index: for each letter in each position, construct a sorted list of IDs for words that have that letter in that location.
To do your search, you take the lists of IDs for each of the letters in the positions you're given, and take the intersection of the lists, using an algorithm like the "merge" in merge-sort. All IDs in the intersection match the input.
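A compact sketch of that inverted index in Python, assuming the word IDs are simply indices into the input list; the intersection uses a merge-style walk over sorted ID lists.

from collections import defaultdict

def build_index(words):
    # (position, letter) -> sorted list of word IDs with that letter at that position
    index = defaultdict(list)
    for wid, w in enumerate(words):        # IDs are assigned in order, so each list stays sorted
        for pos, ch in enumerate(w):
            index[(pos, ch)].append(wid)
    return index

def intersect(a, b):
    # merge-style intersection of two sorted ID lists
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def match(words, index, pattern):          # pattern like "-v-n-l"
    lists = [index.get((pos, ch), []) for pos, ch in enumerate(pattern) if ch != "-"]
    if not lists:
        return list(words)
    ids = lists[0]
    for other in lists[1:]:
        ids = intersect(ids, other)
    return [words[i] for i in ids]

# With the example list above, match(words, build_index(words), "-v-n-l") gives ["vvqnbl", "qvhntl"].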
Alternatively, if your words are short enough (12 characters or fewer), you could compress them into 64-bit words (using 5 bits per letter, with letters coded 1-26). Construct a bit-mask with binary 11111 in places where you have a letter and 00000 in places where you have a blank, and a bit-test from your input with the 5-bit code for each letter in each place, using 00000 where you have blanks. For example, if your input is a-c then your bitmask will be binary 111110000011111 and your bit-test binary 000010000000011. Go through your word list, take the bitwise AND of each packed word with the bit-mask and test whether it equals the bit-test value. This is cache-friendly and the inner loop is tight, so it may be competitive with algorithms that look faster on paper.
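A sketch of that packing in Python (Python integers are arbitrary precision, but the same layout fits in 64 bits for words of up to 12 letters). The first letter goes into the most significant 5-bit group, matching the a-c example; in practice you would pack the whole word list once up front.

def pack(word):
    value = 0
    for ch in word:
        value = (value << 5) | (ord(ch) - ord('a') + 1)   # letter codes 1..26
    return value

def pack_pattern(pattern):                                 # e.g. "a-c", '-' marks a blank
    mask = test = 0
    for ch in pattern:
        mask = (mask << 5) | (0b11111 if ch != '-' else 0)
        test = (test << 5) | ((ord(ch) - ord('a') + 1) if ch != '-' else 0)
    return mask, test

def match(words, pattern):
    mask, test = pack_pattern(pattern)
    # one AND and one compare per word in a tight linear scan
    return [w for w in words if (pack(w) & mask) == test]

# pack_pattern("a-c") == (0b111110000011111, 0b000010000000011), as in the example above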
I'll preface this by saying it's more of a comment and less of an answer (I don't have enough reputation to comment, though). I can't think of any data structure that will satisfy the requirements out of the box. It was interesting to think about, and I figured I'd share one potential solution that popped into my head.
I keyed in on the "same length" part, and figured I could come up with something based on that.
In theory we could have N(N being the length) maps of char -> set.
When strings are added, it goes through each character and adds the string to the corresponding set. Pseudocode:
firstCharMap[s[0]].insert(s);
secondCharMap[s[1]].insert(s);
thirdCharMap[s[2]].insert(s);
fourthCharMap[s[3]].insert(s);
fifthCharMap[s[4]].insert(s);
sixthCharMap[s[5]].insert(s);
Then, to determine which strings match the pattern, we just take the intersection of the sets. For example, "-v-n-l" would be:
intersection of sets: secondCharMap[v], fourthCharMap[n], sixthCharMap[l]
One edge case that jumps out is if I wanted to just get all of the strings, so if that's a requirement--we may also need an additional set of all of the strings.
This solution feels clunky, but I think it could work. Depending on the language, number of strings, etc--I wouldn't be surprised if it performed worse than just iterating over all strings and checking a predicate.
I am looking for an algorithm, but I don't know the name of the problem so I can't find anything. Hopefully my explanation of the problem makes sense!
Let's say you have a long list of phrases, where each phrase is a set of words. The user inputs a list of words, and their list "matches" a phrase if every word in the phrase is found in their list. A list's "score" is the number of phrases it matches. The goal is to provide the user with a list of words that would most improve their list's score.
Here's a simple example. We have ten phrases:
wood cabin
camping in woods
camping cabin
fun camping
bon fire
camping fire
swimming hole
fun cabin
wood fire
fire place
And the user provides this list:
wood
fun
camping
We match phrase 4, so the score is 1. But if the user adds "cabin" to their list, they will also match phrases 1, 3 and 8 and get a score of 4. "fire" would add 2 to the score.
With the trivially short list, there isn't any complicated problem, as you can just iterate through the options in almost no time. But as the list grows to the hundreds of thousands, it starts taking hundreds of milliseconds. It feels like there should be a way to build an index to make the process faster, but I can't think of what the index's structure would be.
Anyone who took the time to read all this, thank you! Hopefully someone knows what I'm talking about.
You need to map words to number of occurrences. If you use a hash table you can do it very quickly (O(N) - with N being the number of words in the phrases) - loop over all phrases, break them into words, if the word is already in the map increment its count, if not - add it to the map with count 1.
To compute the score of the input, just loop over the input words and accumulate the number of occurrences. O(M) - this time M being the number of input words.
I doubt you can get better complexity (you need to scan the phrases at least once), and with a proper implementation of a map (available in almost all modern languages) - it will be fast as well.
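A minimal sketch of that map in Python, assuming `phrases` is a list of strings and the input is a list of words.

from collections import Counter

def build_counts(phrases):
    counts = Counter()
    for phrase in phrases:
        counts.update(phrase.split())           # O(N) over all words in all phrases
    return counts

def score(counts, input_words):
    return sum(counts[w] for w in input_words)  # O(M) over the input words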
Suffix tree.
They're rather fiddly and complicated things, but basically we store a node for each character (26 * 2), then we store the suffixes for each character, so entries for th and an and so on, but presumably not for qj or other combinations which won't occur. Then you get the suffixes for those, (so the, thr, and and so on, but plenty of combinations of three letters disallowed).
It allows for very fast searching, which doesn't have to be exact. If we want to match a*d we simply follow all the suffixes of a, then only d suffixes, then we insist on nul.
I was asked a question
You are given a list of characters, a score associated with each character, and a dictionary of valid words (say, a normal English dictionary). You have to form a word out of the character list such that the score is maximum and the word is valid.
I could think of a solution involving a trie made out of dictionary and backtracking with available characters, but could not formulate properly. Does anyone know the correct approach or come up with one?
First, iterate over your letters and count how many times you have each character of the English alphabet. Store this in an array of size 26, where the first cell corresponds to a, the second to b, and so on. Name this array cnt. Now iterate over all words, and for each word form a similar array of size 26. For each cell in this array, check whether you have at least as many occurrences in cnt. If that is the case, you can write the word; otherwise you can't. If you can write the word, compute its score and keep the maximum score in a helper variable.
This approach has linear complexity, and this is also the best asymptotic complexity you can possibly have (after all, the input you're given is of linear size).
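A short sketch of that linear scan in Python, assuming `letters` is the multiset of available characters, `score` maps each character to its value, and `dictionary` is a list of valid words.

from collections import Counter

def best_word(letters, score, dictionary):
    cnt = Counter(letters)                              # the 26-cell count array, as a Counter
    best, best_value = None, -1
    for word in dictionary:
        need = Counter(word)
        if all(cnt[c] >= k for c, k in need.items()):   # can the word be spelled from our letters?
            value = sum(score[c] for c in word)
            if value > best_value:
                best, best_value = word, value
    return best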
Inspired by Programmer Person's answer (initially I thought that approach was O(n!) so I discarded it). It needs O(nr of words) setup and then O(2^(chars in query)) for each question. This is exponential, but in Scrabble you only have 7 letter tiles at a time; so you need to check only 128 possibilities!
First observation is that the order of characters in query or word doesn't matter, so you want to process your list into a set of bag of chars. A way to do that is to 'sort' the word so "bac", "cab" become "abc".
Now you take your query, and iterate all possible answers. All variants of keep/discard for each letter. It's easier to see in binary form: 1111 to keep all, 1110 to discard the last letter...
Then check if each possibility exists in your dictionary (hash map for simplicity), then return the one with the maximum score.
import nltk
from string import ascii_lowercase
from itertools import product

scores = {c:s for s, c in enumerate(ascii_lowercase)}
sanitize = lambda w: "".join(c for c in w.lower() if c in scores)
anagram = lambda w: "".join(sorted(w))
anagrams = {anagram(sanitize(w)):w for w in nltk.corpus.words.words()}

while True:
    query = input("What do you have?")
    if not query: break
    # make it look like our preprocessed word list
    query = anagram(sanitize(query))
    results = {}
    # all variants for our query
    for mask in product((True, False), repeat=len(query)):
        # get the variant given the mask
        masked = "".join(c for i, c in enumerate(query) if mask[i])
        # check if it's valid
        if masked in anagrams:
            # score it, also getting the word back would be nice
            results[sum(scores[c] for c in masked)] = anagrams[masked]
    print(*max(results.items()))
Build a lookup trie of just the sorted-anagram of each word of the dictionary. This is a one time cost.
By sorted anagram I mean: if the word is eat, you represent it as aet. If the word is tea, you also represent it as aet; bubble is represented as bbbelu, and so on.
Since this is Scrabble, assuming you have 8 tiles (say you want to use one from the board), you will need to check at most 2^8 possibilities.
For any subset of the tiles from the set of 8, you sort the tiles, and lookup in the anagram trie.
There are at most 2^8 such subsets, and this could potentially be optimized (in case of repeating tiles) by doing a more clever subset generation.
If this is a more general problem, where 2^{number of tiles} could be much higher than the total number of anagram-words in the dictionary, it might be better to use frequency counts as in Ivaylo's answer, and the lookups potentially can be optimized using multi-dimensional range queries. (In this case 26 dimensions!)
Sorry, this might not help you as-is (I presume you are trying to do some exercise and have constraints), but I hope this will help the future readers who don't have those constraints.
If the number of dictionary entries is relatively small (up to a few million) you can use brute force: For each word, create a 32 bit mask. Preprocess the data: Set one bit if the letter a/b/c/.../z is used. For the six most common English characters etaoin set another bit if the letter is used twice.
Create a similar bitmap for the letters that you have. Then scan the dictionary for words where all bits that are needed for the word are set in the bitmap for the available letters. You have reduced the problem to words where you have all needed characters once, and the six most common characters twice if they are needed twice. You'll still have to check whether a word can actually be formed, in case you have a word like "bubble" and the first test only tells you that you have the letters b, u, l, e but not necessarily 3 b's.
By also sorting the list of words by point values before doing the check, the first hit is the best one. This has another advantage: You can count the points that you have, and don't bother checking words with more points. For example, bubble has 12 points. If you have only 11 points, then there is no need to check this word at all (have a small table with the indexes of the first word with any given number of points).
To improve anagrams: In the table, only store different bitmasks with equal number of points (so we would have entries for bubble and blue because they have different point values, but not for team and mate). Then store all the possible words, possibly more than one, for each bit mask and check them all. This should reduce the number of bit masks to check.
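A sketch of that prefilter in Python; the six extra "used twice" bits for etaoin sit above the 26 letter bits, and the final Counter check catches cases like the three b's in "bubble". In practice you would precompute the mask for every dictionary word.

from collections import Counter

COMMON = "etaoin"                                  # the six most common letters get a "twice" bit

def mask(text):
    m = 0
    for c, k in Counter(text).items():
        m |= 1 << (ord(c) - ord('a'))              # bits 0..25: letter used at least once
        if k >= 2 and c in COMMON:
            m |= 1 << (26 + COMMON.index(c))       # bits 26..31: common letter used twice
    return m

def playable(words, rack):
    rack_mask, rack_counts = mask(rack), Counter(rack)
    result = []
    for w in words:
        if mask(w) & ~rack_mask:                   # some needed bit is missing: reject cheaply
            continue
        if all(rack_counts[c] >= k for c, k in Counter(w).items()):   # exact multiset check
            result.append(w)
    return result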
Here is a brute force approach in python, using an english dictionary containing 58,109 words. This approach is actually quite fast timing at about .3 seconds on each run.
from random import shuffle
from string import ascii_lowercase
import time

def getValue(word):
    return sum(map(lambda x: key[x], word))

if __name__ == '__main__':
    v = range(26)
    shuffle(v)
    key = dict(zip(list(ascii_lowercase), v))
    with open("/Users/james_gaddis/PycharmProjects/Unpack Sentance/hard/words.txt", 'r') as f:
        wordDict = f.read().splitlines()
    f.close()
    valued = map(lambda x: (getValue(x), x), wordDict)
    print max(valued)
Here is the dictionary I used, with one hyphenated entry removed for convenience.
Can we assume that the dictionary is fixed, the scores are fixed, and only the available letters change (as in Scrabble)? Otherwise, I think there is no better approach than looking up each word of the dictionary, as previously suggested.
So let's assume that we are in this setting. Pick an order < that respects the values of the letters. For instance Q > Z > J > X > K > ... > A > E > I > ... > U.
Replace your dictionary D with a dictionary D' made of the anagrams of the words of D with letters ordered by the previous order (so the word buzz is mapped to zzbu, for instance), and also removing duplicates and words of length > 8 if you have at most 8 letters in your game.
Then construct a trie with the words of D' where the children of a node are ordered by the value of their letters (so the first child of the root would be Q, the second Z, ..., the last one U). On each node of the trie, also store the maximal value of a word going through this node.
Given a set of available characters, you can explore the trie in a depth-first manner, going from left to right, and keeping in memory the current best value found. Only explore branches whose node value is larger than your current best value. This way, you will explore only a few branches after the first ones (for instance, if you have a Z in your game, any branch that starts with a one-point letter such as A is discarded, because it will score at most 8x1, which is less than the value of Z). I bet that you will explore only a very few branches each time.
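A sketch of that pruned search in Python, with assumed Scrabble-style letter values (swap in your own scoring). Words are inserted as anagrams ordered by descending letter value, every node remembers the best word value reachable through it, and the depth-first search abandons any branch that cannot beat the best score found so far.

from collections import Counter

# assumed Scrabble-style values; replace with your own scoring table
VALUES = {c: v for c, v in zip("abcdefghijklmnopqrstuvwxyz",
                               [1,3,3,2,1,4,2,4,1,8,5,1,3,1,1,3,10,1,1,1,1,4,4,8,4,10])}

def word_value(w):
    return sum(VALUES[c] for c in w)

def ordered_anagram(w):
    # letters sorted by descending value, so "buzz" -> "zzbu"
    return "".join(sorted(w, key=lambda c: (-VALUES[c], c)))

class Node:
    def __init__(self):
        self.children = {}      # letter -> Node
        self.best = 0           # max value of any word passing through this node
        self.word = None        # a word ending exactly here, if any

def build_trie(words, max_len=8):
    root = Node()
    for w in words:
        if len(w) > max_len:
            continue
        v, node = word_value(w), root
        node.best = max(node.best, v)
        for c in ordered_anagram(w):
            node = node.children.setdefault(c, Node())
            node.best = max(node.best, v)
        if node.word is None or word_value(node.word) < v:
            node.word = w
    return root

def best_playable(root, rack):
    rack = Counter(rack)
    best = [None, 0]                                   # [word, value] found so far
    def dfs(node):
        if node.word is not None and word_value(node.word) > best[1]:
            best[:] = [node.word, word_value(node.word)]
        # visit children from the highest-valued letter down, pruning hopeless branches
        for c, child in sorted(node.children.items(), key=lambda kv: -VALUES[kv[0]]):
            if child.best <= best[1]:
                continue                               # cannot beat the current best: prune
            if rack[c] > 0:
                rack[c] -= 1
                dfs(child)
                rack[c] += 1
    dfs(root)
    return best[0]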
The question is basically "how do I generate a good grid for the game 'Boggle' with lots of words" where good is defined as having lots of words of 5 or more letters.
Boggle is a game where you roll dice with letters on them, they are placed in a 4x4 grid. Example:
H S A V
E N I S
K R G I
S O L A
Words can be made by connecting letters horizontally, vertically or diagonally. In the good example grid above you can make the words "VANISHERS", "VANISHER", "KNAVISH", "ALIGNERS", "SAVINGS", "SINKERS" and around 271 other words depending on the dictionary used, for example "AS", "I", "AIR", "SIN", "IS", etc...
As a bad example this grid
O V W C
T K Z O
Y N J H
D E I E
only has ~44 words, only 2 of which are more than 4 letters long: "TYNED" and "HINKY".
There's lots of similar questions but AFAICT not this exact question. This is obviously a reference to the game "Scramble with Friends".
The first solution, picking letters at random, has the problem that if you accidentally pick all consonants there will be no words. Adding a few random vowels is not enough to guarantee a good set of words. You might only get 1-to-4-letter words, whereas a good algorithm will choose a set of letters that yields over 200 words, many of them longer than 7 letters.
I'm open to any algorithm. Obviously I could write code to brute force solutions finding every possible grid and then sorting them by grids with the most words but that simple solution would take forever to run.
I can imagine various heuristics like choosing a long word (8-16 letters), putting those letters in the grid at random but in a way that can actually still make the word and then filling in the left over spaces. I suspect that's also not enough to guarantee a good set of words though I haven't tried it yet.
It's possible the solution requires pre-processing a dictionary to know common parts of words. For example all words that end in "ing" or "ers" or "ght" or "tion" or "land". Or somehow organizing them into a graph of shared letters. Maybe weighting certain sets of letters so "ing" or "ers" are inserted often.
Ideas?
Short of the brute-force search proposal there is probably no way to guarantee that you have a good grid. If you use the letter frequency as found on the Boggle dice, then you will get 'average' grids (exactly as if you roll the dice). You could improve this by adding extra heuristics or filters, for example:
ensure that (almost) every consonant is 'in-reach-of' a vowel
ensure 'Q' is 'in-reach-of' a 'U'
ensure the ratio of vowels to consonants is within a set range
ensure the number of rare consonants is not too large
etc
Then you could
set letters using weighted letter frequency
change (swap/replace) letters not meeting your heuristics
It would still be possible for a bad grid to get through unless you checked via brute-force, but you may be able to reduce the number of bad grids substantially from those returned by a simple randomly generated grid.
Alternately, generate random grids and do the brute force work as required to pick good grids. But do this in the background (days or weeks before needed). Then store a bunch of good grids and choose one randomly as required when needed (and cross it off your list so you don't see it again).
The way Boggle works is that there are six-sided dice with certain letters on their faces. The dice are randomly assigned to the 16 squares and then rolled. Common letters occur on more faces of the dice. Search around - you may be able to get the exact set of dice.
1) Calculate statistical letter frequencies and letter-pair frequencies from the dictionary.
2) Start by randomly choosing one of the four central squares.
3) Randomly choose a letter for that square, weighted by single-letter frequency.
4) Recursively:
4.1. Randomly choose one of all the empty connected squares.
4.2. Randomly choose a letter for that square weighted by the combination (average) of the dual letter frequencies of any connected filled square and the single letter frequencies of any connected empty square.
Et voila!
P.S. You might also want to experiment with adding a global letter derating based on its current count of appearances in the grid to 4.2.
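A simplified sketch of that procedure in Python, assuming a plain list of lowercase words. To keep it short, step 4.2 below only averages the pair frequencies with the already-filled neighbours and drops the empty-neighbour term.

import random
from collections import Counter
from string import ascii_lowercase

def frequencies(words):
    single, pair = Counter(), Counter()
    for w in words:
        single.update(w)
        pair.update(w[i:i+2] for i in range(len(w) - 1))
        pair.update(w[i+1] + w[i] for i in range(len(w) - 1))   # adjacency is undirected, count both orders
    return single, pair

def neighbours(r, c, n=4):
    return [(r+dr, c+dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr or dc) and 0 <= r+dr < n and 0 <= c+dc < n]

def make_grid(words, n=4):
    single, pair = frequencies(words)
    letters = list(ascii_lowercase)
    grid = [[None] * n for _ in range(n)]
    # steps 2-3: seed one of the four central squares, weighted by single-letter frequency
    r, c = random.choice([(1, 1), (1, 2), (2, 1), (2, 2)])
    grid[r][c] = random.choices(letters, weights=[single[l] + 1 for l in letters])[0]
    for _ in range(n * n - 1):
        # step 4.1: a random empty square adjacent to something already filled
        frontier = [(i, j) for i in range(n) for j in range(n)
                    if grid[i][j] is None and any(grid[a][b] for a, b in neighbours(i, j, n))]
        r, c = random.choice(frontier)
        # step 4.2: weight each letter by its average pair frequency with the filled neighbours
        weights = []
        for l in letters:
            ps = [pair[grid[a][b] + l] for a, b in neighbours(r, c, n) if grid[a][b]]
            weights.append(sum(ps) / len(ps) + 1)
        grid[r][c] = random.choices(letters, weights=weights)[0]
    return grid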
I have a lot of compound strings that are a combination of two or three English words.
e.g. "Spicejet" is a combination of the words "spice" and "jet"
I need to separate these individual English words from such compound strings. My dictionary is going to consist of around 100000 words.
What would be the most efficient way to separate the individual English words from such compound strings?
I'm not sure how much time you have, or how frequently you have to do this (is it a one-time operation? daily? weekly?), but you're obviously going to want a quick, weighted dictionary lookup.
You'll also want to have a conflict resolution mechanism, perhaps a side-queue to manually resolve conflicts on tuples that have multiple possible meanings.
I would look into Tries. Using one you can efficiently find (and weight) your prefixes, which are precisely what you will be looking for.
You'll have to build the Tries yourself from a good dictionary source, and weight the nodes on full words to provide yourself a good quality mechanism for reference.
Just brainstorming here, but if you know your dataset consists primarily of duplets or triplets, you could probably get away with multiple Trie lookups, for example looking up 'Spic' and then 'ejet', finding that both results have a low score, and abandoning that split in favour of 'Spice' and 'Jet', where both Tries would yield a good combined result between the two.
Also I would consider utilizing frequency analysis on the most common prefixes up to an arbitrary or dynamic limit, e.g. filtering 'the' or 'un' or 'in' and weighting those accordingly.
Sounds like a fun problem, good luck!
If the aim is to find "the largest possible break up for the input", as you replied, then the algorithm could be fairly straightforward if you use some graph theory. You take the compound word and make a graph with a vertex before and after every letter. You'll have a vertex for each index in the string and one past the end. Next you find all legal words in your dictionary that are substrings of the compound word. Then, for each legal substring, add an edge with weight 1 to the graph connecting the vertex before the first letter in the substring with the vertex after the last letter in the substring. Finally, use a shortest path algorithm to find the path with the fewest edges between the first and the last vertex.
The pseudo code is something like this:
parseWords(compoundWord)
    # Make the graph
    graph = makeGraph()
    N = compoundWord.length
    for index = 0 to N
        graph.addVertex(index)

    # Add the edges for each word
    for index = 0 to N - 1
        for length = 1 to min(N - index, MAX_WORD_LENGTH)
            potentialWord = compoundWord.substr(index, length)
            if dictionary.isElement(potentialWord)
                graph.addEdge(index, index + length, 1)

    # Now find a list of edges which define the shortest path
    edges = graph.shortestPath(0, N)

    # Change these edges back into words.
    result = makeList()
    for e in edges
        result.add(compoundWord.substr(e.start, e.stop - e.start))
    return result
I, obviously, haven't tested this pseudo-code, and there may be some off-by-one indexing errors, and there isn't any bug-checking, but the basic idea is there. I did something similar to this in school and it worked pretty well. The edge creation loops are O(M * N), where N is the length of the compound word, and M is the maximum word length in your dictionary or N (whichever is smaller). The shortest path algorithm's runtime will depend on which algorithm you pick. Dijkstra's comes most readily to mind. I think its runtime is O(N^2 * log(N)), since the max edges possible is N^2.
You can use any shortest path algorithm. There are several shortest path algorithms which have their various strengths and weaknesses, but I'm guessing that for your case the difference will not be too significant. If, instead of trying to find the fewest possible words to break up the compound, you wanted to find the most possible, then you give the edges negative weights and try to find the shortest path with an algorithm that allows negative weights.
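Since all the edge weights are 1, the shortest path can also be found with a plain BFS over character positions. Here is a small runnable Python version of the same idea, assuming `dictionary` is a set of lowercase words; the names are illustrative.

from collections import deque

def parse_words(compound, dictionary, max_word_len=30):
    n = len(compound)
    prev = {0: None}                       # vertex -> predecessor on a shortest path
    queue = deque([0])
    while queue:
        start = queue.popleft()
        if start == n:
            break
        for end in range(start + 1, min(n, start + max_word_len) + 1):
            if end not in prev and compound[start:end] in dictionary:
                prev[end] = start          # first visit to 'end' is along a fewest-words path
                queue.append(end)
    if n not in prev:
        return None                        # no split into dictionary words exists
    words, pos = [], n
    while pos:                             # walk the predecessors back to the start
        words.append(compound[prev[pos]:pos])
        pos = prev[pos]
    return list(reversed(words))

# parse_words("spicejet", {"spice", "spic", "jet"}) -> ['spice', 'jet']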
And how will you decide how to divide things? Look around the web and you'll find examples of URLs that turned out to have other meanings.
Assuming you didn't have the capitals to go on, what would you do with these (Ones that come to mind at present, I know there are more.):
PenIsland
KidsExchange
TherapistFinder
The last one is particularly problematic because the troublesome part is two words run together but is not a compound word, the meaning completely changes when you break it.
So, given a word, is it a compound word, composed of two other English words? You could have some sort of lookup table for all such compound words, but if you just examine the candidates and try to match against English words, you will get false positives.
Edit: looks as if I am going to have to go to provide some examples. Words I was thinking of include:
accustomednesses != accustomed + nesses
adulthoods != adult + hoods
agreeabilities != agree + abilities
willingest != will + ingest
windlasses != wind + lasses
withstanding != with + standing
yourselves != yours + elves
zoomorphic != zoom + orphic
ambassadorships != ambassador + ships
allotropes != allot + ropes
Here is some python code to try out to make the point. Get yourself a dictionary on disk and have a go:
from __future__ import with_statement

def opendict(dictionary=r"g:\words\words(3).txt"):
    with open(dictionary, "r") as f:
        return set(line.strip() for line in f)

if __name__ == '__main__':
    s = opendict()
    for word in sorted(s):
        if len(word) >= 10:
            for i in range(4, len(word)-4):
                left, right = word[:i], word[i:]
                if (left in s) and (right in s):
                    if right not in ('nesses', ):
                        print word, left, right
It sounds to me like you want to store your dictionary in a Trie or a DAWG data structure.
A Trie already stores words as compound words. So "spicejet" would be stored as "spice*jet*", where the * denotes the end of a word. All you'd have to do is look up the compound word in the dictionary and keep track of how many end-of-word terminators you hit. From there you would then have to try each substring (in this example, we don't yet know if "jet" is a word, so we'd have to look that up).
It occurs to me that there are a relatively small number of substrings (minimum length 2) from any reasonable compound word. For example for "spicejet" I get:
'sp', 'pi', 'ic', 'ce', 'ej', 'je', 'et',
'spi', 'pic', 'ice', 'cej', 'eje', 'jet',
'spic', 'pice', 'icej', 'ceje', 'ejet',
'spice', 'picej', 'iceje', 'cejet',
'spicej', 'piceje', 'icejet',
'spiceje', 'picejet'
... 27 substrings.
So, find a function to generate all of those (slide across your string using substring lengths of 2, 3, 4, ..., len(yourstring) - 1) and then simply check each of them against a set or hash table.
A similar question was asked recently: Word-separating algorithm. If you wanted to limit the number of splits, you would keep track of the number of splits in each of the tuples (so instead of a pair, a triple).
Word existence could be done with a trie, or more simply with a set (i.e. a hash table). Given a suitable function, you could do:
# python-ish pseudocode
def splitword(word):
    # word is a character array indexed from 0..n-1
    for i from 1 to n:
        head = word[:i]    # first i characters
        tail = word[i:]    # everything else
        if is_word(head):
            if i == n:
                return [head]   # the whole remaining string is a valid word; return it as a 1-element list
            else:
                rest = splitword(tail)
                if rest != []:  # check whether we successfully split the tail into words
                    return [head] + rest
    return []    # No successful split found, and 'word' is not a word.
Basically, just try the different break points to see if we can make words. The recursion means it will backtrack until a successful split is found.
Of course, this may not find the splits you want. You could modify this to return all possible splits (instead of merely the first found), then do some kind of weighted sum, perhaps, to prefer common words over uncommon words.
This can be a very difficult problem and there is no simple general solution (there may be heuristics that work for small subsets).
We face exactly this problem in chemistry where names are composed by concatenation of morphemes. An example is:
ethylmethylketone
where the morphemes are:
ethyl methyl and ketone
We tackle this through automata and maximum entropy and the code is available on Sourceforge
http://www.sf.net/projects/oscar3-chem
but be warned that it will take some work.
We sometimes encounter ambiguity and are still finding a good way of reporting it.
To distinguish between penIsland and penisLand would require domain-specific heuristics. The likely interpretation will depend on the corpus being used - no linguistic problem is independent from the domain or domains being analysed.
As another example the string
weeknight
can be parsed as
wee knight
or
week night
Both are "right" in that they obey the form "adj-noun" or "noun-noun". Both make "sense" and which is chosen will depend on the domain of usage. In a fantasy game the first is more probable and in commerce the latter. If you have problems of this sort then it will be useful to have a corpus of agreed usage which has been annotated by experts (technically a "Gold Standard" in Natural Language Processing).
I would use the following algorithm.
1) Start with the sorted list of words to split, and a sorted list of declined words (the dictionary).
2) Create a result list of objects, each storing: the remaining word and the list of matched words.
3) Fill the result list with the words to split as remaining words.
4) Walk through the result array and the dictionary concurrently -- always advancing the lesser of the two, in a manner similar to the merge algorithm. In this way you can compare all the possible matching pairs in one pass.
5) Any time you find a match, i.e. a word to split that starts with a dictionary word, replace it in the result list with the matched dictionary word and the remaining part. You have to take into account possible multiple matches.
6) Any time the remaining part is empty, you have found a final result.
7) Any time you don't find a match on the "left side", in other words, every time you increment the result pointer because of no match, delete the corresponding result item. This word has no matches and can't be split.
8) Once you get to the bottom of the lists, you will have a list of partial results. Repeat the loop until this is empty -- go to point 4.