How to find the correct "word part" records that make up an input "word string", given a word part dataset? - algorithm

In agglutinative languages, "words" is a fuzzy concept. Some agglutinative languages are like Turkish, Inuktitut, and many Native American languages (amongst others). In them, "words" are often/usually composed of a "base", and multiple prefixes/suffixes. So you might have ama-ebi-na-mo-kay-i-mang-na (I just made that up), where ebi is the base, and the rest are affixes. Let's say this means "walking early in the morning when the birds start singing", ama/early ebi/walk na/-ing mo/during kay/bird i/plural mang/sing na-ing. These words can get quite long, like 30+ "letters".
So I was playing around with creating a "dictionary" for a language like this, but it's not realistic to write definitions or "dictionary entries" as your typical English "words", because there are a possibly infinite number of words! (All combinations of prefixes/bases/suffixes). So instead, I was trying to think maybe you could have just these "word parts" in the database (prefixes/suffixes/bases, which can't stand by themselves actually in the real spoken language, but are clearly distinct in terms of adding meaning). By having a database of word parts, you would then (in theory) query by passing as input a long say 20-character "word", and it would figure out how to break this word down into word parts because of the database (somehow).
That is, it would take amaebinamokayimangna as input, and know that it can be broken down into ama-ebi-na-mo-kay-i-mang-na, and then it simply queries the database for those parts to return whatever metadata is associated with those parts.
What would you need to do to accomplish this basically, at a high level? Assuming you had a database (SQL or just in a text file) containing these affixes and bases, how could you take the input and know that it breaks down into these parts organized in this way? Maybe it turns out there is are other parts in the DB which can be arrange like a-ma-e-bina-mo-kay-im-ang-na, which is spelled the the exact same way (if you remove the hyphens), so it would likely find that as a result too, and return it as another possible match.
The only way (naive way) I can think of solving this currently, is to break the input string into ngrams like this:
function getNgrams(str, { min = 1, max = 8 } = {}) {
const ngrams = []
const points = Array.from(str)
const n = points.length
let minSize = min
while (minSize <= max) {
for (let i = 0; i < (n - minSize + 1); i++) {
const ngram = points.slice(i, i + minSize)
ngrams.push(ngram.join(''))
}
minSize++
}
return ngrams
}
And it would then check the database if any of those ngrams exist, maybe passing in if this is a prefix (start of word), infix, or suffix (end of word) part. The database parts table would have { id, text, is_start, is_end } sort of thing. But this would be horribly inefficient and probably wouldn't work. It seems really complex how you might go about solving this.
So wondering, how would you solve this? At a high level, what is the main vision you see of how you would tackle this, either in a SQL database or some other approach?
The goal is, save to some persisted area the word parts, and how they are combined (if they are a prefix/infix/suffix), and then take as input a string which could be generated from those parts, and try and figure out what the parts are from the persisted data, and then return those parts in the correct order.

First consider the simplified problem where we have a combination of prefixes only. To be able to split this into prefixes, we would do:
Store all the prefixes in a trie.
Let's say the input has n characters. Create an array of length n (of numbers, if you need just one possible split, or sets of numbers, if you need all possible splits). We will store in this array for each index, from which positions of the input string this index can be reached by adding a prefix from the dictionary.
For each substring starting with the 1st character of the input, if it belongs to the Trie, mark the index as can be reached from 0th position (i.e. there is a path from 0th position to k-th position). Trie allows us to do this in O(n)
For all i = 2..n, if the i-th character can be reached from the beginning, repeat the previous step for the substrings starting at i, mark their end position as "can be reached from (i-1)th position" as appropriate (i.e. there is a path from (i-1)th position to ((i-1)+k)th position).
At the end, we can traverse these indices backwards, starting at the end of the array. Each time we jump to an index stored in the array, we are skipping a prefix in the dictionary. Each path from the last position to the first position gives us a possible split. Since we repeated the 4-th step only for positions that can be reached from the 0-th position, all paths are guaranteed to end up at the 0-th position.
Building the array takes O(n^2) time (assuming we have the trie built already). Traversing the array to find all possible splits is O(n*s), where s is the number of possible splits. In any case, we can say if there is a possible split as soon as we have built the array.
The problem with prefixes, suffixes and base words is a slight modification of the above:
Build the "previous" indices for prefixes and "next" for suffixes (possibly starting from the end of the input and tracking the suffixes backwards).
For each base word in the string (all of which we can also find efficiently -O(n^2)- using a trie) see if the starting position can be reached from the left using prefixes, and end position can be reached from right using suffixes. If yes, you have a split.
As you can see, the keywords are trie and dynamic programming. The problem of finding only a single split requires O(n^2) time after the tries are built. Tries can be built in O(m) time where m is the total length of added strings.

Related

scrabble solving with maximum score

I was asked a question
You are given a list of characters, a score associated with each character and a dictionary of valid words ( say normal English dictionary ). you have to form a word out of the character list such that the score is maximum and the word is valid.
I could think of a solution involving a trie made out of dictionary and backtracking with available characters, but could not formulate properly. Does anyone know the correct approach or come up with one?
First iterate over your letters and count how many times do you have each of the characters in the English alphabet. Store this in a static, say a char array of size 26 where first cell corresponds to a second to b and so on. Name this original array cnt. Now iterate over all words and for each word form a similar array of size 26. For each of the cells in this array check if you have at least as many occurrences in cnt. If that is the case, you can write the word otherwise you can't. If you can write the word you compute its score and maximize the score in a helper variable.
This approach will have linear complexity and this is also the best asymptotic complexity you can possibly have(after all the input you're given is of linear size).
Inspired by Programmer Person's answer (initially I thought that approach was O(n!) so I discarded it). It needs O(nr of words) setup and then O(2^(chars in query)) for each question. This is exponential, but in Scrabble you only have 7 letter tiles at a time; so you need to check only 128 possibilities!
First observation is that the order of characters in query or word doesn't matter, so you want to process your list into a set of bag of chars. A way to do that is to 'sort' the word so "bac", "cab" become "abc".
Now you take your query, and iterate all possible answers. All variants of keep/discard for each letter. It's easier to see in binary form: 1111 to keep all, 1110 to discard the last letter...
Then check if each possibility exists in your dictionary (hash map for simplicity), then return the one with the maximum score.
import nltk
from string import ascii_lowercase
from itertools import product
scores = {c:s for s, c in enumerate(ascii_lowercase)}
sanitize = lambda w: "".join(c for c in w.lower() if c in scores)
anagram = lambda w: "".join(sorted(w))
anagrams = {anagram(sanitize(w)):w for w in nltk.corpus.words.words()}
while True:
query = input("What do you have?")
if not query: break
# make it look like our preprocessed word list
query = anagram(sanitize(query))
results = {}
# all variants for our query
for mask in product((True, False), repeat=len(query)):
# get the variant given the mask
masked = "".join(c for i, c in enumerate(query) if mask[i])
# check if it's valid
if masked in anagrams:
# score it, also getting the word back would be nice
results[sum(scores[c] for c in masked)] = anagrams[masked]
print(*max(results.items()))
Build a lookup trie of just the sorted-anagram of each word of the dictionary. This is a one time cost.
By sorted anagram I mean: if the word is eat you represent it as aet. It the word is tea, you represent it as aet, bubble is represent as bbbelu etc
Since this is scrabble, assuming you have 8 tiles (say you want to use one from the board), you will need to maximum check 2^8 possibilities.
For any subset of the tiles from the set of 8, you sort the tiles, and lookup in the anagram trie.
There are at most 2^8 such subsets, and this could potentially be optimized (in case of repeating tiles) by doing a more clever subset generation.
If this is a more general problem, where 2^{number of tiles} could be much higher than the total number of anagram-words in the dictionary, it might be better to use frequency counts as in Ivaylo's answer, and the lookups potentially can be optimized using multi-dimensional range queries. (In this case 26 dimensions!)
Sorry, this might not help you as-is (I presume you are trying to do some exercise and have constraints), but I hope this will help the future readers who don't have those constraints.
If the number of dictionary entries is relatively small (up to a few million) you can use brute force: For each word, create a 32 bit mask. Preprocess the data: Set one bit if the letter a/b/c/.../z is used. For the six most common English characters etaoin set another bit if the letter is used twice.
Create a similar bitmap for the letters that you have. Then scan the dictionary for words where all bits that are needed for the word are set in the bitmap for the available letters. You have reduced the problem to words where you have all needed characters once, and the six most common characters twice if the are needed twice. You'll still have to check if a word can be formed in case you have a word like "bubble" and the first test only tells you that you have letters b,u,l,e but not necessarily 3 b's.
By also sorting the list of words by point values before doing the check, the first hit is the best one. This has another advantage: You can count the points that you have, and don't bother checking words with more points. For example, bubble has 12 points. If you have only 11 points, then there is no need to check this word at all (have a small table with the indexes of the first word with any given number of points).
To improve anagrams: In the table, only store different bitmasks with equal number of points (so we would have entries for bubble and blue because they have different point values, but not for team and mate). Then store all the possible words, possibly more than one, for each bit mask and check them all. This should reduce the number of bit masks to check.
Here is a brute force approach in python, using an english dictionary containing 58,109 words. This approach is actually quite fast timing at about .3 seconds on each run.
from random import shuffle
from string import ascii_lowercase
import time
def getValue(word):
return sum(map( lambda x: key[x], word))
if __name__ == '__main__':
v = range(26)
shuffle(v)
key = dict(zip(list(ascii_lowercase), v))
with open("/Users/james_gaddis/PycharmProjects/Unpack Sentance/hard/words.txt", 'r') as f:
wordDict = f.read().splitlines()
f.close()
valued = map(lambda x: (getValue(x), x), wordDict)
print max(valued)
Here is the dictionary I used, with one hyphenated entry removed for convenience.
Can we assume that the dictionary is fixed and the score are fixed and that only the letters available will change (as in scrabble) ? Otherwise, I think there is no better than looking up each word of the dictionnary as previously suggested.
So let's assume that we are in this setting. Pick an order < that respects the costs of letters. For instance Q > Z > J > X > K > .. > A >E >I .. > U.
Replace your dictionary D with a dictionary D' made of the anagrams of the words of D with letters ordered by the previous order (so the word buzz is mapped to zzbu, for instance), and also removing duplicates and words of length > 8 if you have at most 8 letters in your game.
Then construct a trie with the words of D' where the children nodes are ordered by the value of their letters (so the first child of the root would be Q, the second Z, .., the last child one U). On each node of the trie, also store the maximal value of a word going through this node.
Given a set of available characters, you can explore the trie in a depth first manner, going from left to right, and keeping in memory the current best value found. Only explore branches whose node's value is larger than you current best value. This way, you will explore only a few branches after the first ones (for instance, if you have a Z in your game, exploring any branch that start with a one point letter as A is discarded, because it will score at most 8x1 which is less than the value of Z). I bet that you will explore only a very few branches each time.

Binary String Search - minimum bin width?

I happen to be building the binary search in Python, but the question has more to do with binary search structure in general.
Let's assume I have about one thousand eligible candidates I am searching through using binary search, doing the classic approach of bisecting the sorted dataset and repeating this process in order to narrow down the eligible set to iterate over. The candidates are just strings of names,(first-last format, eg "Peter Jackson") I initially sort the set alphabetically and then proceed with bisection using something like this:
hi = len(names)
lo = 0
while lo < hi:
mid = (lo+hi)//2
midval = names[mid].lower()
if midval < query.lower():
lo = mid+1
elif midval > query.lower():
hi=mid
else:
return midval
return None
This code adapted from here: https://stackoverflow.com/a/212413/215608
Here's the thing, the above procedure assumes a single exact match or no result at all. What if the query was merely for a "Peter", but there are several peters with differing last names? In order to return all the Peters, one would have to ensure that the bisected "bins" never got so small as to except eligible results. The bisection process would have to cease and cede to something like a regex/regular old string match in order to return all the Peters.
I'm not so much asking how to accomplish this as what this type of search is called... what is a binary search with a delimited criteria for "bin size" called? Something that conditionally bisects the dataset, and once the criteria is fulfilled, falls back to some other form of string matching in order to ensure that there can effectively be a ending wildcard on the query (so a search for a "Peter" will get "Peter Jacksons" and "Peter Edwards")
Hopefully I've been clear what I mean. I realize in the typical DB scenario the names might be separated, this is just intended as a proof of concept.
I've not come across this type of two-stage search before, so don't know whether it has a well-known name. I can, however, propose a method for how it can be carried out.
Let say you've run the first stage and have found no match.
You can perform the second stage with a pair of binary searches and a special comparator. The binary searches would use the same principle as bisect_left and bisect_right. You won't be able to use those functions directly since you'll need a special comparator, but you can use them as the basis for your implementation.
Now to the comparator. When comparing the list element x against the search key k, the comparator would only use x[:len(k)] and ignore the rest of x. Thus when searching for "Peter", all Peters in the list would compare equal to the key. Consequently, bisect_left() to bisect_right() would give you the range containing all Peters in the list.
All of this can be done using O(log n) comparisons.
In your binary search you either hit an exact match OR an area where the match would be.
So in your case you need to get the upper and lower boundaries (hi lo as you call them) for the area that would include the Peter and return all the intermediate strings.
But if you aim to do something like show next words of a word you should look into Tries instead of BSTs

Help me optimize my indexed string search

So as an exercise, I'm building an algorithm to make searching for words (arbitrary sets of characters) within larger strings as fast as possible. With almost no previous knowledge of existing search algorithms, my approach so far has been the following:
Map out occurrences of pairs of characters within the larger string (Pair -> List of positions).
For each pair, store also the number of occurrences found within the larger string.
Get all character pairs within the search word.
Using the gotten pair that occurs least often in the string, check at each position for the remaining characters of the search term for a match.
That's the gist of it. I suppose I could use maps with longer characters, but for now I'm just using pairs.
Is there much else I can do to make it faster? Am I approaching this the right way?
String-search is a heavily researched topic:
http://en.wikipedia.org/wiki/String_searching_algorithm
You are thinking about finding e.g. 2 consecutive characters and storing the frequency of that combination, this is a very expensive operation even if you use balancing datastructures. I dont really see how storing consecutive characters as a preprocessing-step would help you.
So, there are obviously many many algorithms for string-search. What i find interesting is, that there are some algorithms that dont even need to scan every character in the body-of-text. Example: if you search for the word 'abbbbbc' and you find the character 'd' as the next character of the body-of-text you can immediately jump ahead 5 characters without the need to even look what they are, then if the next character is a 'b' or 'c' you obviously have to go back and look if you made a mistake in jumping, but if not then you skipped over 5 characters with no need for comparison. This is difficult to implement however and leads to the theory of finite automata.
The other answer refers to Boyer Moore algorithm. It uses pre-processing on a substring, not the search material. It is considered a fast non-indexed search.
There are other search algorithms useful for searching the same document multiple times. Consider Aho-Corasick.
However, an algorithm that I have used that is similar to which you describe is implemented in Python below. The idea is that for each element, get a Set of the indices where it appears. Then for searching, use the least common character in substring as a list of reference points. Loop through the substring for each reference point, breaking when index cannot be found.
You should test this to see if its faster than Boyer Moore or other algorithms. This one may be able to be improved. Would be a good exercise, five years late.
class IndexRegistry(object):
def __init__(self, iterable):
self._store = defaultdict(set)
for index, val in enumerate(iterable):
self._store[ val ].add( index )
def find(self, items):
_, refIndex, refItem = min(
[ ( len(self._store[item]), index, item )
for index, item in enumerate(items)
],
key=lambda el: el[0]
)
for referenceIndex in self._store[ refItem ]:
startIndex = referenceIndex - refIndex
for itemIndex, item in enumerate(items):
absIndex = startIndex + itemIndex
if absIndex not in self._store[ item ]:
break #substring broken
else: #no break
yield startIndex
def __contains__(self, items):
try:
next(self.find(items))
except StopIteration:
return False
return True

How to get a group of names all with the same first letter from an alphabetically sorted list?

I was wondering what is the best way to get a group of names given the first letter. The present application I am working in is in javascript, but I had a similar problem in another language sometimes ago. One idea I have thought of would be to do a binary search for the end of the names from a particular letter and then do another binary search for the beginning. Another idea was to take the ratio of of the distance of the given letter from the beginning and applying that ratio to find where to start the search. For example if the letter was 'e' then I would start start a quarter of the way through the list, and do some kind of search to see how close I am to the letter I need. The program will be working with several hundred names so I really didn't want to just do a for loop and search the whole thing. Also, I am interested what kind of algorithms for this are out there?
Both your approaches have their advantages and disadvantages. Binary search gives exactly O(log(N)) complexity, and your second method will give approximately O(log(N)) with some advantage for uniform distribution of names and possibly disadvantage for another type of distribution. What is better is up to your needs.
One big improvement I can propose is to index character positions while creating names list. Make simple hash map with first letters as keys and start positions as values. It will take O(N), but only once, and then you will get exact position for each letter in a constant time. For JavaScript you can do it, for example, while loading data to the page, when you walk trough the list anyway.
Guys I think we could use an approach similar to count sort.We could create an array of size 26 .This array would not be an normal array but would be an array of pointers to linked list which has the following structure.
Struct node
{
char *ptr ;
struct node *next;
};
struct node * names[26]; //Our array.
Now we would scan the list in O(n) time and corresponding to the first character we could subtract 65 (if ASCII value of letter is in the range 65 - 90).Guys i am subtracting 65 so as to fix the letter in 26 sized array.
At each location we could create a linked list and can store the corresponding words in that location.
Now suppose if we want to find all letters that begin with D we could directly do to array location 3(No need to apply hash function again) and then traverse linked list created till null is reached.
And what i think space complexity required in hashing would be same as that of above but hashing would also involve computing hash function every time when we want to insert or search for words beginning with same letter.
If the plan is to do something with the names (as opposed to just find out how many there are), then it will be necessary to scan the names that fit the criteria of matching the first letter. If so, then it seems that a binary search for the first name in the entire set is the fastest method. The "do something" part would involve scanning the names starting from the location found by the binary search. When a name is read that no longer starts with the given letter, you are done.
If you have an unsorted set of filenames then I would propose following algorithm:
1) Create two variables: 1) currently found first letter (I will call it currentLetter) 2) list of filenames which start with this letter (currentFilenames)
2) firstLetter = null
currentFilenames = [] - empty list or array
3) Iterate over filenames. If current filenames starts with currentLetter then add this filename to the currentFilenames. If it starts with letter which goes before currentLetter then assign currentLetter to the first letter of new filename and create a new currentFilenames list which consists only of one current filename.
With such an algorithm you will have at the end a letter which goes first in the alphabet and list of files starting from that letter.
Sample code (tried to write in Javascript but do not blame if I wrote anything wrong):
function GetFirstLetterAndFilenames(allFilenames) {
var currentLetter = null;
var currentFilenames = null;
for (int i = 0; i < allFilenames.length ; i++) {
var thisLetter = allFilenames[i][0];
if (currentLetter == null || thisLetter < currentLetter) {
currentLetter = thisLetter;
currentFilenames = [allFilenames[i]];
} else if (currentLetter == thisLetter) {
currentFilenames.push(allFilenames[i]);
}
}
return new {lowestLetter = currentLetter, filenames = currentFilenames};
}
Names have a funny way of not distributing themselves evenly over the alphabet, so you're probably not going to win by as much as you'd hope by predicting where to look.
But a really easy way to cut your search down by an average of two steps is as follows: if the letter is from a to m, binary search for the next letter. Then binary search from the beginning of the list only to the position you just found for the next letter. If the letter is from n to z, binary search for it. Then, again, only search the portion of the list after what you just found.
Is this worth saving two steps? Dunno. It's pretty easy to implement, but then again, two steps don't take very long. (Correctly guessing the letter would save you maybe 4 steps at best.)
Another possibility is to have bins for each letter to begin with. It starts out already sorted, and if you have to re-sort, you only have to sort within one letter, not the whole list. The downside is that if you need to manipulate the whole list frequently, you have to glue all the bins together.

Finding dictionary words

I have a lot of compound strings that are a combination of two or three English words.
e.g. "Spicejet" is a combination of the words "spice" and "jet"
I need to separate these individual English words from such compound strings. My dictionary is going to consist of around 100000 words.
What would be the most efficient by which I can separate individual English words from such compound strings.
I'm not sure how much time or frequency you have to do this (is it a one-time operation? daily? weekly?) but you're obviously going to want a quick, weighted dictionary lookup.
You'll also want to have a conflict resolution mechanism, perhaps a side-queue to manually resolve conflicts on tuples that have multiple possible meanings.
I would look into Tries. Using one you can efficiently find (and weight) your prefixes, which are precisely what you will be looking for.
You'll have to build the Tries yourself from a good dictionary source, and weight the nodes on full words to provide yourself a good quality mechanism for reference.
Just brainstorming here, but if you know your dataset consists primarily of duplets or triplets, you could probably get away with multiple Trie lookups, for example looking up 'Spic' and then 'ejet' and then finding that both results have a low score, abandon into 'Spice' and 'Jet', where both Tries would yield a good combined result between the two.
Also I would consider utilizing frequency analysis on the most common prefixes up to an arbitrary or dynamic limit, e.g. filtering 'the' or 'un' or 'in' and weighting those accordingly.
Sounds like a fun problem, good luck!
If the aim is to find the "the largest possible break up for the input" as you replied, then the algorithm could be fairly straightforward if you use some graph theory. You take the compound word and make a graph with a vertex before and after every letter. You'll have a vertex for each index in the string and one past the end. Next you find all legal words in your dictionary that are substrings of the compound word. Then, for each legal substring, add an edge with weight 1 to the graph connecting the vertex before the first letter in the substring with the vertex after the last letter in the substring. Finally, use a shortest path algorithm to find the path with fewest edges between the first and the last vertex.
The pseudo code is something like this:
parseWords(compoundWord)
# Make the graph
graph = makeGraph()
N = compoundWord.length
for index = 0 to N
graph.addVertex(i)
# Add the edges for each word
for index = 0 to N - 1
for length = 1 to min(N - index, MAX_WORD_LENGTH)
potentialWord = compoundWord.substr(index, length)
if dictionary.isElement(potentialWord)
graph.addEdge(index, index + length, 1)
# Now find a list of edges which define the shortest path
edges = graph.shortestPath(0, N)
# Change these edges back into words.
result = makeList()
for e in edges
result.add(compoundWord.substr(e.start, e.stop - e.start + 1))
return result
I, obviously, haven't tested this pseudo-code, and there may be some off-by-one indexing errors, and there isn't any bug-checking, but the basic idea is there. I did something similar to this in school and it worked pretty well. The edge creation loops are O(M * N), where N is the length of the compound word, and M is the maximum word length in your dictionary or N (whichever is smaller). The shortest path algorithm's runtime will depend on which algorithm you pick. Dijkstra's comes most readily to mind. I think its runtime is O(N^2 * log(N)), since the max edges possible is N^2.
You can use any shortest path algorithm. There are several shortest path algorithms which have their various strengths and weaknesses, but I'm guessing that for your case the difference will not be too significant. If, instead of trying to find the fewest possible words to break up the compound, you wanted to find the most possible, then you give the edges negative weights and try to find the shortest path with an algorithm that allows negative weights.
And how will you decide how to divide things? Look around the web and you'll find examples of URLs that turned out to have other meanings.
Assuming you didn't have the capitals to go on, what would you do with these (Ones that come to mind at present, I know there are more.):
PenIsland
KidsExchange
TherapistFinder
The last one is particularly problematic because the troublesome part is two words run together but is not a compound word, the meaning completely changes when you break it.
So, given a word, is it a compound word, composed of two other English words? You could have some sort of lookup table for all such compound words, but if you just examine the candidates and try to match against English words, you will get false positives.
Edit: looks as if I am going to have to go to provide some examples. Words I was thinking of include:
accustomednesses != accustomed + nesses
adulthoods != adult + hoods
agreeabilities != agree + abilities
willingest != will + ingest
windlasses != wind + lasses
withstanding != with + standing
yourselves != yours + elves
zoomorphic != zoom + orphic
ambassadorships != ambassador + ships
allotropes != allot + ropes
Here is some python code to try out to make the point. Get yourself a dictionary on disk and have a go:
from __future__ import with_statement
def opendict(dictionary=r"g:\words\words(3).txt"):
with open(dictionary, "r") as f:
return set(line.strip() for line in f)
if __name__ == '__main__':
s = opendict()
for word in sorted(s):
if len(word) >= 10:
for i in range(4, len(word)-4):
left, right = word[:i], word[i:]
if (left in s) and (right in s):
if right not in ('nesses', ):
print word, left, right
It sounds to me like you want to store you dictionary in a Trie or a DAWG data structure.
A Trie already stores words as compound words. So "spicejet" would be stored as "spicejet" where the * denotes the end of a word. All you'd have to do is look up the compound word in the dictionary and keep track of how many end-of-word terminators you hit. From there you would then have to try each substring (in this example, we don't yet know if "jet" is a word, so we'd have to look that up).
It occurs to me that there are a relatively small number of substrings (minimum length 2) from any reasonable compound word. For example for "spicejet" I get:
'sp', 'pi', 'ic', 'ce', 'ej', 'je', 'et',
'spi', 'pic', 'ice', 'cej', 'eje', 'jet',
'spic', 'pice', 'icej', 'ceje', 'ejet',
'spice', 'picej', 'iceje', 'cejet',
'spicej', 'piceje', 'icejet',
'spiceje' 'picejet'
... 26 substrings.
So, find a function to generate all those (slide across your string using strides of 2, 3, 4 ... (len(yourstring) - 1) and then simply check each of those in a set or hash table.
A similar question was asked recently: Word-separating algorithm. If you wanted to limit the number of splits, you would keep track of the number of splits in each of the tuples (so instead of a pair, a triple).
Word existence could be done with a trie, or more simply with a set (i.e. a hash table). Given a suitable function, you could do:
# python-ish pseudocode
def splitword(word):
# word is a character array indexed from 0..n-1
for i from 1 to n-1:
head = word[:i] # first i characters
tail = word[i:] # everything else
if is_word(head):
if i == n-1:
return [head] # this was the only valid word; return it as a 1-element list
else:
rest = splitword(tail)
if rest != []: # check whether we successfully split the tail into words
return [head] + rest
return [] # No successful split found, and 'word' is not a word.
Basically, just try the different break points to see if we can make words. The recursion means it will backtrack until a successful split is found.
Of course, this may not find the splits you want. You could modify this to return all possible splits (instead of merely the first found), then do some kind of weighted sum, perhaps, to prefer common words over uncommon words.
This can be a very difficult problem and there is no simple general solution (there may be heuristics that work for small subsets).
We face exactly this problem in chemistry where names are composed by concatenation of morphemes. An example is:
ethylmethylketone
where the morphemes are:
ethyl methyl and ketone
We tackle this through automata and maximum entropy and the code is available on Sourceforge
http://www.sf.net/projects/oscar3-chem
but be warned that it will take some work.
We sometimes encounter ambiguity and are still finding a good way of reporting it.
To distinguish between penIsland and penisLand would require domain-specific heuristics. The likely interpretation will depend on the corpus being used - no linguistic problem is independent from the domain or domains being analysed.
As another example the string
weeknight
can be parsed as
wee knight
or
week night
Both are "right" in that they obey the form "adj-noun" or "noun-noun". Both make "sense" and which is chosen will depend on the domain of usage. In a fantasy game the first is more probable and in commerce the latter. If you have problems of this sort then it will be useful to have a corpus of agreed usage which has been annotated by experts (technically a "Gold Standard" in Natural Language Processing).
I would use the following algorithm.
Start with the sorted list of words
to split, and a sorted list of
declined words (dictionary).
Create a result list of objects
which should store: remaining word
and list of matched words.
Fill the result list with the words
to split as remaining words.
Walk through the result array and
the dictionary concurrently --
always increasing the least of the
two, in a manner similar to the
merge algorithm. In this way you can
compare all the possible matching
pairs in one pass.
Any time you find a match, i.e. a
split words word that starts with a
dictionary word, replace the
matching dictionary word and the
remaining part in the result list.
You have to take into account
possible multiples.
Any time the remaining part is empty,
you found a final result.
Any time you don't find a match on
the "left side", in other words,
every time you increment the result
pointer because of no match, delete
the corresponding result item. This
word has no matches and can't be
split.
Once you get to the bottom of the
lists, you will have a list of
partial results. Repeat the loop
until this is empty -- go to point 4.

Resources