Creating a Non-greedy LZW algorithm

Basically, I'm doing an IB Extended Essay for Computer Science, and was thinking of using a non-greedy implementation of the LZW algorithm. I found the following links:
https://pdfs.semanticscholar.org/4e86/59917a0cbc2ac033aced4a48948943c42246.pdf
http://theory.stanford.edu/~matias/papers/wae98.pdf
And have been operating under the assumption that the algorithm described in paper 1 and the LZW-FP in paper 2 are essentially the same. Either way, tracing the pseudocode in paper 1 has been a painful experience that has yielded nothing, and in the words of my teacher "is incredibly difficult to understand." If anyone can figure out how to trace it, or happens to have studied the algorithm before and knows how it works, that'd be a great help.

Note: I refer to what you call "paper 1" as Horspool 1995 and "paper 2" as Matias et al 1998. I only looked at the LZW algorithm in Horspool 1995, so if you were referring to the LZSS algorithm this won't help you much.
My understanding is that Horspool's algorithm is what the authors of Matias et al 1998 call "LZW-FPA", which is different from what they call "LZW-FP"; the difference has to do with the way the algorithm decides which substrings to add to the dictionary. Since "LZW-FP" adds exactly the same substrings to the dictionary as LZW would add, LZW-FP cannot produce a longer compressed sequence for any string. LZW-FPA (and Horspool's algorithm) add the successor string of the greedy match at each output cycle. That's not the same substring (because the greedy match doesn't start at the same point as it would in LZW) and therefore it is theoretically possible that it will produce a longer compressed sequence than LZW.
Horspool's algorithm is actually quite simple, but it suffers from the fact that there are several silly errors in the provided pseudo-code. Implementing the algorithm is a good way of detecting and fixing these errors; I put an annotated version of the pseudocode below.
LZW-like algorithms decompose the input into a sequence of blocks. The compressor maintains a dictionary of available blocks (with associated codewords). Initially, the dictionary contains all single-character strings. It then steps through the input, at each point finding the longest prefix at that point which is in its dictionary. Having found that block, it outputs its codeword, and adds to the dictionary the block with the next input character appended. (Since the block found was the longest prefix in the dictionary, the block plus the next character cannot be in the dictionary.) It then advances over the block, and continues at the next input point (which is just before the last character of the block it just added to the dictionary).
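(Not from either paper, just for concreteness: a minimal LZW compressor sketch in Python, operating on byte strings; the function name and the linear prefix search are my own choices.)

def lzw_compress(data):
    # dictionary maps byte strings to codewords; start with all single characters
    dictionary = {bytes([b]): b for b in range(256)}
    next_code = 256
    output = []
    pos = 0
    while pos < len(data):
        # Find the longest prefix of the remaining input that is in the dictionary.
        length = 1
        while pos + length <= len(data) and data[pos:pos + length] in dictionary:
            length += 1
        length -= 1
        output.append(dictionary[data[pos:pos + length]])
        # Add that block, extended by the next input character, to the dictionary.
        if pos + length < len(data):
            dictionary[data[pos:pos + length + 1]] = next_code
            next_code += 1
        pos += length   # advance over the block
    return output

print(lzw_compress(b"abababab"))  # [97, 98, 256, 258, 98]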
Horspool's modification also finds the longest prefix at each point, and also adds that prefix extended by one character into the dictionary. But it does not immediately output that block. Instead, it considers prefixes of the greedy match, and for each one works out what the next greedy match would be. That gives it a candidate extent of two blocks; it chooses the extent with the best advance. In order to avoid using up too much time in this search, the algorithm is parameterised by the number of prefixes it will test, on the assumption that much shorter prefixes are unlikely to yield longer extents. (And Horspool provides some evidence for this heuristic, although you might want to verify that with your own experimentation.)
In Horspool's pseudocode, α is what I call the "candidate match" -- that is, the greedy match found at the previous step -- and βj is the greedy successor match for the input point after the jth prefix of α. (Counting from the end, so β0 is precisely the greedy successor match of α, with the result that setting K to 0 will yield the LZW algorithm. I think Horspool mentions this fact somewhere.) L is just the length of α. The algorithm will end up using some prefix of α, possibly (usually) all of it.
Here's Horspool's pseudocode from Figure 2 with my annotations:
initialize dictionary D with all strings of length 1;
set α = the string in D that matches the first symbol of the input;
set L = length(α);
while more than L symbols of input remain do
begin
    // The new string α++head(β0) must be added to D here, rather
    // than where Horspool adds it. Otherwise, it is not available for the
    // search for a successor match. Of course, head(β0) is not meaningful here
    // because β0 doesn't exist yet, but it's just the symbol following α in
    // the input.
    for j := 0 to max(L-1,K) do
        // The above should be min(L - 1, K), not max.
        // (Otherwise, K would be almost irrelevant.)
        find βj, the longest string in D that matches
            the input starting L-j symbols ahead;
    add the new string α++head(β0) to D;
    // See above; the new string must be added before the search.
    set j = value of j in range 0 to max(L-1,K)
        such that L - j + length(βj) is a maximum;
    // Again, min rather than max.
    output the index in D of the string prefix(α,j);
    // Here Horspool forgets that j is the number of characters removed
    // from the end of α, not the number of characters in the desired prefix.
    // So j should be replaced with L - j.
    advance j symbols through the input;
    // Again, the advance should be L - j, not j.
    set α = βj;
    set L = length(α);
end;
output the index in D of string α;
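For what it's worth, here is a rough Python sketch of the algorithm with the corrections above applied. The names compress_nongreedy and longest_match are mine, and the linear longest-prefix search is only for clarity; a trie would be needed for reasonable speed. As noted above, setting K to 0 should reproduce plain LZW output.

def longest_match(data, pos, dictionary):
    # Return (length, code) of the longest dictionary string starting at pos.
    # Incremental search works because LZW-style dictionaries are prefix-closed.
    best_len, best_code = 0, None
    length = 1
    while pos + length <= len(data) and data[pos:pos + length] in dictionary:
        best_len, best_code = length, dictionary[data[pos:pos + length]]
        length += 1
    return best_len, best_code

def compress_nongreedy(data, K=2):
    dictionary = {bytes([b]): b for b in range(256)}
    next_code = 256
    output = []
    if not data:
        return output
    pos = 0
    L, _ = longest_match(data, pos, dictionary)   # initial greedy match α
    while pos + L < len(data):
        # Add α extended by the next input symbol *before* the successor search.
        new_entry = data[pos:pos + L + 1]
        if new_entry not in dictionary:           # defensive; the entry should always be new
            dictionary[new_entry] = next_code
            next_code += 1
        # Try shortening α by j symbols, j = 0 .. min(L-1, K), and keep the j
        # whose shortened match plus greedy successor advances furthest.
        best_j, best_advance, best_succ_len = 0, -1, 0
        for j in range(min(L - 1, K) + 1):
            succ_len, _ = longest_match(data, pos + L - j, dictionary)
            if (L - j) + succ_len > best_advance:
                best_j, best_advance, best_succ_len = j, (L - j) + succ_len, succ_len
        keep = L - best_j                         # emit prefix(α, L - j)
        output.append(dictionary[data[pos:pos + keep]])
        pos += keep
        L = best_succ_len                         # the successor match becomes the new α
    output.append(dictionary[data[pos:pos + L]])
    return output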

Related

Time Complexity of Text Justification with Dynamic Programming

I've been working on a dynamic programming problem involving the justification of text. I believe that I have found a working solution, but I am confused regarding this algorithm's runtime.
The research I have done thus far describes dynamic programming solutions to this problem as O(N^2), with N the length of the text being justified. To me, this feels incorrect: I can see that O(N) calls must be made because there are N suffixes to check; however, for any given prefix we will never consider placing the newline (or 'split_point') beyond the maximum line length L. Therefore, for any given piece of text there are at most L positions to place the split point (this assumes the worst case: that each word is exactly one character long). Because of this realization, isn't this algorithm more accurately described as O(LN)?
import math
import sys

@memoize   # some memoization decorator, not shown here
def justify(text, line_length):
    # If the text is less than the line length, do not split
    if len(' '.join(text)) < line_length:
        return [], math.pow(line_length - len(' '.join(text)), 3)
    best_cost, best_splits = sys.maxsize, []
    # Iterate over text and consider putting split between each word
    for split_point in range(1, len(text)):
        length = len(' '.join(text[:split_point]))
        # This split exceeded maximum line length: all future split points unacceptable
        if length > line_length:
            break
        # Recursively compute the best split points of text after this point
        future_splits, future_cost = justify(text[split_point:], line_length)
        cost = math.pow(line_length - length, 3) + future_cost
        if cost < best_cost:
            best_cost = cost
            best_splits = [split_point] + [split_point + n for n in future_splits]
    return best_splits, best_cost
Thanks in advance for your help,
Ethan
First of all, your implementation is going to be far, far from the theoretical efficiency that you want. You are memoizing a string of length N in your call, which means that looking for a cached copy of your data is potentially O(N). Now start making multiple cached calls and you've blown your complexity budget.
This is fixable by moving the text outside of the function call and just passing around the index of the starting position and the length L. You are also doing a join inside of your loop, which is an O(L) operation. With some care you can make that an O(1) operation instead.
With that done, you would be doing O(N*L) operations. For exactly the reasons you thought.
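Here is a rough sketch of that rewrite (the names and the prefix-sum trick are mine, and it uses the same cubic cost function as the question):

import sys
from functools import lru_cache

def justify(words, line_length):
    n = len(words)
    # prefix_len[i] = characters in words[0:i] joined by single spaces
    prefix_len = [0] * (n + 1)
    for i, w in enumerate(words):
        prefix_len[i + 1] = prefix_len[i] + len(w) + (1 if i else 0)

    def width(i, j):
        # Length of ' '.join(words[i:j]), computed in O(1).
        return prefix_len[j] - prefix_len[i] - (1 if i else 0)

    @lru_cache(maxsize=None)
    def best(i):
        # Minimum cost of justifying words[i:]; returns (cost, next split or None).
        if width(i, n) <= line_length:
            return (line_length - width(i, n)) ** 3, None
        best_cost, best_split = sys.maxsize, None
        for j in range(i + 1, n):
            w = width(i, j)
            if w > line_length:
                break
            cost = (line_length - w) ** 3 + best(j)[0]
            if cost < best_cost:
                best_cost, best_split = cost, j
        return best_cost, best_split

    # Walk the memo table to recover the split points.
    splits, i = [], 0
    while best(i)[1] is not None:
        i = best(i)[1]
        splits.append(i)
    return splits, best(0)[0]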

Check if given string can be created by a set of characters cut out from magazine article

"Observe that when you cut a character out of a magazine, the character on the reverse side of the page is also removed. Give an algorithm to determine whether you can generate a given string by pasting cutouts from a given magazine. Assume that you are given a function that will identify the character and its position on the reverse side of the page for any given character position."
How can I do it?
I can do some initial pruning: if a needed character can only be picked up in one way, take it before handing the sub-problem to a dynamic programming approach. But what comes after this initial pruning?
What is the time and space complexity?
As @LiKao suggested, this can be solved using max flow. To construct the network we make two "layers" of vertices: one with all the distinct characters in the input string and one with each position on the page. Make an edge with capacity 1 from a character to a position if that position has that character on one side. Make edges of capacity 1 from each position to the sink, and make edges from the source to each character with capacity equal to the multiplicity of that character in the input string.
For example, let's say we're searching for the word "FOO" on a page with four positions:
pos    1 2 3 4
front  F C O Z
back   O O K Z
We then generate the corresponding network, ignoring position 4 since it does not provide any of the required characters: edges from the source to F (capacity 1) and to O (capacity 2), character-to-position edges F→1, O→1, O→2 and O→3 of capacity 1, and edges of capacity 1 from positions 1, 2 and 3 to the sink.
Now, we only need to determine whether there is a flow from the source to the sink of value len("FOO") = 3.
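A quick way to sanity-check the construction (my own sketch; it uses networkx purely for convenience, any max-flow routine would do):

from collections import Counter
import networkx as nx

def can_cut(target, pages):
    # pages is a list of (front_char, back_char) pairs, one per cutout position.
    if not target:
        return True
    need = Counter(target)
    G = nx.DiGraph()
    G.add_node('sink')
    for ch, count in need.items():
        G.add_edge('source', ('char', ch), capacity=count)
    for i, (front, back) in enumerate(pages):
        for ch in {front, back}:
            if ch in need:
                G.add_edge(('char', ch), ('pos', i), capacity=1)
        G.add_edge(('pos', i), 'sink', capacity=1)
    value, _ = nx.maximum_flow(G, 'source', 'sink')
    return value == len(target)

print(can_cut("FOO", [('F', 'O'), ('C', 'O'), ('O', 'K'), ('Z', 'Z')]))  # True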
You can use dynamic programming directly.
We are given string s with n letters. We are given a set of pieces P = {p_1, ..., p_k}. Each piece has one letter in the front p_i.f and one in the back p_i.b.
Denote with f(j, p) the function that returns true if it is feasible to create substring s_1...s_j using pieces in p \subseteq P, and false otherwise.
The following recurrence holds:
f(n, P) = f(n-1, P-p_1) | f(n-1, P-p_2) | ... | f(n-1, P-p_k)
In plain English: the feasibility of s using all pieces in P depends on the feasibility of the substring s_1...s_{n-1} given one less piece, and we try removing all possible pieces (of course in practice we do not have to remove all pieces one by one; we only need to remove those pieces for which p_i.f == s_n || p_i.b == s_n).
The initial condition is that f(1, P-p_1) = f(1, P-p_2) = ... = true, assuming that we have already checked a priori (in linear time) that there are enough letters in P to cover all the letters in s.
While this problem can be formulated as a max-flow problem as shown in the accepted answer, it is simpler and more efficient to formulate it as a maximum-cardinality matching problem in a bipartite graph. General max-flow algorithms like Dinic's are slower than special-case algorithms such as the Hopcroft–Karp algorithm.
The bipartite graph is formed by adding two edges from every character of the given string to a cutout, one edge for each side. We then run Hopcroft–Karp. In the end, we simply check whether the cardinality of the matching is equal to the length of the string.
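To illustrate the construction (my own sketch, using networkx's Hopcroft–Karp implementation rather than the Scala/JGraphT code referenced below):

import networkx as nx
from networkx.algorithms import bipartite

def can_cut_matching(target, pages):
    # pages is a list of (front_char, back_char) pairs, one per cutout.
    G = nx.Graph()
    left = [('string_pos', i) for i in range(len(target))]
    G.add_nodes_from(left, bipartite=0)
    for j, (front, back) in enumerate(pages):
        G.add_node(('cutout', j), bipartite=1)
        for i, ch in enumerate(target):
            if ch in (front, back):
                G.add_edge(('string_pos', i), ('cutout', j))
    matching = bipartite.hopcroft_karp_matching(G, top_nodes=left)
    # The returned dict contains each matched pair twice (once per direction).
    return len(matching) // 2 == len(target)

print(can_cut_matching("FOO", [('F', 'O'), ('C', 'O'), ('O', 'K'), ('Z', 'Z')]))  # True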
For a working implementation (in Scala) using JGraphT, see my GitHub.
I'd like to come up with a more efficient DP solution, since Skiena's book has this problem in the DP section, but so far haven't found any.

Tokenize valid words from a long string

Suppose you have a dictionary that contains valid words.
Given an input string with all spaces removed, determine whether the string is composed of valid words or not.
You can assume the dictionary is a hashtable that provides O(1) lookup.
Some examples:
helloworld -> hello world (valid)
isitniceinhere -> is it nice in here (valid)
zxyy -> invalid
If a string has multiple possible parsings, just returning true is sufficient.
The string can be very long, so think of an algorithm that is both space and time efficient.
I think the set of all strings that occur as the concatenation of valid words (words taken from a finite dictionary) forms a regular language over the alphabet of characters. You can then build a finite automaton that accepts exactly the strings you want; computation time is O(n).
For instance, let the dictionary consist of the words {bat, bag}. Then we construct the following automaton: states are denoted by 0, 1, 2. Edges: (0,1,b), (1,2,a), (2,0,t), (2,0,g); where the triple (x,y,z) means an edge leading from x to y on input z. The only accepting state is 0. In each step, on reading the next input sign, you have to calculate the set of states that are reachable on that input. Given that the number of states in the automaton is constant, this is of complexity O(n). As for space complexity, I think you can do with O(number of words) with the hint for construction above.
For another example, with the words {bag, bat, bun, but} the automaton would have states 0, 1, 2, 3 and edges (0,1,b), (1,2,a), (1,3,u), (2,0,g), (2,0,t), (3,0,n), (3,0,t).
Supposing that the automaton has already been built (the time to do this has something to do with the length and number of words :-) we now argue that the time to decide whether a string is accepted by the automaton is O(n) where n is the length of the input string.
More formally, our algorithm is as follows:
1. Let S be a set of states, initially containing the starting state.
2. Read the next input character, let us denote it by a.
3. For each element s in S, determine the state that we move into from s on reading a; that is, the state r such that with the notation above (s,r,a) is an edge. Let us denote the set of these states by R. That is, R = {r | s in S, (s,r,a) is an edge}.
4. (If R is empty, the string is not accepted and the algorithm halts.)
5. If there are no more input symbols, check whether any of the accepting states is in R. (In our case, there is only one accepting state, the starting state.) If so, the string is accepted, if not, the string is not accepted.
6. Otherwise, take S := R and go to 2.
Now, there are as many executions of this cycle as there are input symbols. The only thing we have to examine is that steps 3 and 5 take constant time. Given that the size of S and R is not greater than the number of states in the automaton, which is constant and that we can store edges in a way such that lookup time is constant, this follows. (Note that we of course lose multiple 'parsings', but that was not a requirement either.)
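Here is a small sketch of this simulation (my own code; the automaton is represented implicitly by a trie of the dictionary, and the '$' end-of-word marker assumes '$' never occurs inside a word):

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True            # end-of-word marker
    return root

def is_concatenation(s, words):
    root = build_trie(words)
    states = {id(root): root}       # the set S of states, keyed by node identity
    for ch in s:
        next_states = {}            # the set R
        for node in states.values():
            child = node.get(ch)
            if child is None:
                continue
            next_states[id(child)] = child
            if '$' in child:        # a word just ended, so we may start a new one
                next_states[id(root)] = root
        if not next_states:
            return False            # R is empty: the string is not accepted
        states = next_states
    # Accept iff we ended exactly at a word boundary (state 0 in the answer above).
    return id(root) in states

print(is_concatenation("helloworld", {"hello", "world"}))                      # True
print(is_concatenation("isitniceinhere", {"is", "it", "nice", "in", "here"}))  # True
print(is_concatenation("zxyy", {"hello", "world"}))                            # False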
I think this is actually called the membership problem for regular languages, but I couldn't find a proper online reference.
I'd go for a recursive algorithm with implicit backtracking. Function signature: f: input -> result, with input being the string and result either true or false, depending on whether the entire string can be tokenized correctly.
Works like this:
If input is the empty string, return true.
Look at the length-one prefix of input (i.e., the first character). If it is in the dictionary, run f on the suffix of input. If that returns true, return true as well.
If the length-one prefix from the previous step is not in the dictionary, or the invocation of f in the previous step returned false, make the prefix longer by one and repeat at step 2. If the prefix cannot be made any longer (already at the end of the string), return false.
Rinse and repeat.
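A direct transcription of this scheme (my own sketch; the dictionary is a Python set):

def can_tokenize(s, dictionary):
    # Returns True iff s can be split into a sequence of dictionary words.
    if s == "":
        return True
    for end in range(1, len(s) + 1):
        if s[:end] in dictionary and can_tokenize(s[end:], dictionary):
            return True
    return False

print(can_tokenize("helloworld", {"hello", "world"}))  # True
print(can_tokenize("zxyy", {"hello", "world"}))        # False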
For dictionaries with low to moderate amount of ambiguous prefixes, this should fetch a pretty good running time in practice (O(n) in the average case, I'd say), though in theory, pathological cases with O(2^n) complexity can probably be constructed. However, I doubt we can do any better since we need backtracking anyways, so the "instinctive" O(n) approach using a conventional pre-computed lexer is out of the question. ...I think.
EDIT: the estimate for the average-case complexity is likely incorrect, see my comment.
Space complexity would be only stack space, so O(n) even in the worst-case.

Algorithm to find length of longest sequence of blanks in a given string

Looking for an algorithm to find the length of the longest sequence of blanks in a given string, examining as few characters as possible.
Hint: your program should become faster as the length of the sequence of blanks increases.
I know the O(n) solution, but I'm looking for something more optimal.
You won't be able to find a solution with complexity smaller than O(n), because in the worst case you need to pass through every character, e.g. when the input never has two consecutive blanks, or consists entirely of blanks.
You can do some optimizations though, but it'll still be considered O(n).
For example:
Let M be the current longest match so far as you go through your list. Also assume you can access input elements in O(1), for example you have an array as input.
When you see a non-whitespace character at position current, you can skip ahead if the character at current + M is also non-whitespace: surely no whitespace run longer than M can be contained in between.
And when you see a whitespace character, if the character at current + M - 1 is not whitespace, you know this cannot be the longest run, so you can skip in that case as well.
But in the worst case (when all characters are blank) you have to examine every character. So it can't be better than O(n) in complexity.
Rationale: assume the whole string is blank, and suppose your algorithm outputs n without having examined all n characters. Then if any non-examined character were not blank, your answer would be wrong. So for this particular input you have to examine the whole string.
There's no way to make it faster than O(N) in the worst case. However, here are a few optimizations, assuming 0-based indexing.
If you already have a complete sequence of L blanks (by complete I mean a sequence that is not a subsequence of a larger sequence), and L is at least as large as half the size of your string, you can stop.
If you have a complete sequence of L blanks, once you hit a space at position i check if the character at position i + L is also a space. If it is, continue scanning from position i forwards as you might find a larger sequence - however, if you encounter a non-space until position i + L, then you can skip directly to i + L + 1. If it isn't a space, there's no way you can build a larger sequence starting at i, so scan forwards starting from i + L + 1.
If you have a complete sequence of blanks of length L, and you are at position i and you have k positions left to examine, and k <= L, you can stop your search, as obviously there's no way you'll be able to find anything better anymore.
To prove that you can't make it faster than O(N), consider a string that contains no spaces. You will have to access each character once, so it's O(N). Same with a string that contains nothing but spaces.
The obvious idea: you can jump by K+1 places (where K is the current longest space sequence) and scan back if you found a space.
This way you have something like (n + n/M)/2 = n(M+1)/(2M) positions checked.
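A sketch of that idea (my own code; M is tracked as best below):

def longest_blank_run(s):
    best = 0
    i = 0
    n = len(s)
    while i < n:
        if s[i] != ' ':
            # No run longer than best can fit strictly between two probes
            # that are best + 1 apart, so jump ahead.
            i += best + 1
        else:
            # Scan backwards and forwards to measure the run containing i.
            lo = i
            while lo > 0 and s[lo - 1] == ' ':
                lo -= 1
            hi = i
            while hi + 1 < n and s[hi + 1] == ' ':
                hi += 1
            best = max(best, hi - lo + 1)
            i = hi + best + 1   # earliest probe that a longer run must contain
    return best

print(longest_blank_run("ab   cd  e      f"))  # 6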
Edit:
Another idea would be to apply a kind of binary search. This is like follows: for a given k you make a procedure that checks whether there is a sequence of spaces with length >= k. This can be achieved in O(n/k) steps. Then, you try to find the maximal k with binary search.
Edit:
During the consequent searches, you can utilize the knowledge that the sequence of some length k already exist, and start skipping at k from the very beginning.
Whatever you do, the worst case will always be O(n), for example when the blanks are in the last part of the string (or the last "checked" part of the string).

Finding dictionary words

I have a lot of compound strings that are a combination of two or three English words.
e.g. "Spicejet" is a combination of the words "spice" and "jet"
I need to separate these individual English words from such compound strings. My dictionary is going to consist of around 100000 words.
What would be the most efficient way to separate individual English words from such compound strings?
I'm not sure how much time or frequency you have to do this (is it a one-time operation? daily? weekly?) but you're obviously going to want a quick, weighted dictionary lookup.
You'll also want to have a conflict resolution mechanism, perhaps a side-queue to manually resolve conflicts on tuples that have multiple possible meanings.
I would look into Tries. Using one you can efficiently find (and weight) your prefixes, which are precisely what you will be looking for.
You'll have to build the Tries yourself from a good dictionary source, and weight the nodes on full words to provide yourself a good quality mechanism for reference.
Just brainstorming here, but if you know your dataset consists primarily of duplets or triplets, you could probably get away with multiple Trie lookups, for example looking up 'Spic' and then 'ejet', finding that both results have a low score, and abandoning that split in favour of 'Spice' and 'Jet', where both Trie lookups would yield a good combined result.
Also I would consider utilizing frequency analysis on the most common prefixes up to an arbitrary or dynamic limit, e.g. filtering 'the' or 'un' or 'in' and weighting those accordingly.
Sounds like a fun problem, good luck!
If the aim is to find the "the largest possible break up for the input" as you replied, then the algorithm could be fairly straightforward if you use some graph theory. You take the compound word and make a graph with a vertex before and after every letter. You'll have a vertex for each index in the string and one past the end. Next you find all legal words in your dictionary that are substrings of the compound word. Then, for each legal substring, add an edge with weight 1 to the graph connecting the vertex before the first letter in the substring with the vertex after the last letter in the substring. Finally, use a shortest path algorithm to find the path with fewest edges between the first and the last vertex.
The pseudo code is something like this:
parseWords(compoundWord)
    # Make the graph
    graph = makeGraph()
    N = compoundWord.length
    for index = 0 to N
        graph.addVertex(index)
    # Add the edges for each word
    for index = 0 to N - 1
        for length = 1 to min(N - index, MAX_WORD_LENGTH)
            potentialWord = compoundWord.substr(index, length)
            if dictionary.isElement(potentialWord)
                graph.addEdge(index, index + length, 1)
    # Now find a list of edges which define the shortest path
    edges = graph.shortestPath(0, N)
    # Change these edges back into words
    result = makeList()
    for e in edges
        result.add(compoundWord.substr(e.start, e.stop - e.start))
    return result
I, obviously, haven't tested this pseudo-code, and there may be some off-by-one indexing errors, and there isn't any bug-checking, but the basic idea is there. I did something similar to this in school and it worked pretty well. The edge creation loops are O(M * N), where N is the length of the compound word, and M is the maximum word length in your dictionary or N (whichever is smaller). The shortest path algorithm's runtime will depend on which algorithm you pick. Dijkstra's comes most readily to mind. I think its runtime is O(N^2 * log(N)), since the max edges possible is N^2.
You can use any shortest path algorithm. There are several shortest path algorithms which have their various strengths and weaknesses, but I'm guessing that for your case the difference will not be too significant. If, instead of trying to find the fewest possible words to break up the compound, you wanted to find the most possible, then you give the edges negative weights and try to find the shortest path with an algorithm that allows negative weights.
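For reference, here is a runnable Python sketch of the same idea (names are mine; BFS stands in for a general shortest-path algorithm since every edge has weight 1, and max_word_length mirrors MAX_WORD_LENGTH above):

from collections import deque

def parse_words(compound, dictionary, max_word_length=30):
    n = len(compound)
    # prev[v] = (u, word): we reached position v from position u by reading `word`.
    prev = {0: None}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        if u == n:
            break
        for length in range(1, min(n - u, max_word_length) + 1):
            word = compound[u:u + length]
            v = u + length
            if word in dictionary and v not in prev:
                prev[v] = (u, word)
                queue.append(v)
    if n not in prev:
        return None                   # no decomposition into dictionary words
    words = []
    v = n
    while prev[v] is not None:
        u, word = prev[v]
        words.append(word)
        v = u
    return list(reversed(words))

print(parse_words("spicejet", {"spice", "jet", "ice"}))  # ['spice', 'jet']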
And how will you decide how to divide things? Look around the web and you'll find examples of URLs that turned out to have other meanings.
Assuming you didn't have the capitals to go on, what would you do with these (Ones that come to mind at present, I know there are more.):
PenIsland
KidsExchange
TherapistFinder
The last one is particularly problematic because the troublesome part is two words run together but is not a compound word, the meaning completely changes when you break it.
So, given a word, is it a compound word, composed of two other English words? You could have some sort of lookup table for all such compound words, but if you just examine the candidates and try to match against English words, you will get false positives.
Edit: it looks as if I am going to have to provide some examples. Words I was thinking of include:
accustomednesses != accustomed + nesses
adulthoods != adult + hoods
agreeabilities != agree + abilities
willingest != will + ingest
windlasses != wind + lasses
withstanding != with + standing
yourselves != yours + elves
zoomorphic != zoom + orphic
ambassadorships != ambassador + ships
allotropes != allot + ropes
Here is some python code to try out to make the point. Get yourself a dictionary on disk and have a go:
def opendict(dictionary=r"g:\words\words(3).txt"):
    with open(dictionary, "r") as f:
        return set(line.strip() for line in f)

if __name__ == '__main__':
    s = opendict()
    for word in sorted(s):
        if len(word) >= 10:
            for i in range(4, len(word) - 4):
                left, right = word[:i], word[i:]
                if (left in s) and (right in s):
                    if right not in ('nesses', ):
                        print(word, left, right)
It sounds to me like you want to store your dictionary in a Trie or a DAWG data structure.
A Trie already stores words as compound words. So "spicejet" would be stored as "spice*jet*", where the * denotes the end of a word. All you'd have to do is look up the compound word in the dictionary and keep track of how many end-of-word terminators you hit. From there you would then have to try each substring (in this example, we don't yet know if "jet" is a word, so we'd have to look that up).
It occurs to me that there are a relatively small number of substrings (minimum length 2) from any reasonable compound word. For example for "spicejet" I get:
'sp', 'pi', 'ic', 'ce', 'ej', 'je', 'et',
'spi', 'pic', 'ice', 'cej', 'eje', 'jet',
'spic', 'pice', 'icej', 'ceje', 'ejet',
'spice', 'picej', 'iceje', 'cejet',
'spicej', 'piceje', 'icejet',
'spiceje', 'picejet'
... 27 substrings.
So, find a function to generate all of them (slide a window of length 2, 3, 4, ..., len(yourstring) - 1 across your string) and then simply check each of those against a set or hash table.
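A tiny sketch of that enumeration (my own code):

def substrings(word, min_len=2):
    n = len(word)
    return [word[i:i + length]
            for length in range(min_len, n)       # lengths 2 .. n-1
            for i in range(n - length + 1)]

subs = substrings("spicejet")
print(len(subs))                                          # 27
print([s for s in subs if s in {"spice", "jet", "ice"}])  # ['ice', 'jet', 'spice']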
A similar question was asked recently: Word-separating algorithm. If you wanted to limit the number of splits, you would keep track of the number of splits in each of the tuples (so instead of a pair, a triple).
Word existence could be done with a trie, or more simply with a set (i.e. a hash table). Given a suitable function, you could do:
# Assumes is_word(s) checks membership in the dictionary (a set or a trie).
def splitword(word):
    # word is a string of length n; returns a list of dictionary words whose
    # concatenation is word, or [] if no such split exists.
    n = len(word)
    for i in range(1, n + 1):
        head = word[:i]   # first i characters
        tail = word[i:]   # everything else
        if is_word(head):
            if i == n:
                return [head]  # the whole word is valid; return it as a 1-element list
            rest = splitword(tail)
            if rest:           # we successfully split the tail into words
                return [head] + rest
    return []  # No successful split found, and 'word' is not a word.
Basically, just try the different break points to see if we can make words. The recursion means it will backtrack until a successful split is found.
Of course, this may not find the splits you want. You could modify this to return all possible splits (instead of merely the first found), then do some kind of weighted sum, perhaps, to prefer common words over uncommon words.
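One possible shape for that modification (my own sketch; is_word is passed in explicitly here so the snippet is self-contained):

def all_splits(word, is_word):
    # Returns every way of writing word as a concatenation of dictionary words.
    if word == "":
        return [[]]
    splits = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if is_word(head):
            for rest in all_splits(word[i:], is_word):
                splits.append([head] + rest)
    return splits

words = {"spice", "spic", "ejet", "jet"}
print(all_splits("spicejet", words.__contains__))  # [['spic', 'ejet'], ['spice', 'jet']]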
This can be a very difficult problem and there is no simple general solution (there may be heuristics that work for small subsets).
We face exactly this problem in chemistry where names are composed by concatenation of morphemes. An example is:
ethylmethylketone
where the morphemes are:
ethyl methyl and ketone
We tackle this through automata and maximum entropy and the code is available on Sourceforge
http://www.sf.net/projects/oscar3-chem
but be warned that it will take some work.
We sometimes encounter ambiguity and are still finding a good way of reporting it.
To distinguish between penIsland and penisLand would require domain-specific heuristics. The likely interpretation will depend on the corpus being used - no linguistic problem is independent from the domain or domains being analysed.
As another example the string
weeknight
can be parsed as
wee knight
or
week night
Both are "right" in that they obey the form "adj-noun" or "noun-noun". Both make "sense" and which is chosen will depend on the domain of usage. In a fantasy game the first is more probable and in commerce the latter. If you have problems of this sort then it will be useful to have a corpus of agreed usage which has been annotated by experts (technically a "Gold Standard" in Natural Language Processing).
I would use the following algorithm.
1. Start with the sorted list of words to split, and a sorted list of declined words (the dictionary).
2. Create a result list of objects, each of which stores a remaining word and a list of matched words.
3. Fill the result list with the words to split as remaining words.
4. Walk through the result array and the dictionary concurrently, always advancing the lesser of the two, in a manner similar to the merge algorithm. In this way you can compare all the possible matching pairs in one pass.
5. Any time you find a match, i.e. a word to split that starts with a dictionary word, record the matching dictionary word and the remaining part in the result list. You have to take into account possible multiples.
6. Any time the remaining part is empty, you have found a final result.
7. Any time you don't find a match on the "left side", in other words, every time you increment the result pointer because of no match, delete the corresponding result item. This word has no matches and can't be split.
8. Once you get to the bottom of the lists, you will have a list of partial results. Repeat the loop until this is empty (go to point 4).
