Tokenize valid words from a long string - algorithm

Suppose you have a dictionary that contains valid words.
Given an input string with all spaces removed, determine whether the string is composed of valid words or not.
You can assume the dictionary is a hashtable that provides O(1) lookup.
Some examples:
helloworld -> hello world (valid)
isitniceinhere -> is it nice in here (valid)
zxyy -> invalid
If a string has multiple possible parsings, just returning true is sufficient.
The string can be very long, so think of an algorithm that is both space- and time-efficient.

I think the set of all strings that occur as concatenations of valid words (words taken from a finite dictionary) forms a regular language over the alphabet of characters. You can then build a finite automaton that accepts exactly the strings you want; computation time is O(n).
For instance, let the dictionary consist of the words {bat, bag}. Then we construct the following automaton: states are denoted by 0, 1, 2. Edges: (0,1,b), (1,2,a), (2,0,t), (2,0,g); where the triple (x,y,z) means an edge leading from x to y on input z. The only accepting state is 0. In each step, on reading the next input sign, you have to calculate the set of states that are reachable on that input. Given that the number of states in the automaton is constant, this is of complexity O(n). As for space complexity, I think you can do with O(number of words) with the hint for construction above.
For another example, take the words {bag, bat, bun, but}; the automaton is built the same way, with the shared prefix b branching into the two possible second letters.
Supposing that the automaton has already been built (the time to do this has something to do with the length and number of words :-) ), we now argue that the time to decide whether a string is accepted by the automaton is O(n), where n is the length of the input string.
More formally, our algorithm is as follows:
1. Let S be a set of states, initially containing only the starting state.
2. Read the next input character; let us denote it by a.
3. For each element s in S, determine the state that we move into from s on reading a; that is, the state r such that, with the notation above, (s,r,a) is an edge. Let us denote the set of these states by R; that is, R = {r | s in S, (s,r,a) is an edge}. (If R is empty, the string is not accepted and the algorithm halts.)
4. If there are no more input symbols, check whether any of the accepting states is in R. (In our case, there is only one accepting state, the starting state.) If so, the string is accepted; if not, the string is not accepted.
5. Otherwise, take S := R and go to step 2.
Now, there are as many executions of this cycle as there are input symbols. The only thing we have to examine is that steps 3 and 5 take constant time. Given that the sizes of S and R are not greater than the number of states in the automaton, which is constant, and that we can store the edges in a way such that lookup time is constant, this follows. (Note that we of course lose the multiple 'parsings', but that was not a requirement either.)
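A sketch of the construction and the state-set simulation in Python (my own illustration of the hint above, storing the dictionary as a trie whose final edges return to the start state):

```python
def build_automaton(words):
    # State 0 is the start state and the only accepting state.  Each word
    # traces a path of trie states out of 0; the edge for its final letter
    # leads back to 0, so the automaton accepts concatenations of words.
    transitions = [dict()]   # transitions[s][c] -> set of successor states
    children = [dict()]      # deterministic trie skeleton used while building
    for word in words:
        state = 0
        for ch in word[:-1]:
            if ch not in children[state]:
                children[state][ch] = len(transitions)
                transitions.append(dict())
                children.append(dict())
            nxt = children[state][ch]
            transitions[state].setdefault(ch, set()).add(nxt)
            state = nxt
        transitions[state].setdefault(word[-1], set()).add(0)
    return transitions

def accepts(transitions, text):
    states = {0}                                       # step 1
    for ch in text:                                    # step 2
        states = {r for s in states                    # step 3: compute R
                  for r in transitions[s].get(ch, ())}
        if not states:                                 # R empty: reject early
            return False
    return 0 in states                                 # step 4: accept iff 0 in R
```

For the dictionary {bat, bag} this builds exactly the three-state automaton described above, and `accepts(t, "batbag")` follows the 0-1-2-0 cycle twice.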
I think this is actually called the membership problem for regular languages, but I couldn't find a proper online reference.

I'd go for a recursive algorithm with implicit backtracking. Function signature: f: input -> result, with input being the string and result either true or false, depending on whether the entire string can be tokenized correctly.
Works like this:
1. If input is the empty string, return true.
2. Look at the length-one prefix of input (i.e., the first character). If it is in the dictionary, run f on the remaining suffix of input. If that returns true, return true as well.
3. If the prefix from the previous step is not in the dictionary, or the invocation of f in the previous step returned false, make the prefix longer by one and repeat at step 2. If the prefix cannot be made any longer (it already spans the whole string), return false.
Rinse and repeat.
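The steps above can be sketched as (my illustration; the dictionary is any set with O(1) lookup):

```python
def can_tokenize(s, dictionary):
    # Step 1: the empty string is trivially composed of valid words.
    if s == "":
        return True
    # Steps 2-3: try prefixes of increasing length, backtracking implicitly
    # by recursing on the suffix left after each dictionary prefix.
    for end in range(1, len(s) + 1):
        if s[:end] in dictionary and can_tokenize(s[end:], dictionary):
            return True
    return False
```

For example, `can_tokenize("isitniceinhere", {"is", "it", "nice", "in", "here"})` returns true, matching the example from the question.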
For dictionaries with a low to moderate number of ambiguous prefixes, this should achieve a pretty good running time in practice (O(n) in the average case, I'd say), though in theory pathological cases with O(2^n) complexity can probably be constructed. However, I doubt we can do any better, since we need backtracking anyway, so the "instinctive" O(n) approach using a conventional pre-computed lexer is out of the question. ...I think.
EDIT: the estimate for the average-case complexity is likely incorrect, see my comment.
Space complexity would be only stack space, so O(n) even in the worst-case.

Creating a Non-greedy LZW algorithm

Basically, I'm doing an IB Extended Essay for Computer Science, and was thinking of using a non-greedy implementation of the LZW algorithm. I found the following links:
https://pdfs.semanticscholar.org/4e86/59917a0cbc2ac033aced4a48948943c42246.pdf
http://theory.stanford.edu/~matias/papers/wae98.pdf
And have been operating under the assumption that the algorithm described in paper 1 and the LZW-FP in paper 2 are essentially the same. Either way, tracing the pseudocode in paper 1 has been a painful experience that has yielded nothing, and in the words of my teacher "is incredibly difficult to understand." If anyone can figure out how to trace it, or happens to have studied the algorithm before and knows how it works, that'd be a great help.
Note: I refer to what you call "paper 1" as Horspool 1995 and "paper 2" as Matias et al 1998. I only looked at the LZW algorithm in Horspool 1995, so if you were referring to the LZSS algorithm this won't help you much.
My understanding is that Horspool's algorithm is what the authors of Matias et al 1998 call "LZW-FPA", which is different from what they call "LZW-FP"; the difference has to do with the way the algorithm decides which substrings to add to the dictionary. Since "LZW-FP" adds exactly the same substrings to the dictionary as LZW would add, LZW-FP cannot produce a longer compressed sequence for any string. LZW-FPA (and Horspool's algorithm) add the successor string of the greedy match at each output cycle. That's not the same substring (because the greedy match doesn't start at the same point as it would in LZW) and therefore it is theoretically possible that it will produce a longer compressed sequence than LZW.
Horspool's algorithm is actually quite simple, but it suffers from the fact that there are several silly errors in the provided pseudo-code. Implementing the algorithm is a good way of detecting and fixing these errors; I put an annotated version of the pseudocode below.
LZW-like algorithms decompose the input into a sequence of blocks. The compressor maintains a dictionary of available blocks (with associated codewords). Initially, the dictionary contains all single-character strings. It then steps through the input, at each point finding the longest prefix at that point which is in its dictionary. Having found that block, it outputs its codeword, and adds to the dictionary the block with the next input character appended. (Since the block found was the longest prefix in the dictionary, the block plus the next character cannot be in the dictionary.) It then advances over the block, and continues at the next input point (which is just before the last character of the block it just added to the dictionary).
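As a reference point, this baseline greedy LZW (which Horspool then modifies) can be sketched as follows; this is my own illustration, not code from either paper:

```python
def lzw_compress(data):
    # Dictionary initially contains all single-character strings.
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    output = []
    block = ""
    for ch in data:
        if block + ch in dictionary:
            block += ch                         # extend the greedy match
        else:
            output.append(dictionary[block])    # emit codeword of longest prefix
            dictionary[block + ch] = next_code  # add block + next input character
            next_code += 1
            block = ch                          # continue at the next input point
    if block:
        output.append(dictionary[block])        # emit the final block
    return output
```

For instance, `lzw_compress("ababab")` emits the codes for "a", "b", "ab", "ab", since "ab" enters the dictionary after the first mismatch.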
Horspool's modification also finds the longest prefix at each point, and also adds that prefix extended by one character into the dictionary. But it does not immediately output that block. Instead, it considers prefixes of the greedy match, and for each one works out what the next greedy match would be. That gives it a candidate extent of two blocks; it chooses the extent with the best advance. In order to avoid using up too much time in this search, the algorithm is parameterised by the number of prefixes it will test, on the assumption that much shorter prefixes are unlikely to yield longer extents. (And Horspool provides some evidence for this heuristic, although you might want to verify that with your own experimentation.)
In Horspool's pseudocode, α is what I call the "candidate match" -- that is, the greedy match found at the previous step -- and βj is the greedy successor match for the input point after the jth prefix of α. (Counting from the end, so β0 is precisely the greedy successor match of α, with the result that setting K to 0 will yield the LZW algorithm. I think Horspool mentions this fact somewhere.) L is just the length of α. The algorithm will end up using some prefix of α, possibly (usually) all of it.
Here's Horspool's pseudocode from Figure 2 with my annotations:
initialize dictionary D with all strings of length 1;
set α = the string in D that matches the first
symbol of the input;
set L = length(α);
while more than L symbols of input remain do
begin
// The new string α++head(β0) must be added to D here, rather
// than where Horspool adds it. Otherwise, it is not available for the
// search for a successor match. Of course, head(β0) is not meaningful here
// because β0 doesn't exist yet, but it's just the symbol following α in
// the input.
for j := 0 to max(L-1,K) do
// The above should be min(L - 1, K), not max.
// (Otherwise, K would be almost irrelevant.)
find βj, the longest string in D that matches
the input starting L-j symbols ahead;
add the new string α++head(β0) to D;
// See above; the new string must be added before the search
set j = value of j in range 0 to max(L-1,K)
such that L - j + length(βj) is a maximum;
// Again, min rather than max
output the index in D of the string prefix(α,j);
// Here Horspool forgets that j is the number of characters removed
// from the end of α, not the number of characters in the desired prefix.
// So j should be replaced with L - j
advance j symbols through the input;
// Again, the advance should be L - j, not j
set α = βj;
set L = length(α);
end;
output the index in D of string α;

a puzzle about definition of the time complexity

Wikipedia defines time complexity as
In computer science, the time complexity of an algorithm quantifies
the amount of time taken by an algorithm to run as a function of the
length of the string representing the input.
What is the meaning of the highlighted part?
I know an algorithm may be treated as a function, but why must its input be "the length of the string representing the input"?
The function in the bolded part means the time complexity of the algorithm, not the algorithm itself. An algorithm may be implemented in a programming language that has a function keyword, but that's something else.
Algorithm MergeSort has as input a list of 32m bits (assuming m 32-bit values). Its time complexity T(n) is a function of n = 32m, the input size, and in the worst case is bounded from above by O(n log n). MergeSort could be implemented as a function in C or JavaScript.
The definition is derived from the context of Turing machines, where you define different states. Every function you can compute with a computer is also computable with a Turing machine (I would say that a computer computes a function on the basis of a Turing machine).
Every function is just a mapping from one domain to another domain (or the same one).
Before going to Turing machines, look at the concept of finite automata. A finite automaton has finitely many states. If your input is of length n, it is possible that the automaton only needs two states, but it has to visit those states n times, where n is the length of the string.
(The answer originally included a rough sketch of the automaton here.) Our final state is C, meaning that if we end in C, the string is accepted.
Say we want to check whether the string 010101010 is accepted by our automaton: when we read a 0 in A we move to B, and if we read another 0 we move to C; if we end there on a 0, the string is accepted, otherwise we move back to A.
In a computer you represent numbers as strings of length n, and in order to compute with them you have to visit each character of the string.
Turing machines work in the same way, but finite automata are limited to regular languages. That is how big this theory is.
Did you ever try to think about how a computer computes a function like 2*x, where x is your input?
It's fun :D. Suppose I want to compute 2*4, and I represent the number in the unary numeral system because it's easy: 4 is written as 1111. You can think of a Turing machine, or a (not very advanced) computer, as a system with linear memory.
Suppose empty spots in your memory are represented with #.
With your input, the memory looks like ###1111####, where # means an empty slot, and the head of your Turing machine starts at the first 1. You move forward until you find the first #; then you step back, replace the last 1 with * (which is just a helping symbol), and write a 1 into the empty slot on the right. Then you move back, change one more 1 to *, and write another 1 on the right side. Keep doing this until all the original 1s are *s on the left and an equal number of new 1s stand on the right; now change all the *s back to 1 and you have 2*x ones. Here is the trace:
The point is that the only thing these machines remember is the state.
#####1111#######
#####111*1######
#####11**11#####
#####1***111####
#####****1111###
#####11111111###
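The rewriting above can be simulated directly in ordinary code (a sketch that mimics the trace rather than implementing a full Turing-machine simulator; the tape string is assumed to have enough trailing # slots):

```python
def double_unary(tape):
    # tape: a string like "#####1111#######" (x ones surrounded by blanks).
    x = tape.count("1")
    start = tape.index("1")
    trace = [tape]
    # Phase 1: mark one more original 1 as '*' (right to left) and write a
    # fresh 1 into the next empty slot on the right, as in the trace above.
    for k in range(1, x + 1):
        cells = list(trace[-1])
        cells[start + x - k] = "*"       # mark the next original 1
        cells[start + x + k - 1] = "1"   # write a new 1 on the right
        trace.append("".join(cells))
    # Phase 2: restore the '*'s to 1s, leaving 2x ones in total.
    trace.append(trace[-1].replace("*", "1"))
    return trace
```

Running it on the tape from the trace reproduces every intermediate configuration shown above, ending with eight 1s.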
Say we have an algorithm A. What is its time complexity?
The very easy case
If A has constant complexity, the input doesn't matter. The input could be a single value, a list, or a map from strings to lists of lists. The algorithm will run for the same amount of time: 3 seconds, or 1000 ticks, or a million years, or whatever; a constant time value, not depending on the input.
Not much complexity at all to be honest.
Adding complexity
Now let's say, for example, that A is an algorithm for sorting lists of integers. It's clear that the time needed by A now depends on the length of the list. A list of length 0 is sorted in virtually no time (apart from checking the length of the list), but this changes as the length of the input list grows.
You could say there exists a function F that maps the list length to the seconds needed by A to sort a list of that length. But wait! What if the list is already sorted? So, for simplicity, let's always assume the worst case: F maps list length to the maximum number of seconds needed by A to sort a list of that length.
You could measure in seconds, CPU cycles, ticks, or whatever. It doesn't depend on the units.
Generalizing a bit
What about all the other algorithms? How do we measure time complexity for an algorithm that cooks me a nice meal?
If you cannot define any input parameter, then we're back in the easy case: constant time. If there is some input, it is expressed as a string, so it has a length. And, similar to what has been said above, you then have a function F that maps the length of the input (as a string) to the time needed by A to process that input (in the worst case).
We call this F time complexity.
That's too simple
Yeah, I know. There is the average case and the best case, there is the big O notation and asymptotic complexity. But for explaining the bold part in the original question this is sufficient, I think.

Calculating the hash of any substring in logarithmic time

Question came up in relation to this article:
https://threads-iiith.quora.com/String-Hashing-for-competitive-programming
The author presents this algorithm for hashing a string:
F(j) = ∑_{i=0}^{j} S_i ∗ p^i
where S is our string, S_i is the character at index i, and p is a prime number we've chosen (everything taken modulo some large prime); the hash of the whole string of length n is F(n-1).
He then presents the problem of determining whether a substring of a given string is a palindrome and claims it can be done in logarithmic time through hashing.
He makes the point that we can calculate the hash from the beginning of our whole string to the right edge R of our substring:
F(R) = ∑_{i=0}^{R} S_i ∗ p^i
and observes that if we calculate the hash from the beginning up to the left edge of our substring (F(L-1)), the difference between this and our hash up to the right edge is basically the hash of our substring:
F(R) − F(L−1) = ∑_{i=L}^{R} S_i ∗ p^i
This is all fine, and I think I follow it so far. But he then immediately makes the claim that this allows us to calculate our hash (and thus determine whether our substring is a palindrome, by comparing this hash with the one generated by moving through our substring in reverse order) in logarithmic time.
I feel like I'm probably missing something obvious but how does this allow us to calculate the hash in logarithmic time?
You already know that you can calculate the difference in constant time. Let me restate the difference (I'll leave the modulo away for clarity):
diff = ∑_{i=L}^{R} S_i ∗ p^i
Note that this is not the hash of the substring because the powers of p are offset by a constant. Instead, this is (as stated in the article)
diff = Hash(S[L,R])∗p^L
To derive the hash of the substring, you have to multiply the difference with p^-L. Assuming that you already know p^-1 (this can be done in a preprocessing step), you need to calculate (p^-1)^L. With the square-and-multiply method, this takes O(log L) operations, which is probably what the author refers to.
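A sketch of the whole computation (my illustration; M is assumed prime so that p^{-1} exists by Fermat's little theorem, and Python's three-argument pow implements square-and-multiply):

```python
M = 10**9 + 7   # a large prime modulus (an assumed choice)
P = 31          # the chosen prime base p

def prefix_hashes(s):
    # prefix[i] = (S_0*p^0 + ... + S_i*p^i) mod M, i.e. F(i) from the question.
    prefix, power, acc = [], 1, 0
    for ch in s:
        acc = (acc + ord(ch) * power) % M
        power = power * P % M
        prefix.append(acc)
    return prefix

def substring_hash(prefix, L, R):
    # diff = F(R) - F(L-1) = Hash(S[L,R]) * p^L  (mod M)
    diff = (prefix[R] - (prefix[L - 1] if L > 0 else 0)) % M
    inv_p = pow(P, M - 2, M)            # p^{-1} via Fermat's little theorem
    return diff * pow(inv_p, L, M) % M  # multiply by p^{-L}: O(log L) squarings
```

The square-and-multiply step is the `pow(inv_p, L, M)` call, which is where the O(log L) cost comes from.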
This may become more efficient if your queries are sorted by L. In this case, you could calculate p^-L incrementally.

Complexity of binary search on a string

I have a sorted array of strings, e.g. ["bar", "foo", "top", "zebra"], and I want to check whether an input word is present in the array or not.
eg:
search (String[] str, String word) {
// binary search implemented + string comparison.
}
Now binary search accounts for O(logn) complexity, where n is the length of the array. So far so good.
But, at some point we need to do a string compare, which can be done in linear time.
Now the input array can contain words of different sizes. So when I am calculating the final complexity, will the final answer be O(m*logn), where m is the size of the word we want to search for in the array (in our case, "zebra")?
Yes, your thinking as well as your proposed solution are correct. You need to consider the length of the longest string, too, in the overall complexity of string searching.
A trivial String compare is an O(m) operation, where m is the length of the larger of the two strings.
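For reference, the plain O(m*logn) version being discussed is just ordinary binary search where each probe pays for a string comparison; a sketch (my own, not code from the question):

```python
def search(arr, word):
    # Standard binary search; the == and < below are O(m) string
    # comparisons, so the total cost is O(m * log n) without any
    # prefix-tracking optimisation.
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == word:
            return True
        if arr[mid] < word:
            lo = mid + 1
        else:
            hi = mid - 1
    return False
```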
But, we can improve a lot, given that the array is sorted. As user "doynax" suggests,
Complexity can be improved by keeping track of how many characters got matched during
the string comparisons, and store the present count for the lower and
upper bounds during the search. Since the array is sorted we know that
the prefix of the middle entry to be tested next must match up to at
least the minimum of the two depths, and therefore we can skip
comparing that prefix. In effect we're always either making progress
or stopping the incremental comparisons immediately on a mismatch, and
thereby never needing to keep going over old ground.
So, overall, m character comparisons will have been done by the time we reach the end of the string if the word is found, or even fewer if the search fails at an early stage.
So, the overall complexity would be O(m + log n).
I was under the impression that what the original poster said was correct: the time complexity is O(m*logn).
If you use the suggested enhancement of tracking previously matched letters to improve the time complexity (to get O(m + logn)), I believe the inputs below would break it.
arr = [“abc”, “def”, “ghi”, “nlj”, “pfypfy”, “xyz”]
target = “nljpfy”
I expect this would incorrectly match on “pfypfy”. Perhaps one of the original posters can weigh in on this; I am definitely curious to better understand what was proposed. It sounds like the matched number of letters is skipped in the next comparison.

Anagram generation - Isnt it kind of subset sum?

Anagram:
An anagram is a type of word play, the result of rearranging the
letters of a word or phrase to produce a new word or phrase, using
all the original letters exactly once;
Subset Sum problem:
The problem is this: given a set of integers, is there a non-empty
subset whose sum is zero?
For example, given the set { −7, −3, −2, 5, 8}, the answer is yes
because the subset { −3, −2, 5} sums to zero. The problem is
NP-complete.
Now say we have a dictionary of n words. The Anagram Generation problem can then be stated as: find a set of words in the dictionary (of n words) which uses up all letters of the input. So doesn't it become a kind of subset sum problem?
Am I wrong?
The two problems are similar but are not isomorphic.
In an anagram the order of the letters matters. In a subset sum, the order does not matter.
In an anagram, all the letters must be used. In a subset sum, any subset will do.
In an anagram, the subgroups must form words taken from a comparatively small dictionary of allowable words. In a subset sum, the groups are unrestricted (there is no dictionary of allowable groupings).
If you'd prove that solving anagram finding (not more than polynomial number of times) solves subset sum problem - it would be a revolution in computer science (you'd prove P=NP).
Clearly, finding anagrams is a polynomial-time problem:
Checking whether two records are anagrams of each other is as simple as sorting their letters and comparing the resulting strings (that is C*s*log(s) time, where s is the number of letters in a record). You'll have at most n such checks, where n is the number of records in the dictionary. So the running time, ~ C*s*log(s)*n, is obviously bounded by a polynomial of the input size - your input record and dictionary combined.
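As a sketch in Python (my own illustration of this check, not code from the answer):

```python
def is_anagram(a, b):
    # Sort the letters and compare: C*s*log(s) time per check.
    return sorted(a) == sorted(b)

def find_dictionary_anagrams(record, dictionary):
    # At most n checks, one per dictionary record: ~ C*s*log(s)*n total.
    return [entry for entry in dictionary if is_anagram(entry, record)]
```

For example, searching for anagrams of "listen" in a small dictionary returns every entry with the same multiset of letters.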
EDIT:
All the above is valid only if the anagram-finding problem is defined as finding an anagram of the input phrase in a dictionary of possible complete phrases.
While the wording of the anagram finding problem in the original question above...
Now say we have a dictionary of n words. Now Anagram Generation problem can be stated as to find a set of words in dictionary(of n words) which use up all letters of the input.
...seems to imply something different - e.g. a possibility that some sort of composition of more than one entry in a dictionary is also a valid choice for a possible anagram of the input.
This, however, seems immediately problematic and unclear, because (1) usually a phrase is not just a sequence of random words (it should make sense as a whole phrase), and (2) usually words in a phrase require separators, which are also symbols - so it is not clear whether separators (whitespace characters) are required in the input to allow separate dictionary entries, and whether separators are allowed within a single dictionary entry.
So in my initial answer above I applied a "semantic razor" by interpreting the problem definition the only way it is unambiguous and makes sense as an "anagram finding".
But also we might interpret the authors definition like this:
Given the dictionary of n letter sequences (separate dictionary entries may contain same sequences) and one target letter sequence - find any subset of the dictionary entries that if concatenated together would be exact rearrangement of the target letter sequence OR determine that such subset does not exist.
^^^- Even though this problem no longer really makes perfect sense as an "anagram finding problem" still it is interesting. It is very different problem to what I considered above.
One more thing remains unclear: the flexibility of the alphabet. To be specific, the problem definition must also state whether the set of letters is fixed, or whether it may be redefined for each new instance of the problem when specifying the dictionary and target sequence. That's important: capabilities and complexity depend on it.
The variant of this problem with the ability to define the alphabet (available number of letters) for each solution individually actually is equivalent to a subset sum problem. That makes it NP-complete.
I can prove the equivalence of our problem to a natural number variant of subset sum problem defined as
Given the collection (multiset) of natural numbers (repeated numbers allowed) and the target natural number - find any sub-collection that sums exactly to the target number OR determine that such sub-collection does not exist.
It is not hard to see that a roughly linear number of steps is enough to translate one problem's input to the other's and vice versa. So a solution of one problem translates to exactly one solution of the other, with roughly linear overhead.
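The subset-sum-to-dictionary direction of this translation can be sketched over a single-letter alphabet (my own illustration of the idea, with a brute-force checker for small instances; this is not the full equivalence proof): each number becomes a block of that many copies of one letter, so a sub-collection concatenates to a rearrangement of the target string exactly when its numbers sum to the target.

```python
from itertools import combinations

def subset_sum_as_anagram(numbers, target):
    # Translate the subset-sum instance: k -> 'a'*k, target -> 'a'*target.
    dictionary = ["a" * k for k in numbers]
    goal = "a" * target
    # Brute-force search (exponential; for illustration only) for a
    # sub-collection whose concatenation is a rearrangement of the goal.
    for r in range(1, len(dictionary) + 1):
        for combo in combinations(range(len(dictionary)), r):
            if sorted("".join(dictionary[i] for i in combo)) == sorted(goal):
                return [numbers[i] for i in combo]
    return None
```

Indices rather than strings are combined so that repeated numbers (a multiset, as the definition above allows) are handled correctly.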
This positive-only variant of subset-sum is equivalent to zero-sum subset-sum variant given by the author (see e.g. Subset Sum Wikipedia article).
I think you are wrong.
Anagram Generation must be simpler than Subset Sum, because I can devise a trivial O(n) algorithm to solve it (as defined):
initialize the list of anagrams to an empty list
iterate the dictionary word by word
if all the input letters are used in the ith word
add the word to the list of anagrams
return the list of anagrams
Also, anagrams consist of valid words that are permutations of the input word (i.e. rearrangements), whereas subsets have no concept of order. Subsets may actually contain fewer elements than the input set (hence *sub*set), but an anagram must always be the same length as the input word.
It isn't NP-complete, because given a single set of letters, the set of anagrams remains identical regardless.
There is always a single mapping that transforms the letters of the input L to a set of anagrams A, so we can say that f(L) = A for any execution of f. I believe, if I understand correctly, that this makes the function deterministic. The order of a set is irrelevant, so considering a differently ordered solution non-deterministic is invalid; it is also invalid because all entries in a dictionary are unique and thus can be deterministically ordered.
