Cases in strings, Pseudocode - pseudocode

In pseudocode, how do I convert letters in a string between upper and lower case, and how do I switch between the two?
I am trying to convert some code that is in C# to pseudocode:

Because it's pseudocode, you can do it pretty much however you want.
For instance
test := lower(test)
Or
test ← test in lowercase
As long as the code is understandable and somewhat consistent, it's fine.
If you want to switch between lower and upper case in a string, making the text LoOk LiKe ThIs, you could do it with the following code.
for i ← 1 to length(test) do
    if i is even do
        test[i] ← lower(test[i])
    else do
        test[i] ← upper(test[i])
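If you also want something runnable to check the pseudocode against, here is a minimal Python sketch of the same idea (the function name is mine; positions are 1-based as in the loop above, and spaces count as positions, so the output differs slightly from the LoOk LiKe ThIs example):

def alternate_case(text):
    # Odd 1-based positions become uppercase, even positions lowercase,
    # mirroring the pseudocode above.
    chars = []
    for i, ch in enumerate(text, start=1):
        chars.append(ch.lower() if i % 2 == 0 else ch.upper())
    return "".join(chars)

print(alternate_case("look like this"))  # -> LoOk lIkE ThIs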

Related

Creating a Non-greedy LZW algorithm

Basically, I'm doing an IB Extended Essay for Computer Science, and was thinking of using a non-greedy implementation of the LZW algorithm. I found the following links:
https://pdfs.semanticscholar.org/4e86/59917a0cbc2ac033aced4a48948943c42246.pdf
http://theory.stanford.edu/~matias/papers/wae98.pdf
And have been operating under the assumption that the algorithm described in paper 1 and the LZW-FP in paper 2 are essentially the same. Either way, tracing the pseudocode in paper 1 has been a painful experience that has yielded nothing, and in the words of my teacher "is incredibly difficult to understand." If anyone can figure out how to trace it, or happens to have studied the algorithm before and knows how it works, that'd be a great help.
Note: I refer to what you call "paper 1" as Horspool 1995 and "paper 2" as Matias et al 1998. I only looked at the LZW algorithm in Horspool 1995, so if you were referring to the LZSS algorithm this won't help you much.
My understanding is that Horspool's algorithm is what the authors of Matias et al 1998 call "LZW-FPA", which is different from what they call "LZW-FP"; the difference has to do with the way the algorithm decides which substrings to add to the dictionary. Since "LZW-FP" adds exactly the same substrings to the dictionary as LZW would add, LZW-FP cannot produce a longer compressed sequence for any string. LZW-FPA (and Horspool's algorithm) add the successor string of the greedy match at each output cycle. That's not the same substring (because the greedy match doesn't start at the same point as it would in LZW) and therefore it is theoretically possible that it will produce a longer compressed sequence than LZW.
Horspool's algorithm is actually quite simple, but it suffers from the fact that there are several silly errors in the provided pseudo-code. Implementing the algorithm is a good way of detecting and fixing these errors; I put an annotated version of the pseudocode below.
LZW-like algorithms decompose the input into a sequence of blocks. The compressor maintains a dictionary of available blocks (with associated codewords). Initially, the dictionary contains all single-character strings. It then steps through the input, at each point finding the longest prefix at that point which is in its dictionary. Having found that block, it outputs its codeword, and adds to the dictionary the block with the next input character appended. (Since the block found was the longest prefix in the dictionary, the block plus the next character cannot be in the dictionary.) It then advances over the block, and continues at the next input point (which is just before the last character of the block it just added to the dictionary).
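As a point of reference, here is a small Python sketch of that plain LZW step (my own code, assuming byte-oriented text input and integer codewords; this is standard LZW, not Horspool's variant):

def lzw_compress(data):
    # The dictionary initially contains all single-character strings.
    dictionary = {chr(c): c for c in range(256)}
    next_code = 256
    output = []
    i = 0
    while i < len(data):
        # Greedy step: longest prefix of the remaining input that is in the dictionary.
        block = data[i]
        while i + len(block) < len(data) and block + data[i + len(block)] in dictionary:
            block += data[i + len(block)]
        output.append(dictionary[block])
        # Add the block extended by the next input character to the dictionary.
        if i + len(block) < len(data):
            dictionary[block + data[i + len(block)]] = next_code
            next_code += 1
        i += len(block)   # advance over the block
    return output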
Horspool's modification also finds the longest prefix at each point, and also adds that prefix extended by one character into the dictionary. But it does not immediately output that block. Instead, it considers prefixes of the greedy match, and for each one works out what the next greedy match would be. That gives it a candidate extent of two blocks; it chooses the extent with the best advance. In order to avoid using up too much time in this search, the algorithm is parameterised by the number of prefixes it will test, on the assumption that much shorter prefixes are unlikely to yield longer extents. (And Horspool provides some evidence for this heuristic, although you might want to verify that with your own experimentation.)
In Horspool's pseudocode, α is what I call the "candidate match" -- that is, the greedy match found at the previous step -- and βj is the greedy successor match for the input point after the jth prefix of α. (Counting from the end, so β0 is precisely the greedy successor match of α, with the result that setting K to 0 will yield the LZW algorithm. I think Horspool mentions this fact somewhere.) L is just the length of α. The algorithm will end up using some prefix of α, possibly (usually) all of it.
Here's Horspool's pseudocode from Figure 2 with my annotations:
initialize dictionary D with all strings of length 1;
set α = the string in D that matches the first
        symbol of the input;
set L = length(α);
while more than L symbols of input remain do
begin
    // The new string α++head(β0) must be added to D here, rather
    // than where Horspool adds it. Otherwise, it is not available for the
    // search for a successor match. Of course, head(β0) is not meaningful here
    // because β0 doesn't exist yet, but it's just the symbol following α in
    // the input.
    for j := 0 to max(L-1,K) do
        // The above should be min(L - 1, K), not max.
        // (Otherwise, K would be almost irrelevant.)
        find βj, the longest string in D that matches
            the input starting L-j symbols ahead;
    add the new string α++head(β0) to D;
    // See above; the new string must be added before the search
    set j = value of j in range 0 to max(L-1,K)
        such that L - j + length(βj) is a maximum;
    // Again, min rather than max
    output the index in D of the string prefix(α,j);
    // Here Horspool forgets that j is the number of characters removed
    // from the end of α, not the number of characters in the desired prefix.
    // So j should be replaced with L - j
    advance j symbols through the input;
    // Again, the advance should be L - j, not j
    set α = βj;
    set L = length(α);
end;
output the index in D of string α;

what is the right algorithm for ... eh, I don't know what it's called

Suppose I have a long and irregular digital signal made up of smaller but irregular signals occurring at various times (and overlapping each other). We will call these shorter signals the "pieces" that make up the larger signal. By "irregular" I mean that it is not a specific frequency or pattern.
Given the long signal I need to find the optimal arrangement of pieces that produce (as closely as possible) the larger signal. I know what the pieces look like but I don't know how many of them exist in the full signal (or how many times any one piece exists in the full signal). What software algorithm would you use to do this optimization? What do I search for on the web to get help on solving this problem?
Here's a stab at it.
This is actually the easier of the deconvolution problems. It is easier in that you may be able to have a unique answer. The harder problem is that you also don't know what the pieces look like. That case is called blind deconvolution. It is a harder problem and is usually iterative and statistical (ML or MAP), and the solution may not be right.
Luckily, your case is easier, but still not so easy because you have multiple pieces :p
I think that it may be commonly called mixture deconvolution?
So let f[t] for t=1,...N be your long signal. Let h1[t]...hn[t] for t=0,1,2,...M be your short signals. Obviously here, N>>M.
So your hypothesis is that:
(1) f[t] = h1[t+a1[1]] + h1[t+a1[2]] + ...
         + h2[t+a2[1]] + h2[t+a2[2]] + ...
         + ...
         + hn[t+an[1]] + hn[t+an[2]] + ...
Observe that each row of that equation is actually hj * uj where uj is the sum of shifted Kronecker delta. The * here is convolution.
So now what?
Let Hj be the (maybe transposed depending on how you look at it) Toeplitz matrix generated by hj, then the equation above becomes:
(2) F = H1 U1 + H2 U2 + ... + Hn Un
where F is the vector [f[1],...,f[N]] and Uj is the vector [uj[0],...,uj[N]],
subject to the constraint that each uj[k] must be either 0 or 1.
So you can rewrite this as:
(3) F = H * U
where H = [H1 ... Hn] (horizontal concatenation) and U = [U1; ... ;Un] (vertical concatenation).
H is an Nx(nN) matrix. U is an nN vector.
Ok, so the solution space is finite. It is 2^(nN) in size. So you can try all possible combinations to see which one gives you the lowest ||F - H*U||, but that will take too long.
What you can do is solve equation (3) using the pseudo-inverse, or multi-linear regression (which uses least squares, which again comes down to the pseudo-inverse), or something along the lines of this question:
Is it possible to solve a non-square under/over constrained matrix using Accelerate/LAPACK?
Then move that solution around within the null space of H to get a solution subject to the constraint that uj[k] must be either 0 or 1.
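For a concrete starting point, here is a rough numpy/scipy sketch of the unconstrained least-squares step for equation (3). The function names and the crude 0/1 rounding at the end are mine; as described above, a proper solution would still have to move within the null space of H to satisfy the 0/1 constraint.

import numpy as np
from scipy.linalg import toeplitz, lstsq

def build_H(pieces, N):
    # Each Hj is the N x N lower-triangular Toeplitz matrix whose first column is
    # the zero-padded short signal hj, so that Hj @ Uj is hj convolved with uj
    # (truncated to length N).
    blocks = []
    for h in pieces:
        first_col = np.zeros(N)
        first_col[:len(h)] = h
        blocks.append(toeplitz(first_col, np.zeros(N)))
    return np.hstack(blocks)                      # N x (n*N)

def deconvolve_ls(f, pieces):
    N = len(f)
    H = build_H(pieces, N)
    U, _, _, _ = lstsq(H, f)                      # unconstrained least-squares solution
    return (np.abs(U) > 0.5).astype(int)          # crude rounding toward the 0/1 constraint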
Alternatively, you can use something like Nelder-Mead or Levenberg-Marquardt to find the minimum of:
||F - H U|| + lambda g(U)
where g is a regularization function defined as:
g(U) = ||U - U*||
where U*[j] = 0 if |U[j]|<|U[j]-1|, else 1
Ok, so I have no idea if this will converge. If not, you have to come up with your own regularizer. It's kinda dumb to use a generalized nonlinear optimizer when you have a set of linear equations.
In reality, you're going to have noise and what not, so it actually may not be a bad idea to use something like MAP and apply the small pieces as prior.

Calculate the sum of a series

I need to calculate the sum of an infinite series using mixed and recursive methods. What is the difference between these two methods?
My code, below, shows how I am doing it. Which method am I using?
For example, to compute the series
Sum = -X - (X^2/2) - (X^3/3) - (X^4/4) - ... etc.
I would use this code
sum := -x;
numerator := x;
n := 2;
current := -x;
repeat
    numerator := numerator * x;
    previous := current;
    current := numerator/n;
    n := n + 1;
    sum := sum - current;
until ( abs(previous-current) < eps )
Your problem/question is too vague / too general. I cannot offer more, therefore, than a few general remarks:
For starters, a general method to sum "any" infinite series does not exist. For each series individually you will have to determine how to sum THAT PARTICULAR ONE, and this requires, first of all, a study of its convergence characteristics: a series may converge, diverge, or converge conditionally. Simply adding terms until a term gets smaller than some limit, or until the difference between successive terms becomes smaller than some limit, is not a guarantee that you're close to the limiting sum. In fact, it doesn't even guarantee that the sum is finite at all (consider the series 1 + 1/2 + 1/3 + 1/4 ... for example).
Now, let's view your example: -sum( x^n/n; n=1..inf ).
This particular series doesn't have a finite sum for any x>=1, nor for any x<-1: it converges only for -1<=x<1. For abs(x)>1 the terms simply get larger and larger... (however, read on!).
For abs(x)<1 a 'straightforward' approach of adding successive terms will 'in the end' give you the correct answer, but it will take a long while before you get close to the limiting sum, unless x is very small, and assessing HOW close you are with any finite sub-sum is far from trivial. Moreover, there are better (=faster-converging) methods to sum such types of series.
In this specific case, you may note that it is log(1-x), expressed as a Maclaurin series expansion, so there's no need to set up a tedious summation at all, because the result of the infinite summation is already known.
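For reference, the identity being used here is the standard Maclaurin expansion
log(1-x) = -( x + x^2/2 + x^3/3 + x^4/4 + ... ), valid for -1 <= x < 1,
which is term for term the series above, so for admissible x you can simply evaluate log(1-x) directly.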
Now, consider on the one hand that we can easily see that the terms will get bigger and bigger for higher 'n' whenever abs(x) is greater than 1, so that any simple summing procedure is bound to fail.
On the other hand, we have this Maclaurin expansion for {log(1-x); -1<=x<1}, and we may ponder how it all fits in with the knowledge that surely log(1-x) also exists and is finite for, say, x=-4: could we maybe 'define' the limit of the summation also for x<-1 by this logarithm?! Enter the wonderful world of analytic continuation. I won't go into this, it would take far too much space here.
All in all, summing an infinite series is an art, not something to throw into a standard-summing-machine. Consequently, without specifying which series you wish to sum, you cannot a priori say what method one should apply.
Finally, I do not know what you mean by a "mixed method", so I cannot comment on that, or on its comparison against a recursive method. A recursive method might come in whenever you can write your series in a form that is very similar to the original, but just 'slightly simpler'. An example, not from an infinite series, but from a finite one: the Fibonacci number F(n) can be defined as the finite sum F(n-1)+F(n-2). That is recursion, and you 'only' have to know some base value(s) - in this case F(0)=F(1)=1 - and there you have your recurrence set up. Rewriting a series in a recursive form may help to find an analytic solution, or to split off a part that has an analytic solution, leaving a 'more convenient' series that lends itself to a fast-converging numerical approach.
Maybe "mixed method" is intended to indicate a mixture of an analytical summation - as with your series: log(1-x) - and some (smart or brute-force) numerical approximation (where, as others pointed out, 'recursive' might be meant to be 'iterative').
To conclude: (a) clarify what you mean by "mixed" and "recursive" methods; (b) be specific about what type of series you need to sum, lest there's no sensible answer possible.
Some parts of an answer to your question:
I don't know what you mean by a 'mixed' method, though if you had 2 methods and made a new one out of bits of both of them then I guess I could see that you would then have a mixed method. But as a generally used, off-the-shelf term, it's meaningless to me. Since you contrast it with 'recursive', and since I've already decided that you are not a native English speaker, I wonder if you meant to write 'iterative'? It is very common to find 'iterative' and 'recursive' methods compared and contrasted.
What you have shown us is an 'iterative' method; a simple view is that an iterative method is one which relies on a loop (or loops) as your code does.
Incidentally, I think you could make your code simpler if you recognised that the first term in your series has the same form as all the other terms: you have simplified (X^1)/1 to X, which is mathematically correct, but computationally it is often more straightforward to operate on a sequence of identical terms rather than a sequence whose first term differs in form from all the rest.
A recursive method is one which calls itself. Since I suspect that I am helping you with homework I'm not going to write a recursive method for you, but you should be looking for a function which has the approximate form:
sum([]) = 0
sum([a b c d ...]) = a + sum([b c d ...])
Note that the 'function' sum (which is defined in 2 'clauses') appears on both the left-hand side and the right-hand side of the 2nd clause. Note also that on the right it is applied to a subset of the input arguments (on the left), which offers the possibility that at some stage the function will terminate.
The series that you have presented,
-X - (X^2/2) - (X^3/3) - (X^4/4) ...
can be written as
-(X^1/1) - (X^2/2) - (X^3/3) - ... - (X^n/n)
This lets us repeatedly add -(X^i/i), increasing i each time, until abs(previous-current) < eps.
This gives us the code below; I hope it meets your expectations.
i := 1;
sum := 0;
current := x;
repeat
    previous := current;
    current := - exp(x, i) / i; {here exp(x, i) is a function that computes x^i}
    sum := sum + current;
    i := i + 1;
until ( abs(previous-current) < eps )
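If it helps to have something runnable to compare against, here is a small Python version of the same iteration (the function name and the epsilon default are mine):

import math

def series_sum(x, eps=1e-12):
    # Sums -x - x^2/2 - x^3/3 - ... until successive terms differ by less than eps.
    # Only meaningful for -1 <= x < 1, where the series converges (to log(1 - x)).
    total = 0.0
    power = 1.0
    i = 1
    previous = float("inf")
    while True:
        power *= x
        current = -power / i
        total += current
        if abs(previous - current) < eps:
            return total
        previous = current
        i += 1

print(series_sum(0.5), math.log(1 - 0.5))  # both roughly -0.6931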

Fewest toggles to create an alternating chain

I'm trying to solve this problem on SPOJ : http://www.spoj.pl/problems/EDIT/
I'm trying to get a decent recursive description of the algorithm, but I'm failing as my thoughts keep spinning in circles! Can you guys help me out with this one? I'll try to describe the approach I'm trying to use to solve this.
Basically I want to solve a problem of size j-i, where i is the starting index and j is the ending index. Now, there should be two cases. If j-i is even then both the starting and the ending letters have to be the same case, and they have to be the opposite case when j-i is odd. I also want to reduce the problem to one of smaller size (j-i-1 or j-i-2), but I feel that if I know a solution to a smaller problem, then constructing a solution to a slightly bigger problem should also take into account the starting and ending letter cases of the smaller problem. This is exactly where I'm getting confused. Can you guys put my thoughts on the right track?
I think recursion is not the best way to go with this problem. It can be solved quite fast if we take a different approach!
Let us consider binary strings. Say an uppercase char is 1 and a lowercase one is 0. For example
AaAaB -> 10101
ABaa -> 1100
a -> 0
a "correct" alternating chain is either 10101010.. or 010101010..
We call the minimum number of substitutions required to change one string into the other the Hamming distance between the strings. What we have to find is the minimum Hamming distance between the input binary string and one of the two alternating chains of the same length.
It's not difficult: we XOR the input with each chain and then count the number of 1s in the result. For example, let's consider the following string: ABaa.
We convert it in binary:
ABaa -> 1100
We generate the only two alternating chains of length 4:
1010
0101
We XOR them with the input:
1100 XOR 1010 = 0101
1100 XOR 0101 = 1010
We count the 1s in the results and take the minimum. In this case, it's 2.
I coded this procedure in Java with some minor optimizations (buffered I/O, no real need to actually generate the alternating chains) and it got accepted (0.60 seconds).
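The Java source isn't included here, but the procedure fits in a few lines of Python (my own sketch of the same approach, not the accepted submission):

def min_toggles(s):
    n = len(s)
    # 1 for an uppercase letter, 0 for a lowercase one.
    bits = int("".join("1" if c.isupper() else "0" for c in s), 2)
    chain = int("10" * n, 2) >> n            # the n-bit chain 1010...
    dist_a = bin(bits ^ chain).count("1")    # Hamming distance to 1010...
    dist_b = n - dist_a                      # the other chain is the complement
    return min(dist_a, dist_b)

print(min_toggles("ABaa"))  # 2, as in the worked example above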
Given any string s of length n, there are only two possible "alternating chains".
These two variants can be defined sequentially by setting the case of the first letter (if the first is upper then the second is lower, the third is upper, ...).
A simple linear algorithm would be to make 2 simple assumptions about the first letter:
First letter is UpperCase
First letter is LowerCase
For each assumption, run a simple edit distance algorithm and you are done.
You can do it recursively, but you'll need to pass and return a lot of state information between functions, which I think is not worthwhile when this problem can be solved by a simple loop.
As the others say, there are two possible "desired result" strings: one starts with an uppercase letter (let's call it result_U) and one starts with a lowercase letter (result_L). We want the smaller of EditDistance(input, result_U) and EditDistance(input, result_L).
Also observe that, to calculate EditDistance(input, result_U), we do not need to generate result_U, we just need to scan input 1 character at a time, and each character that is not the expected case will need 1 edit to make it the correct case, i.e. adds 1 to the edit distance. Ditto for EditDistance(input, result_L).
Also, we can combine the two loops so that we scan input only once. In fact, this can be done while reading each input string.
A naive approach would look like this:
Pseudocode:
EditDistance_U = 0
EditDistance_L = 0
Read a character
To arrive at result_U, does this character need editing?
Yes => EditDistance_U += 1
No => Do nothing
To arrive at result_L, does this character need editing?
Yes => EditDistance_L += 1
No => Do nothing
Loop until end of string
EditDistance = min(EditDistance_U, EditDistance_L)
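In Python, the naive version of that loop might look like this (a sketch only; the optimizations and the two hints below are deliberately not applied):

def edit_distance(s):
    dist_u = 0   # edits needed to reach result_U (starts with an uppercase letter)
    dist_l = 0   # edits needed to reach result_L (starts with a lowercase letter)
    for i, c in enumerate(s):
        expect_upper_u = (i % 2 == 0)        # result_U expects uppercase at even indices
        if c.isupper() != expect_upper_u:    # does this character need editing for result_U?
            dist_u += 1
        if c.isupper() == expect_upper_u:    # result_L expects the opposite case
            dist_l += 1
    return min(dist_u, dist_l)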
There are obvious optimizations that can be done to the above also, but I'll leave it to you.
Hint 1: Do we really need 2 conditionals in the loop? How are they related to each other?
Hint 2: What is EditDistance_U + EditDistance_L?

Algorithm for finding basis of a set of bitstrings?

This is for a diff utility I'm writing in C++.
I have a list of n character-sets {"a", "abc", "abcde", "bcd", "de"} (taken from an alphabet of k=5 different letters). I need a way to observe that the entire list can be constructed by disjunctions of the character-sets {"a", "bc", "d", "e"}. That is, "b" and "c" are linearly dependent, and every other pair of letters is independent.
In the bit-twiddling version, the character-sets above are represented as {10000, 11100, 11111, 01110, 00011}, and I need a way to observe that they can all be constructed by ORing together bitstrings from the smaller set {10000, 01100, 00010, 00001}.
In other words, I believe I'm looking for a "discrete basis" of a set of n different bit-vectors in {0,1}^k. This paper claims the general problem is NP-complete... but luckily I'm only looking for a solution to small cases (k < 32).
I can think of really stupid algorithms for generating the basis. For example: for each of the k^2 pairs of letters, try to demonstrate (by an O(n) search) that they're dependent. But I really feel like there's an efficient bit-twiddling algorithm that I just haven't stumbled upon yet. Does anyone know it?
EDIT: I ended up not really needing a solution to this problem after all. But I'd still like to know if there is a simple bit-twiddling solution.
I'm thinking of a disjoint-set data structure, like union-find turned on its head (rather than combining nodes, we split them).
Algorithm:
Create an array main where you assign all the positions to the same group, then:
for each bitstring curr
    for each position i
        if (curr[i] == 1)
            // max of main can be stored for constant time access
            main[i] += max of main from previous iteration
Then all the distinct numbers in main are your different sets (possibly using the actual union-find algorithm).
Example:
So, main = 22222. (I won't use 1 as groups to reduce possible confusion, as curr uses bitstrings).
curr = 10000
main = 42222 // first bit (=2) += max (=2)
curr = 11100
main = 86622 // first 3 bits (=422) += max (=4)
curr = 11111
main = 16-14-14-10-10
curr = 01110
main = 16-30-30-26-10
curr = 00011
main = 16-30-30-56-40
Then split by distinct numbers:
{10000, 01100, 00010, 00001}
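A rough Python transcription of that procedure (my code; bitstrings are '0'/'1' strings of equal length, and this is the basic version, without the improvement described below):

def split_groups(bitstrings, k):
    main = [2] * k                            # every position starts in the same group
    for curr in bitstrings:
        prev_max = max(main)                  # max of main from the previous iteration
        for i, bit in enumerate(curr):
            if bit == "1":
                main[i] += prev_max
    # Positions that share the same final value form one basis element.
    groups = {}
    for i, value in enumerate(main):
        groups.setdefault(value, []).append(i)
    return ["".join("1" if i in positions else "0" for i in range(k))
            for positions in groups.values()]

print(split_groups(["10000", "11100", "11111", "01110", "00011"], 5))
# -> ['10000', '01100', '00010', '00001']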
Improvement:
To reduce the speed at which main increases, we can replace
main[i] += max of main from previous iteration
with
main[i] += 1 + (max - min) of main from previous iteration
EDIT: Edit based on j_random_hacker's comment
You could combine the passes of the stupid algorithm at the cost of space.
Make a bit vector called violations that is (k - 1) k / 2 bits long (so, 496 for k = 32). Take a single pass over the character sets. For each one, and for each pair of letters, look for violations (i.e. XOR the bits for those letters, and OR the result into the corresponding position in violations). When you're done, negate and read off what's left.
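A sketch of that pass in Python (my code; it uses a dict of pair flags instead of a packed (k-1)k/2-bit vector, and works on integer bitmasks):

from itertools import combinations

def dependent_pairs(charsets, k):
    # violation[(i, j)] becomes True as soon as letters i and j disagree in some set.
    violation = {pair: False for pair in combinations(range(k), 2)}
    for word in charsets:                     # each word is an int bitmask over k letters
        for i, j in combinations(range(k), 2):
            if ((word >> i) & 1) != ((word >> j) & 1):   # XOR of the two letters' bits
                violation[(i, j)] = True
    # "Negate and read off what's left": pairs that never disagree are the dependent ones.
    return [pair for pair, violated in violation.items() if not violated]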
You could give Principal Component Analysis a try. There are some flavors of PCA designed for binary or more generally for categorical data.
Since the problem has been shown to be NP-complete, for large vocabularies I doubt you will do better than a brute-force search (with various pruning possible) over the entire set of possibilities, which is O((2^k - 1) * n), at least in the worst case; probably some heuristics will help in many cases, as outlined in the paper you linked. This is your "stupid" approach generalized to all possible basis strings instead of just basis strings of length 2.
However, for small vocabs, I think an approach like this would do a lot better:
1. Are your words disjoint? If so, you are done (simple case of independent words like "abc" and "def").
2. Perform bitwise AND on each possible pair of words. This gives you an initial set of candidate basis strings.
3. Go to step 1, but instead of using the original words, use the current candidate basis strings.
Afterwards you also need to include any individual letter which is not a subset of one of the final accepted candidates, plus maybe some other minor bookkeeping for things like unused letters (using something like a bitwise OR of all the words).
Considering your simple example:
First pass gives you a, abc, bc, bcd, de, d
Second pass gives you a, bc, d
Bookkeeping gives you a, bc, d, e
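A rough sketch of that heuristic on integer bitmasks (my code; as noted below, there is no proof that it is correct in general):

from itertools import combinations

def pairwise_disjoint(masks):
    return all(a & b == 0 for a, b in combinations(masks, 2))

def candidate_basis(words, alphabet_mask):
    # words: iterable of int bitmasks; alphabet_mask could be the OR of all words.
    current = set(words)
    # Steps 1-3: replace the set by all non-empty pairwise ANDs until it is disjoint.
    while len(current) > 1 and not pairwise_disjoint(current):
        current = {a & b for a, b in combinations(current, 2)} - {0}
    # Bookkeeping: any letter not covered by an accepted candidate becomes its own element.
    covered = 0
    for m in current:
        covered |= m
    basis = set(current)
    leftover = alphabet_mask & ~covered
    for i in range(alphabet_mask.bit_length()):
        if (leftover >> i) & 1:
            basis.add(1 << i)
    return basis

With a = bit 0 through e = bit 4, candidate_basis({0b00001, 0b00111, 0b11111, 0b01110, 0b11000}, 0b11111) reproduces the a, bc, d, e result worked out above.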
I don't have a proof that this is right, but I think intuitively it is at least in the right direction. The advantage lies in working from the words themselves instead of the brute-force approach of enumerating all possible candidates. With a large enough set of words this approach would become terrible, but for vocabularies of up to say a few hundred or maybe even a few thousand words I bet it would be pretty quick. The nice thing is that it will still work even for a huge value of k.
If you like the answer and bounty it I'd be happy to try to solve in 20 lines of code :) and come up with a more convincing proof. Seems very doable to me.
