LZW decompression algorithm

I'm writing a program for an assignment which has to implement LZW compression/decompression.
I'm using the following algorithms for this:
-compression
w = NIL;
while ( read a character k )
{
    if wk exists in the dictionary
        w = wk;
    else
    {
        add wk to the dictionary;
        output the code for w;
        w = k;
    }
}
output the code for w;
-decompression
read a character k;
output k;
w = k;
while ( read a character k )
/* k could be a character or a code. */
{
    entry = dictionary entry for k;
    output entry;
    add w + entry[0] to dictionary;
    w = entry;
}
For the compression stage I'm just outputting ints representing the index of the
dictionary entry; the starting dictionary consists of the ASCII characters (0 - 255).
But when I come to the decompression stage I get an error.
For example, if I compress a text file consisting of only "booop",
it goes through these steps to produce an output file:
w    k    Dictionary    Output
-    b    -             -
b    o    bo (256)      98 (b)
o    o    oo (257)      111 (o)
o    o    -             -
oo   p    oop (258)     257 (oo)
p    -    -             112 (p)
output.txt:
98
111
257
112
Then when I come to decompress the file
w    k            entry    output    Dictionary
     98 (b)                b
b    111 (o)      o        o         bo (256)
o    257 (error)
257 (oo) hasn't been added yet. Can anyone see where I'm going wrong here cause I'm
stumped. Is the algorithm wrong?

Your compression part is right and complete, but the decompression part is incomplete: it only handles the case where the code is already in the dictionary. Because the decoder is always one step behind the encoder, it can receive a code that is not yet in its dictionary. But since it is only one step behind, it can work out what the encoder has just added, output the correctly decoded string, and then add that string to the dictionary. Extend your decompression process like this:
-decompression
read a character k;
output k;
w = k;
while ( read a character k )
/* k could be a character or a code. */
{
    if k exists in the dictionary
    {
        entry = dictionary entry for k;
        output entry;
        add w + entry[0] to dictionary;
        w = entry;
    }
    else
    {
        entry = w + firstCharacterOf(w);
        output entry;
        add entry to dictionary;
        w = entry;
    }
}
Then when you decompress the file and see 257, you find it is not in the dictionary. But you know the previous entry is 'o' and its first character is 'o' too; put them together and you get "oo". Now output oo and add it to the dictionary. Next you get code 112, which you know is p. Done!
w    k           entry    output    Dictionary
     98 (b)               b
b    111 (o)     o        o         bo (256)
o    257 (oo)    oo       oo        oo (257)
oo   112 (p)     p        p         oop (258)
See this explanation by Steve Blackstock for more information. There is also a better page, with flow charts, for the actual decoder and encoder implementation on which the GIF encoder and decoder of the "icafe" Java image library are based.
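The corrected pseudocode above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the input is text whose characters all fall in the range 0 - 255, codes are kept as a plain list of ints, and the names lzw_compress and lzw_decompress are made up for this example.

```python
def lzw_compress(text):
    """Encode a string as a list of integer codes (dictionary seeded with 0-255)."""
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    w = ""
    codes = []
    for k in text:
        wk = w + k
        if wk in dictionary:
            w = wk
        else:
            dictionary[wk] = next_code
            next_code += 1
            codes.append(dictionary[w])
            w = k
    if w:
        codes.append(dictionary[w])  # flush the last pending string
    return codes


def lzw_decompress(codes):
    """Decode a list of codes, including the code-not-yet-in-dictionary case."""
    if not codes:
        return ""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    w = dictionary[codes[0]]
    output = [w]
    for k in codes[1:]:
        if k in dictionary:
            entry = dictionary[k]
        elif k == next_code:
            # The encoder used the code it had only just created,
            # so the entry must be w plus the first character of w.
            entry = w + w[0]
        else:
            raise ValueError(f"invalid LZW code: {k}")
        output.append(entry)
        dictionary[next_code] = w + entry[0]
        next_code += 1
        w = entry
    return "".join(output)


codes = lzw_compress("booop")
print(codes)
print(lzw_decompress(codes))
```

Checking k == next_code rather than just "not in the dictionary" is a design choice in this sketch: any other unknown code cannot come from a valid encoder, so it is reported as an error instead of being decoded silently.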

From http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch are you falling into this case?
What happens if the decoder receives a code Z that is not yet in its dictionary? Since the decoder is always just one code behind the encoder, Z can be in the encoder's dictionary only if the encoder just generated it, when emitting the previous code X for χ. Thus Z codes some ω that is χ + ?, and the decoder can determine the unknown character as follows:
1) The decoder sees X and then Z.
2) It knows X codes the sequence χ and Z codes some unknown sequence ω.
3) It knows the encoder just added Z to code χ + some unknown character,
4) and it knows that the unknown character is the first letter z of ω.
5) But the first letter of ω (= χ + ?) must then also be the first letter of χ.
6) So ω must be χ + x, where x is the first letter of χ.
7) So the decoder figures out what Z codes even though it's not in the table,
8) and upon receiving Z, the decoder decodes it as χ + x, and adds χ + x to the table as the value of Z.
This situation occurs whenever the encoder encounters input of the form cScSc, where c is a single character, S is a string and cS is already in the dictionary, but cSc is not. The encoder emits the code for cS, putting a new code for cSc into the dictionary. Next it sees cSc in the input (starting at the second c of cScSc) and emits the new code it just inserted. The argument above shows that whenever the decoder receives a code not in its dictionary, the situation must look like this.
Although input of form cScSc might seem unlikely, this pattern is fairly common when the input stream is characterized by significant repetition. In particular, long strings of a single character (which are common in the kinds of images LZW is often used to encode) repeatedly generate patterns of this sort.
For this specific case, the Wikipedia explanation fits: you have X + ?, where X is (o). Z is unknown so far, so the missing character is the first letter of X, giving (oo). Add (oo) to the table as 257. I am just going on what I read on Wikipedia; let us know how this turns out if that is not the solution.

Related

Pairing the weight of a protein sequence with the correct sequence

This piece of code is part of a larger function. I already created a list of molecular weights and I also defined a list of all the fragments in my data.
I'm trying to figure out how I can go through the list of fragments, calculate their molecular weight and check if it matches the number in the other list. If it matches, the sequence is appended into an empty list.
combs = [397.47, 2267.58, 475.63, 647.68]
fragments = ['SKEPFKTRIDKKPCDHNTEPYMSGGNY', 'KMITKARPGCMHQMGEY', 'AINV', 'QIQD', 'YAINVMQCL', 'IEEATHMTPCYELHGLRWV', 'MQCL', 'HMTPCYELHGLRWV', 'DHTAQPCRSWPMDYPLT', 'IEEATHM', 'MVGKMDMLEQYA', 'GWPDII', 'QIQDY', 'TPCYELHGLRWVQIQDYA', 'HGLRWVQIQDYAINV', 'KKKNARKW', 'TPCYELHGLRWV']
frags = []
for c in combs:
    for f in fragments:
        if c == SeqUtils.molecular_weight(f, 'protein', circular = True):
            frags.append(f)
print(frags)
I'm guessing I don't fully know how the SeqUtils.molecular_weight command works in Python, but if there is another way that would also be great.
You are comparing floating point values for equality. That is bound to fail. You always have to account for some degree of error when dealing with floating point values. In this particular case you also have to take into account the error margin of the input values.
So do not compare floats like this
x == y
but instead like this
abs(x - y) < epsilon
where epsilon is a small tolerance chosen to suit the precision of your data.
I made two small modifications to your code: I swapped the order of the f and c loops so that the calculated value w can be stored and reused, and I append the value of w to the list frags as well, to make it easier to see what is happening.
Your modified code now looks like this:
from Bio import SeqUtils

combs = [397.47, 2267.58, 475.63, 647.68]
fragments = ['SKEPFKTRIDKKPCDHNTEPYMSGGNY', 'KMITKARPGCMHQMGEY', 'AINV', 'QIQD', 'YAINVMQCL', 'IEEATHMTPCYELHGLRWV',
             'MQCL', 'HMTPCYELHGLRWV', 'DHTAQPCRSWPMDYPLT', 'IEEATHM', 'MVGKMDMLEQYA', 'GWPDII', 'QIQDY',
             'TPCYELHGLRWVQIQDYA', 'HGLRWVQIQDYAINV', 'KKKNARKW', 'TPCYELHGLRWV']
frags = []
threshold = 0.5
for f in fragments:
    w = SeqUtils.molecular_weight(f, 'protein', circular=True)
    for c in combs:
        if abs(c - w) < threshold:
            frags.append((f, w))
print(frags)
This prints the result
[('AINV', 397.46909999999997), ('IEEATHMTPCYELHGLRWV', 2267.5843), ('MQCL', 475.6257), ('QIQDY', 647.6766)]
As you can see, the first value for the weight differs from the reference value by about 0.0009. That's why you did not catch it with your approach.
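As a variation on the abs(x - y) < epsilon test, Python's standard library has math.isclose, which accepts an absolute tolerance. The sketch below reuses the weights printed above instead of calling Biopython, so it runs on its own. The helper name match_weights and the tolerance of 0.01 are assumptions for illustration, not values taken from the question.

```python
import math


def match_weights(calculated, references, tolerance=0.01):
    """Return (name, weight, reference) for weights within tolerance of a reference."""
    matches = []
    for name, weight in calculated.items():
        for ref in references:
            # abs_tol is an absolute tolerance, like epsilon in abs(x - y) < epsilon
            if math.isclose(weight, ref, abs_tol=tolerance):
                matches.append((name, weight, ref))
    return matches


# Weights copied from the printed result above (not recomputed here).
calculated = {
    "AINV": 397.46909999999997,
    "IEEATHMTPCYELHGLRWV": 2267.5843,
    "MQCL": 475.6257,
    "QIQDY": 647.6766,
}
combs = [397.47, 2267.58, 475.63, 647.68]

for name, weight, ref in match_weights(calculated, combs):
    print(name, weight, ref)
```

A tolerance of 0.01 reflects the two decimal places of the reference values; a looser threshold such as 0.5 also works here, but it is more likely to pair a weight with the wrong fragment in a larger data set.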

Doubts regarding this pseudocode for the perceptron algorithm

I am trying to implement the perceptron algorithm above. But I have two questions:
Why do we just update w (weight) variable once? Shouldn't there be separate w variables for each Xi? Also, not sure what w = 0d means mathematically in the initialization.
What is the mathematical meaning of
yi(< xi,w >+b)
I kinda know what the meaning inside the bracket is but not sure about the yi() part.
(2) You can think of 'yi' as a function that depends on w, xi and b.
As a simple example, say y is a line that separates two different classes. In that case, y can be represented as y = wx + b. Now, if you use
w = 0, x = 1 and b = 0, then y = 0.
In your algorithm, you need to update the weight w when the output y is less than or equal to 0.
So, if you look carefully, you are not updating w just once: the update is inside an if statement, which is inside a for loop.
In your algorithm, you get n outputs y from n inputs x in each iteration of t. Here 'i' is used to index both the input xi and the output yi.
So, long story short, out of the n inputs x, you only need to update w when the output y for the corresponding input x is less than or equal to zero (in each iteration of t).
(1) As mentioned above, w is not updated only once.
Let's say you know that any output value greater than 0 (> 0) is the correct answer. If you get an output that is less than or equal to zero, the current weights have made a mistake and need to be fixed. This is what your algorithm does by updating w when the output does not match the desired one.
Here w is represented as a vector and it is initialized to zero.
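To make the update rule concrete, here is a minimal Python sketch. The pseudocode the question refers to is not reproduced in this thread, so the sketch assumes the usual textbook perceptron: each yi is the label (+1 or -1) of sample xi, < xi,w > is a dot product, and w = 0d is read as a zero vector with d entries. The function name train_perceptron, the epochs parameter and the sample data are made up for illustration.

```python
def train_perceptron(samples, labels, epochs=10):
    """Train one shared weight vector w and bias b on labelled samples."""
    d = len(samples[0])
    w = [0.0] * d  # zero vector with one entry per feature
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = sum(wj * xj for wj, xj in zip(w, x)) + b
            # y * activation <= 0 means x is misclassified (or on the boundary)
            if y * activation <= 0:
                w = [wj + y * xj for wj, xj in zip(w, x)]
                b += y
    return w, b


# Hypothetical, linearly separable toy data.
samples = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
labels = [1, 1, -1, -1]

w, b = train_perceptron(samples, labels)
print(w, b)
```

There is one w for the whole data set because w describes the separating boundary, not an individual sample; it can change many times, once for every misclassified xi in every pass.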

Understanding the meaning of CheckHalt(X,X) in the proof of a theorem in Susanna Epp's Discrete Mathematics with Applications

I have a very basic exposure to algorithms. I am a graduate in Mathematics. I was reading about the Halting Problem in the book Discrete Mathematics with Applications by Susanna Epp. It has the following theorem:
Theorem : There is no computer algorithm that will accept any algorithm X and data set D as input and then will output "halts" or "loops forever" to indicate whether or not X terminates in a finite number of steps when X is run with data set D.
Proof : Suppose there is an algorithm, call it CheckHalt, such that if an algorithm X and a data set D are input, then CheckHalt prints "halts" if X terminates in a finite number of steps when run with the data set D or "loops forever" if X does not terminate in a finite number of steps when run with data set D.
Now next lines are those which I don't understand in this proof
Observe that the sequence of characters making up an algorithm X can be regarded as a data set itself. Thus it is possible to consider running a CheckHalt with input (X,X).
So I have understood that CheckHalt essentially takes an algorithm X and a data set D as input. It tells us whether X will halt or loop forever if we run the algorithm X with that data set D as its (X's) input. Thus CheckHalt(X,D) seems fine.
My question is how can an algorithm X have an input X itself i.e how can we call an algorithm as a data set?
What is the meaning of the sentence "sequence of characters making up an algorithm X"?
I can understand CheckHalt(X,D). But what is CheckHalt(X,X)?
Thanks.
Consider the following algorithm to reverse a string:
function reverse(s) {
    var ret = "";
    for (var i = s.length - 1; i >= 0; i--) {
        ret = ret + s[i];
    }
    return ret;
}
It takes a string as input, and returns a string. Now consider the following input dataset:
"function reverse(s) {\n"
+ " var ret = \"\";\n"
+ " for (var i = s.length - 1; i >= 0; i--) {\n"
+ " ret = ret + s[i];\n"
+ " }\n"
+ " return ret;\n"
+ "}"
This is a string. It happens to be a string encoding the source of an algorithm. Because it is a string, it is a valid input to algorithms that accept strings, as the algorithm it happens to encode does. Indeed, if you pass this algorithm's encoding to the algorithm itself, you get the following well-defined output:
"}"
+ ";ter nruter "
+ "} "
+ ";]i[s + ter = ter "
+ "{ )--i ;0 => i ;1 - htgnel.s = i rav( rof "
+ ";"" = ter rav "
+ "{ )s(esrever noitcnuf"
In the same way, if you have a program X with a string encoding enc(X) and X accepts a string, you can pass enc(X) to X. If you have another algorithm that takes two strings, you can pass enc(X) as both of the parameters.
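The same idea can be tried directly in Python. The sketch below is only an illustration: REVERSE_SOURCE plays the role of enc(X), exec turns that text into a callable X, and lengths is a made-up stand-in for "another algorithm that takes two strings". It is not CheckHalt, which the theorem says cannot exist.

```python
# Hypothetical example: the source text of a small algorithm, held as a plain string.
REVERSE_SOURCE = (
    "def reverse(s):\n"
    "    return s[::-1]\n"
)

# Turn the text into a callable function (exec is used here only for illustration).
namespace = {}
exec(REVERSE_SOURCE, namespace)
reverse = namespace["reverse"]

# X applied to enc(X): the algorithm runs on its own source text.
reversed_source = reverse(REVERSE_SOURCE)
print(repr(reversed_source))


def lengths(program_text, data):
    """A made-up two-argument algorithm; both arguments are just strings."""
    return len(program_text), len(data)


# The same encoding can be supplied as both arguments, as in CheckHalt(X, X).
print(lengths(REVERSE_SOURCE, REVERSE_SOURCE))
```

Nothing in the call distinguishes "program" from "data": both are sequences of characters, which is all the proof needs.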
A Dataset is a pretty open definition, so it should definitely be conceivable that a dataset would consist of a string of characters. But I think an example will help.
Imagine that X is an algorithm for counting periods (.) in a string. X could be written any number of ways, but if you want, imagine this particular way:
1) Start a count at 0 and a position pointer at 0.
2) Read the character at pointer position in the string.
3) If the character is a ., increment our count.
4) If the character is the last character in the string, return our count.
5) Increment the position pointer
6) Go back to step 2.
The six-step list I just wrote is itself a string... and can thus be applied to itself as data (we get 12 I think). In this case the algorithm can be applied to itself as data.
In this case, CheckHalt(X,X) would print "halts", since the algorithm does not loop forever.
Of course, not every algorithm will be able to accept itself as data. For instance, the GCD algorithm needs integer input, so it could not be applied to itself. However, I presume the counter-example being constructed involves an algorithm that can be applied to itself as a string of characters.

Ruby loop order?

I'm trying to bruteforce a password. As I was playing with some loops, I've noticed there's a specific order. Like, if I have for i in '.'..'~' it puts
.
/
0
1
2
3
4
5
6
7
8
9
:
;
<
=
>
?
@
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
[
\
]
^
_
`
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
{
|
}
~
After seeing this, I wondered to myself "what is the loop order in Ruby?" What character is the highest priority and which is the lowest priority? Sorry if this question is basic. I just haven't found a site where anyone knows. If you have questions about the question just ask. I hope this is clear enough!
The order is defined by the binary representation of the letters. Which, in turn, is defined by a standard. The standard used is ASCII (American Standard Code for Information Interchange).
http://www.asciitable.com/
Other encoding standards exist, like EBCDIC which is used by IBM mid-range computers.
for / in is (mostly) syntactic sugar for each, so
for i in '.'..'~' do puts i end
is (roughly) equivalent (modulo local variable scope) to
('.'..'~').each do |i| puts i end
Which means that we have to look at Range#each for our answer (bold emphasis mine):
each {| i | block } → rng
Iterates over the elements of range, passing each in turn to the block.
The each method can only be used if the begin object of the range supports the succ method. A TypeError is raised if the object does not have succ method defined (like Float).
And the documentation for the Range class itself provides more details:
Custom Objects in Ranges
Ranges can be constructed using any objects that can be compared using the <=> operator. Methods that treat the range as a sequence (#each and methods inherited from Enumerable) expect the begin object to implement a succ method to return the next object in sequence.
So, while it isn't spelled out directly, it is clear that Range#each works by
Repeatedly sending the succ message to the begin object (and then to the object that was returned by succ, and then to that object, and so forth), and
Comparing the current element to the end object using the <=> spaceship combined comparison operator to figure out whether to produce another object or end the loop.
Which means that we have to look at String#succ next:
succ → new_str
Returns the successor to str. The successor is calculated by incrementing characters starting from the rightmost alphanumeric (or the rightmost character if there are no alphanumerics) in the string. Incrementing a digit always results in another digit, and incrementing a letter results in another letter of the same case. Incrementing nonalphanumerics uses the underlying character set’s collating sequence.
Basically, what this means is:
incrementing a letter does what you expect
incrementing a digit does what you expect
incrementing something that is neither a letter nor a digit is arbitrary and dependent on the string's character set's collating sequence
In this particular case, you didn't tell us what the collating sequence of your string is, but I assume it is ASCII, which means you get what is colloquially called ASCIIbetical ordering.
It's not about priority, but the order of their values. As already said, the characters have their own ASCII representation (E.g., 'a' value is 97 and 'z' value is 122).
You could see this for yourself trying this:
('a'..'z').each do |c|
  puts c.ord
end
Analogously, this should also work:
(97..122).each do |i|
  puts i.chr
end

Hash decryption

I have a hash decryption function. If the input is 664804774844, the output is agdpeew. I use modulo and division to find each letter's index. But in the while loop I wrote i = 7 because I know the size of the output string (agdpeew). How can I find i?
The decryption function:
var f = function (h) {
    var letters, result, i;
    i = 7;
    result = "";
    letters = "acdegilmnoprstuw";
    while (i) {
        result += letters[parseInt(h % 37)];
        h = h / 37;
        i--;
    }
    return result.split("").reverse().join("");
};
The encryption function:
function hash(s) {
    var h = 7;
    var letters = "acdegilmnoprstuw";
    for (var i = 0; i < s.length; i++) {
        h = (h * 37 + letters.indexOf(s[i]));
    }
    return h;
}
It depends on how you handle overflows. If your "encryption" function allows inputs long enough that h would overflow at some point then you are stuffed and your current method of decryption wouldn't work at all.
If you can guarantee no overflowing, then your final h will be a sum of terms of the form (An)x^n, where An is the nth letter in your sequence converted to a number via your indexOf method (and x in this case is 37), plus a leading term contributed by the initial value of h.
Your decryption basically takes the x^0 term (by using mod x) and then converts that. It then divides by x (using integer maths presumably) to lose the old x^0 term and get a new one to interpret.
This means that you can just keep doing this until h is back down to its initial value (7 in your hash function; it would be 0 if the hash started from 0), and at that point you know you've dealt with all the characters.
An interesting note is that x just needs to be at least the length of letters (because An must be less than x). A smaller x would allow more input characters before overflow.
If you are allowing overflow, then you have no way to do this unless you know how long the input was. Even then it might be tricky. If your input is unlimited in length, then you could have a 1000 character input, and with all those combinations you might expect a lot of possible values of h. In fact there are not: there are still only 2^32 possible outcomes (in fact fewer with your algorithm), and if you have more than 2^32 possible inputs then you cannot possibly have a reversible function, because at least 2 inputs must map to the same hash value.
This is why leppie says you cannot decrypt a hash value because you lose information in creating it that cannot be recovered. Unless you have constraints or some other information then you are stuck.
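For the no-overflow case described above, the loop can be sketched in Python as follows. Python integers do not overflow, so that case is assumed throughout. The names hash_string and unhash and the constants SEED and BASE are made up for illustration; the letters, the starting value 7 and the multiplier 37 come from the question. Because the hash starts from 7 rather than 0, the loop stops when h is back down to 7.

```python
LETTERS = "acdegilmnoprstuw"
SEED = 7   # initial value of h in the question's hash function
BASE = 37  # multiplier used by the hash


def hash_string(s):
    """Same arithmetic as the question's hash(), with no overflow."""
    h = SEED
    for ch in s:
        h = h * BASE + LETTERS.index(ch)
    return h


def unhash(h):
    """Recover the string without knowing its length in advance."""
    chars = []
    while h > SEED:
        h, index = divmod(h, BASE)  # integer division and remainder in one step
        if index >= len(LETTERS):
            raise ValueError("not a value produced by hash_string")
        chars.append(LETTERS[index])
    if h != SEED:
        raise ValueError("not a value produced by hash_string")
    return "".join(reversed(chars))


print(unhash(664804774844))
```

The length i never has to be supplied: the number of loop iterations is the length of the original string.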
