Issues with LZW algorithm variable-length decoding procedure

The setup
Say I've got:
A series of numbers resulting from LZW compression of a bitmap:
256 1 258 258 0 261 261 259 260 262 0 264 1 266 267 258 2 273 2 262 259 274 275 270 278 259 262 281 265 276 264 270 268 288 264 257
An LZW-compressed, variable-length-encoded bytestream (including the LZW code size header and sub-block markers) which represents this same series of numbers:
00001000 00101001 00000000 00000011 00001000 00010100 00001000 10100000
01100000 11000001 10000001 00000100 00001101 00000010 01000000 00011000
01000000 11100001 01000010 10000001 00000010 00100010 00001010 00110000
00111000 01010000 11100010 01000100 10000111 00010110 00000111 00011010
11001100 10011000 10010000 00100010 01000010 10000111 00001100 01000001
00100010 00001100 00001000 00000000
And an initial code width of 8.
The problem
I'm trying to derive the initial series of numbers (the integer array) from the bytestream.
From what I've read, the procedure is to take the initial code width and scan the bytestream right-to-left, reading (initial code width + 1) bits at a time, to extract the integers. For example:
iteration #1: 1001011011100/001/ yield return 4
iteration #2: 1001011011/100/001 yield return 1
iteration #3: 1001011/011/100001 yield return 6
iteration #4: 1001/011/011100001 yield return 6
This procedure will not work for iteration #5, which will yield 1:
iteration #5: 1/001/011011100001 yield return 1 (expected 9)
The code width should have been increased by one.
The question
How am I supposed to know when to increase the code width when reading the variable-length-encoded bytestream? Do I have all of the required information necessary to decompress this bytestream? Am I conceptually missing something?
UPDATE:
After a long discussion with greybeard, I found out that I was reading the binary string incorrectly: 00000000 00000011 00 is to be interpreted as 256, 1. The bytestream is not read as big-endian; the codes are packed least-significant-bit first within each byte.
And very roughly speaking, if you are decoding a bytestream, you increase the number of bits read every time you read 2^N-1 codes, where N is the current code width.
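To make that bit order concrete, here is a minimal sketch of an LSB-first code reader in Java (the class and method names are mine, and it assumes the code-size byte, the sub-block length bytes and the terminator have already been stripped from the stream):

// Minimal sketch: read variable-width LZW codes from bytes packed
// least-significant-bit first (GIF-style). Names are illustrative only.
class LsbBitReader {
    private final byte[] data;   // data bytes only, block markers removed
    private int bitPos = 0;      // absolute bit position in the stream

    LsbBitReader(byte[] data) {
        this.data = data;
    }

    // Read the next `width` bits; both the stream and the code fill up LSB-first.
    int readCode(int width) {
        int code = 0;
        for (int i = 0; i < width; i++) {
            int b = data[bitPos >> 3] & 0xFF;    // byte containing the current bit
            int bit = (b >> (bitPos & 7)) & 1;   // take bits from the low end first
            code |= bit << i;
            bitPos++;
        }
        return code;
    }
}

On the stream above (whose data bytes start 00000000 00000011 00001000 ...), readCode(9) returns 256 and the next call returns 1, matching the interpretation in this update.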

When decompressing, you are supposed to build a dictionary in much the same way as the compressor does. You know you need to increase the code width as soon as the compressor might use a code too wide for the current width.
As long as the dictionary is not full (the maximum code is not assigned), a new code is assigned for every (regular) code put out (not the Clear Code or End Of Information codes).
With the example in the presentation you linked, 8 is assigned when the second 6 is "transmitted" - you need to switch to four bits before reading the next code.
(This is where the example and your series of numbers differ - the link presents 4, 1, 6, 6, 2, 9.)
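In code, the decoder's bookkeeping can look roughly like the sketch below. It assumes GIF-style parameters matching the stream in the question (minimum code size 8, so Clear Code 256, End Of Information 257, and a 12-bit ceiling) and ignores the slightly different "early change" behaviour of some encoders:

// Sketch of the width bookkeeping only, not a full LZW decoder.
int minCodeSize = 8;
int clearCode = 1 << minCodeSize;   // 256
int endCode = clearCode + 1;        // 257
int nextCode = clearCode + 2;       // 258: first code the dictionary will assign
int codeWidth = minCodeSize + 1;    // start by reading 9-bit codes

// ... inside the decode loop, after handling a regular code
// (not Clear/EOI) and adding one new entry to the dictionary:
nextCode++;
if (nextCode == (1 << codeWidth) && codeWidth < 12) {
    codeWidth++;   // the compressor could now emit a code needing one more bit
}

Reading a Clear Code resets nextCode to 258 and codeWidth back to 9.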

Related

How to calculate the EXTENDED_PAYLOAD_LENGTH if the PAYLOAD_LENGTH is 126 in a WebSocket frame data?

My goal is to calculate the payload length of a message sent by a client to the server through the WebSocket protocol.
I am using RFC6455 as a reference.
The length of the "Payload data", in bytes: if 0-125, that is the
payload length. If 126, the following 2 bytes interpreted as a
16-bit unsigned integer are the payload length. If 127, the
following 8 bytes interpreted as a 64-bit unsigned integer (the
most significant bit MUST be 0) are the payload length.
Here is a sample of the frame data for the message abcdef sent from a client to the server.
129 134 167 225 225 210 198 131 130 182 194 135
What if the second byte value is 254? In that case 254 - 128 equals 126.
129 254 167 225 225 210 198 131 130 182 194 135
I am assuming the third and fourth bytes are the EXTENDED_PAYLOAD_LENGTH. In this case, because the second byte minus 128 equals 126, the EXTENDED_PAYLOAD_LENGTH is the actual payload length.
However, in the RFC, it specifically says 16-bit instead of the next two bytes.
If 126, the following 2 bytes interpreted as a
16-bit unsigned integer are the payload length.
How do you combine the third and fourth bytes to get the actual payload length? I wrote the code in Java, and currently I can't read the frame data bit by bit; I read it byte by byte instead.
PS: I am thinking of using bitwise operators; am I going in the right direction?
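Yes, bitwise operators are the right direction. A minimal sketch of the 126 case in Java (the frame array and all names here are mine, not from any library): mask each byte with 0xFF to undo Java's sign extension, shift the first of the two bytes left by 8, and OR in the second.

public class PayloadLengthDemo {
    // Payload length of an unfragmented client frame; only the 126 case is shown.
    static int payloadLength(byte[] frame) {
        int len = frame[1] & 0x7F;        // low 7 bits of byte 2 (0x80 is the MASK bit)
        if (len == 126) {
            int hi = frame[2] & 0xFF;     // & 0xFF undoes sign extension of byte
            int lo = frame[3] & 0xFF;
            len = (hi << 8) | lo;         // the two bytes as one 16-bit big-endian value
        }
        // (The 127 case would read the following 8 bytes the same way into a long.)
        return len;
    }

    public static void main(String[] args) {
        // The hypothetical frame from the question: second byte 254 -> 254 - 128 = 126.
        byte[] frame = {(byte) 129, (byte) 254, (byte) 167, (byte) 225,
                        (byte) 225, (byte) 210, (byte) 198, (byte) 131,
                        (byte) 130, (byte) 182, (byte) 194, (byte) 135};
        System.out.println(payloadLength(frame));   // (167 << 8) | 225 = 42977
    }
}

For the real abcdef frame (second byte 134), the same function simply returns 134 & 0x7F = 6.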

Unpacking a payload using Ruby

I have a hex payload 1703cd0650000aff79fc2802 which I would like to unpack to values. I have been given a definition:
Byte 1:
0bxx01, order: MS nibble
0byyxx, order: LS nibble
Byte 2 0bxxxxxxxx MSB
Byte 3 0bxxxxxxxx LSB
and so on for 12 bytes in total
I concatenate bytes 2 and 3 to get a distance in mm. Could anyone help me with an unpack that achieves this?
I have tried to unpack using 'B8B8B8':
Byte 1, part 1: I get 0001 which I calc to be 1
Byte 1, part 2: I get 1110, which I have reversed (as LSB) which gives me 14
Byte 2: I get 00000011, which I think is 3
Byte 3: I get 11001101, which I reversed to 10110011 (as LSB), which I think is 179
So is the value 3 and 179 concatenated, i.e. 3179, or ('00000011' + '11001101').to_i(2), which is 973?
I have also tried to unpack with 'B8S>', and bytes 2+3 also give me 973.
Thanks
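If MSB/LSB in that definition means most/least significant byte, which is the usual reading, then no bit reversal is needed: the distance is simply (byte2 << 8) | byte3 = 0x03cd = 973 mm, which is what the big-endian 'S>' unpack gives. A small sketch of the same arithmetic outside Ruby (Java, with names of my own):

public class DistanceDemo {
    public static void main(String[] args) {
        String hex = "1703cd0650000aff79fc2802";   // payload from the question
        int[] bytes = new int[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        // Byte numbering is 1-based as in the definition: byte 2 = MSB, byte 3 = LSB.
        int distanceMm = (bytes[1] << 8) | bytes[2];   // (0x03 << 8) | 0xcd
        System.out.println(distanceMm);                // prints 973
    }
}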

Understanding .bmp file

I have a .bmp file
I sort of understand it and sort of don't. I understand that the first 14 bytes are my Bitmapfileheader. I furthermore understand that my Bitmapinfoheader also contains information about the bitmap and is about 40 bytes large (in version 3).
What I do not understand is, how the information is stored in there.
I have this image:
Why is all the color information stored as "FF"? I know that the "00" bytes are "junk bytes". What I do not understand is why everything is in "FF"?!
Furthermore, I do not understand what type of "encoding" that is. 42 4D equals "BM". What is that? How can I translate what I see there into colors, letters or numbers?!
What I can read in this case:
Bitmapfileheader:
First 2 bytes: BM if it is a .bmp file. 42 4D = BM (however 42 4D transforms to BM).
Next 4 bytes: size of the bitmap. BA 01 00 00. Don't know what size that should be.
Next 4 Bytes: Something reserved.
Next 4 Bytes: Offset (did not quite understand that)
Bitmapinfoheader
Next 4 Bytes: Size of the bitmapinfoheader. 6C 00 00 00 here.
Next 4 Bytes: Width of the .bmp. 0A 00 00 00. I know that that must be 10px since I created that file.
Next 4 Bytes: Height of the .bmp. 0A 00 00 00. I know that that must be 10px since I created that file.
Next 2 Bytes: Something from another file format.
Next two Bytes: Color depth. 18 00 00 00. I thought that can only be 1, 2, 4, 8, 16, 24, 32?
The first 2 bytes of information that you see, "42 4D", are what we call the magic number. It is the signature of the file: 42 4d is the hex notation of 01000010 01001101 in binary, which are the ASCII codes for the letters "B" and "M".
Nearly every file format has one: .jpg, .gif, and so on. You get it.
Here is an image that illustrates the complete 54-byte BMP header (24-bit BMP).
BMP Header
The total size of the BMP is the size of the header + BMP.width * BMP.height * 3 (1 byte for red, 1 byte for green, 1 byte for blue, in the case of 8 bits of information per channel) + padding (if it exists).
The junk bytes you refer to are the padding; they are needed when the size of each scanline (row) in bytes is not a multiple of 4.
White in hex notation is ffffff, two hex digits for each of the red, green and blue channels.
In decimal notation each channel has the value 255, because 2^8 (8 bits) - 1 = 255.
Hope this clears things up a bit (pun not intended) for you.
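To tie the numbers together, here is a small sketch (mine, not part of the original answer) that reads the little-endian header fields at their standard BITMAPFILEHEADER/BITMAPINFOHEADER offsets; the values in the comments are the ones from the question, and "image.bmp" is a placeholder path:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BmpHeaderDemo {
    public static void main(String[] args) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get("image.bmp"));
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);

        // BITMAPFILEHEADER (14 bytes)
        short magic = buf.getShort(0);    // 0x4D42 = "BM"
        int fileSize = buf.getInt(2);     // BA 01 00 00 -> 0x000001BA = 442 bytes
        int dataOffset = buf.getInt(10);  // where the pixel array starts

        // BITMAPINFOHEADER (starts at offset 14)
        int headerSize = buf.getInt(14);  // 6C 00 00 00 -> 108, so actually a BITMAPV4HEADER
        int width = buf.getInt(18);       // 0A 00 00 00 -> 10
        int height = buf.getInt(22);      // 0A 00 00 00 -> 10
        int bitCount = buf.getShort(28);  // 18 00 -> 24 bits per pixel

        // Each scanline is padded up to a multiple of 4 bytes.
        int bytesPerRow = width * bitCount / 8;
        int padding = (4 - bytesPerRow % 4) % 4;
        int pixelArraySize = (bytesPerRow + padding) * Math.abs(height);

        System.out.printf("%dx%d, %d bpp, %d bytes of pixels, file size %d%n",
                width, height, bitCount, pixelArraySize, fileSize);
    }
}

For the 10x10, 24-bit file above, each row is 30 bytes of pixels plus 2 bytes of padding, so the pixel array is 32 * 10 = 320 bytes, and 14 + 108 + 320 = 442 = 0x01BA, which is exactly the BA 01 00 00 stored in the file header.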

What are the possible ways to encode a counter in a byte?

I am reverse engineering an AC remote control. When I send the temperature values (from 17C to 30C) I get the following stream.
Temperature - Binary - Hex - Decimal
17C - 00000000 - 0x00 - 0
18C - 00010000 - 0x10 - 16
19C - 00110000 - 0x30 - 48
20C - 00100000 - 0x20 - 32
21C - 01100000 - 0x60 - 96
22C - 01110000 - 0x70 - 112
23C - 01010000 - 0x50 - 80
24C - 01000000 - 0x40 - 64
25C - 11000000 - 0xc0 - 192
26C - 11010000 - 0xd0 - 208
27C - 10010000 - 0x90 - 144
28C - 10000000 - 0x80 - 128
29C - 10100000 - 0xa0 - 160
30C - 10110000 - 0xb0 - 176
What possible method could they have used to encode the temperature data in the byte?
Is there any method to reverse engineer the stream?
Looking at it, it's clear that they are adding powers of 2, but I can't figure out the logic behind it.
The first four bits form a 4-bit Gray code, since every two consecutive bit sequences differ in exactly one position:
0000
0001
0011
0010
0110
0111
0101
0100
1100
1101
1001
1000
1010
1011
[1111] (missing, probably 31C)
Your sample does not give any indication about what happens outside the [17, 31] range.
As for why this pattern would be used, Wikipedia provides some examples:
Gray codes are used in position encoders (linear encoders and rotary encoders), in preference to straightforward binary encoding. This avoids the possibility that, when several bits change in the binary representation of an angle, a misread will result from some of the bits changing before others.
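If that reading is right, the byte can be reproduced with the standard binary-to-Gray conversion n ^ (n >> 1), using a counter that starts at 0 for 17C and placing the Gray value in the upper nibble. A small sketch (my own, just to check it against the table):

public class GrayDemo {
    public static void main(String[] args) {
        for (int temp = 17; temp <= 30; temp++) {
            int n = temp - 17;           // counter: 0 for 17C, 1 for 18C, ...
            int gray = n ^ (n >> 1);     // binary-to-Gray conversion
            int b = gray << 4;           // Gray code sits in the upper nibble
            System.out.printf("%dC - %s - 0x%02x%n",
                    temp, Integer.toBinaryString(b | 0x100).substring(1), b);
        }
    }
}

Running it reproduces the table above, e.g. 30C gives 10110000 = 0xb0.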

Padding the message in SHA256

I am trying to understand SHA256. On the Wikipedia page it says:
append the bit '1' to the message
append k bits '0', where k is the minimum number >= 0 such that the resulting message
length (modulo 512 in bits) is 448.
append length of message (without the '1' bit or padding), in bits, as 64-bit big-endian
integer
(this will make the entire post-processed length a multiple of 512 bits)
So if my message is 01100001 01100010 01100011 I would first add a 1 to get
01100001 01100010 01100011 1
Then you would fill in 0s so that the total length is 448 mod 512:
01100001 01100010 01100011 10000000 0000 ... 0000
(So in this example, one would add 448 - 25 = 423 zero bits.)
My question is: What does the last part mean? I would like to see an example.
It means the message length, written as a 64-bit integer, with the bytes appearing in order of decreasing significance. So if the message length is 37113, that's 90 f9 in hex; two bytes. There are two basic(*) ways to represent this as a 64-bit integer,
00 00 00 00 00 00 90 f9 # big endian
and
f9 90 00 00 00 00 00 00 # little endian
The former convention follows the way numbers are usually written out in decimal: one hundred and two is written 102, with the most significant part (the "big end") being written first, the least significant ("little end") last. The reason that this is specified explicitly is that both conventions are used in practice; internet protocols use big endian, Intel-compatible processors use little endian, so if they were decimal machines, they'd write one hundred and two as 201.
(*) Actually there are 8! = 40320 ways to represent a 64-bit integer if 8-bit bytes are the smallest units to be permuted, but two are in actual use.
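Putting the pieces together for the example in the question ("abc" = 24 message bits), here is a small sketch of just the padding step, not the full hash; the class and method names are mine, and it assumes the message is a whole number of bytes (so the appended '1' bit plus seven '0' bits form the single byte 0x80):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class Sha256PaddingDemo {
    // Return the message with SHA-256 preprocessing (padding) applied.
    static byte[] pad(byte[] message) {
        long bitLength = (long) message.length * 8;
        // One byte for 0x80, then zero bytes until total length % 64 == 56,
        // then 8 bytes for the big-endian bit length.
        int padded = message.length + 1;
        while (padded % 64 != 56) {
            padded++;
        }
        ByteBuffer buf = ByteBuffer.allocate(padded + 8).order(ByteOrder.BIG_ENDIAN);
        buf.put(message);
        buf.put((byte) 0x80);      // the '1' bit followed by seven '0' bits
        buf.position(padded);      // the intervening bytes are already zero
        buf.putLong(bitLength);    // 64-bit big-endian length, in bits
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] padded = pad("abc".getBytes(StandardCharsets.US_ASCII));
        System.out.println(padded.length * 8 + " bits");    // 512
        for (int i = padded.length - 8; i < padded.length; i++) {
            System.out.printf("%02x ", padded[i]);           // 00 00 00 00 00 00 00 18
        }
        System.out.println();
    }
}

The padded message is 512 bits long, and its last eight bytes are 00 00 00 00 00 00 00 18: the big-endian 64-bit encoding of 24, the original message length in bits.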
