How to detect that bit input ended? - bit

I write LZW/Huffman encoder/decoder. LZW coder sends the number of bits according to its table: if it has less then 2^n elements, it sends n bits. Huffman coder recieve these bits byte by byte and encode them into specific number of bits according to its tree.
So the problem is the last byte can contain less then 8 bits. If i use EOF to detect end of input when decoding i can suddenly get EOF value before actual end of input. And if i send/recieve 4 bytes considering the first one is for sign, i lose 1 bit every 4 bytes.
Should i lose these first bits or there is another better solution i don't know?

Related

Could a CRC32 key with a most or least significant bit of 0 be valid?

I have a server receiving UDP packets with the payload being a number of CRC32 checksumed 4 byte words. The header in each UDP packet has a 2 byte field holding the "repeating" key used for the words in the payload. The way I understand it is that in CRC32 the keys must start and end with a 1 in the binary representation of the key. In other words the least and most significant bits of the key must be a 1 and not 0. So my issue is that I get, for example, the first UDP packet received has the key holding field reading 0x11BC which would have the binary representation 00010001 10111100. So the 1's are neither right nor left aligned to the key holding word. There are trailing 0's on both sides. Is my understanding on valid CRC32 keys wrong then? I ask as I'm trying to write the code to check each word using the key as is and it seems to always give a remainder meaning every word in the payload has an error and yet the instructions I've been given guarantee that the first packet received in the sample given has no errors.
Although it is true that CRC polynomials always have the top and bottom bit set, often this is dealt with implicitly; a 32-bit CRC is actually a 33-bit calculation and the specified polynomial ordinarily omits the top bit.
So e.g. the standard quoted polynomial for a CCITT CRC16 is 0x1021, which does not have its top bit set.
It is normal to include the LSB, so if you're certain you know which way around the polynomial has been specified then either the top or the bottom bit of your word should be set.
However, for UDP purposes you've possibly also made a byte ordering error on one side of the connection or the other? Network byte ordering is conventionally big endian whereas most processors today are little — is one side of the link switching byte order but not the other?

Why does MQTT use such a strange encoding scheme for remaining length?

I've recently started writing an MQTT library for a microcontroller. I've been following the specification document. Section 2.2.3 explains how the remaining length field (part of the fixed header) encodes the number of bytes to follow in the rest of the packet.
It uses a slightly odd encoding scheme:
Byte 0 = a mod 128, a /= 128, if a > 0, set top bit and add byte 1
Byte 1 = a mod 128, a /= 128, if a > 0, set top bit... etc
This variable length encoding seems odd in this application. You could easily transmit the same number using fewer bytes, especially once you get into numbers that take 2-4 bytes using this scheme. MQTT was designed to be simple to use and implement. So why did they choose this scheme?
For example, decimal 15026222 would be encoded as 0xae 0x90 0x95 0x7, however in hexadecimal it's 0xE5482E -- 3 bytes instead of four. The overhead in calculating the encoding scheme and decoding it at the other end seems to contradict the idea that MQTT is supposed to be fast and simple to implement on an 8-bit microcontroller.
What are the benefits to this encoding scheme? Why is it used? The only blog post I could find that even mentions any motivation is this one, which says:
The encoding of the remaining length field requires a bit of additional bit and byte handling, but the benefit is that only a single byte is needed for most messages while preserving the capability to send larger message up to 268’435’455 bytes.
But that doesn't make sense to me. You could have even more messages be only a single byte if you used the entire first byte to represent 0-255 instead of 0-127. And if you used straight hexadecimal, you could represent a number as large as 4 294 967 295 instead of only 268 435 455.
Does anyone have any idea why this was used?
As the comment you cited explains, under the assumption that "only a single byte is needed for most messages", or in other words, under the assumption that most of the time a <= 127 only a single byte is needed to represent the value.
The alternatives are:
Use a value to explicitly indicate how many bytes (or bits) are needed for a. This would require dedicating at least 2 bits to support at most "4 byte" sized a for all messages.
Dedicate a fixed size for a, probably 4 bytes, for all messages. This is inferior if many (read: most) messages don't utilize this size and can't support larger values if that becomes a requirement.

LZ4 compression algorithm explanation

Description from Wikipedia:
The LZ4 algorithm represents the data as a series of sequences. Each sequence begins with a one byte token that is broken into two 4 bit fields. The first field represents the number of literal bytes that are to be copied to the output. The second field represents the number of bytes to copy from the already decoded output buffer (with 0 representing the minimum match length of 4 bytes). A value of 15 in either of the bitfields indicates that the length is larger and there is an extra byte of data that is to be added to the length. A value of 255 in these extra bytes indicates that yet another byte to be added. Hence arbitrary lengths are represented by a series of extra bytes containing the value 255. After the string of literals comes the token and any extra bytes needed to indicate string length. This is followed by an offset that indicates how far back in the output buffer to begin copying. The extra bytes (if any) of the match-length come at the end of the sequence
I didn't understand that at all! Does anyone have an easy way to understand example?
For example, in the above explanation what is a literal byte and what is a match? How can we have a decoded output buffer when we're just beginning to compress? Length of what?
The explanation at here was also impenetrable for me.
A simple example would be nice unless you have a better way of explaining it.
First, read about LZ77, the core approach being used. The text is a description of a particular way to code a series of literals and string matches in the preceding data.
A match is when the next bytes in the uncompressed data occur in the previously decompressed data. So instead of sending those bytes directly, a length and an offset is sent. Then you go offset bytes backwards and copy length bytes to the output.
Yes, you can't have a match at the beginning of the stream. You have to start with literals. (Unless there is a preset dictionary, which is another topic.)

Sorting 100 unique numbers by using 40bytes of memory

I've been asked a good programming problem:
In the input I've got 100 unique numbers from 0-255(1 byte). I can only read one number at a time and only once. I've got 40 bytes of memory which I can use. The goal is to sort all numbers and print them in the output. I know for sure that the uniqueness of the numbers is very important.
Any ideas?
32 bytes give you 256 bits, just enough to maintain a bit map of which of the 256 possible byte values are seen in the input. One additional byte is used to store the input value. Read each value, mark it in the bitmap, then discard. Once you've read all 100 input values, simply write out the value associated with the bits you set in the bit map.
Then ask what you are supposed to do with the other 7 bytes :)
Since your numbers are unique and they are only 1-byte long, they have to be within 0 to 255. Treat your 40 bytes of storage as a long bit vector. As you read each number, set the appropriate bit in this 320-bit bit-vector. When you're done reading the input, turn around and scan through this bit-vector, printing the number corresponding to each set bit.
In response to #JavaNewb, here is some more detail. First, since a byte contains 8 bits, it can assume only one of 256 possible values, namely, 0 through 255. Armed with this little factoid, you look at the 40-byte storage array you have. This array turns out to have 40 bytes * 8 bits/byte = 320 bits. Since the problem states that each of the 100 1-byte numbers are unique, you know that you will see a particular number (which can range from 0 through 255) at most once. Each time you see a number, you set the corresponding bit in the 40-byte array. For instance, if you encounter the number 50, you set bit number 2 in byte number 6. A number N corresponds to bit N%8 in byte N/8. You are guaranteed to never encounter a set bit in this array since that would imply the existence of duplicates in the 100 numbers. After you've read in all the numbers, you look at the 40-byte array. Each bit that is set in this array corresponds to one of the 100 numbers you read in. By traversing this 40-byte array from the 0th bit in the 0th byte all the way to the 7th bit in the 31st byte, you will by:
Capturing all the numbers that were read in
Observing them in a sorted order
All you have to do now is print the numbers corresponding to the set bits as you traverse the 40-byte array.

Encoding / Error Correction Challenge

Is it mathematically feasible to encode and initial 4 byte message into 8 bytes and if one of the 8 bytes is completely dropped and another is wrong to reconstruct the initial 4 byte message? There would be no way to retransmit nor would the location of the dropped byte be known.
If one uses Reed Solomon error correction with 4 "parity" bytes tacked on to the end of the 4 "data" bytes, such as DDDDPPPP, and you end up with DDDEPPP (where E is an error) and a parity byte has been dropped, I don't believe there's a way to reconstruct the initial message (although correct me if I am wrong)...
What about multiplying (or performing another mathematical operation) the initial 4 byte message by a constant, then utilizing properties of an inverse mathematical operation to determine what byte was dropped. Or, impose some constraints on the structure of the message so every other byte needs to be odd and the others need to be even.
Alternatively, instead of bytes, it could also be 4 decimal digits encoded in some fashion into 8 decimal digits where errors could be detected & corrected under the same circumstances mentioned above - no retransmission and the location of the dropped byte is not known.
I'm looking for any crazy ideas anyone might have... Any ideas out there?
EDIT:
It may be a bit contrived, but the situation that I'm trying to solve is one where you have, let's say, a faulty printer that prints out important numbers onto a form, which are then mailed off to a processing firm which uses OCR to read the forms. The OCR isn't going to be perfect, but it should get close with only digits to read. The faulty printer could be a bigger problem, where it may drop a whole number, but there's no way of knowing which one it'll drop, but they will always come out in the correct order, there won't be any digits swapped.
The form could be altered so that it always prints a space between the initial four numbers and the error correction numbers, ie 1234 5678, so that one would know whether a 1234 initial digit was dropped or a 5678 error correction digit was dropped, if that makes the problem easier to solve. I'm thinking somewhat similar to how they verify credit card numbers via algorithm, but in four digit chunks.
Hopefully, that provides some clarification as to what I'm looking for...
In the absence of "nice" algebraic structure, I suspect that it's going to be hard to find a concise scheme that gets you all the way to 10**4 codewords, since information-theoretically, there isn't a lot of slack. (The one below can use GF(5) for 5**5 = 3125.) Fortunately, the problem is small enough that you could try Shannon's greedy code-construction method (find a codeword that doesn't conflict with one already chosen, add it to the set).
Encode up to 35 bits as a quartic polynomial f over GF(128). Evaluate the polynomial at eight predetermined points x0,...,x7 and encode as 0f(x0) 1f(x1) 0f(x2) 1f(x3) 0f(x4) 1f(x5) 0f(x6) 1f(x7), where the alternating zeros and ones are stored in the MSB.
When decoding, first look at the MSBs. If the MSB doesn't match the index mod 2, then that byte is corrupt and/or it's been shifted left by a deletion. Assume it's good and shift it back to the right (possibly accumulating multiple different possible values at a point). Now we have at least seven evaluations of a quartic polynomial f at known points, of which at most one is corrupt. We can now try all possibilities for the corruption.
EDIT: bmm6o has advanced the claim that the second part of my solution is incorrect. I disagree.
Let's review the possibilities for the case where the MSBs are 0101101. Suppose X is the array of bytes sent and Y is the array of bytes received. On one hand, Y[0], Y[1], Y[2], Y[3] have correct MSBs and are presumed to be X[0], X[1], X[2], X[3]. On the other hand, Y[4], Y[5], Y[6] have incorrect MSBs and are presumed to be X[5], X[6], X[7].
If X[4] is dropped, then we have seven correct evaluations of f.
If X[3] is dropped and X[4] is corrupted, then we have an incorrect evaluation at 3, and six correct evaluations.
If X[5] is dropped and X[4] is corrupted, then we have an incorrect evaluation at 5, and six correct evaluations.
There are more possibilities besides these, but we never have fewer than six correct evaluations, which suffices to recover f.
I think you would need to study what erasure codes might offer you. I don't know any bounds myself, but maybe some kind of MDS code might achieve this.
EDIT: After a quick search I found RSCode library and in the example it says that
In general, with E errors, and K erasures, you will need
* 2E + K bytes of parity to be able to correct the codeword
* back to recover the original message data.
So looks like Reed-Solomon code is indeed the answer and you may actually get recovery from one erasure and one error in 8,4 code.
Parity codes work as long as two different data bytes aren't affected by error or loss and as long as error isn't equal to any data byte while a parity byte is lost, imho.
Error correcting codes can in general handle erasures, but in the literature the position of the erasure is assumed known. In most cases, the erasure will be introduced by the demodulator when there is low confidence that the correct data can be retrieved from the channel. For instance, if the signal is not clearly 0 or 1, the device can indicate that the data was lost, rather than risking the introduction of an error. Since an erasure is essentially an error with a known position, they are much easier to fix.
I'm not sure what your situation is where you can lose a single value and you can still be confident that the remaining values are delivered in the correct order, but it's not a situation classical coding theory addresses.
What algorithmist is suggesting above is this: If you can restrict yourself to just 7 bits of information, you can fill the 8th bit of each byte with alternating 0 and 1, which will allow you to know the placement of the missing byte. That is, put a 0 in the high bit of bytes 0, 2, 4, 6 and a 1 in the high bits of the others. On the receiving end, if you only receive 7 bytes, the missing one will have been dropped from between bytes whose high bits match. Unfortunately, that's not quite right: if the erasure and the error are adjacent, you can't know immediately which byte was dropped. E.g., high bits 0101101 could result from dropping the 4th byte, or from an error in the 4th byte and dropping the 3rd, or from an error in the 4th byte and dropping the 5th.
You could use the linear code:
1 0 0 0 0 1 1 1
0 1 0 0 1 0 1 1
0 0 1 0 1 1 0 1
0 0 0 1 1 1 1 0
(i.e. you'll send data like (a, b, c, d, b+c+d, a+c+d, a+b+d, a+b+c) (where addition is implemented with XOR, since a,b,c,d are elements of GF(128))). It's a linear code with distance 4, so it can correct a single-byte error. You can decode with syndrome decoding, and since the code is self-dual, the matrix H will be the same as above.
In the case where there's a dropped byte, you can use the technique above to determine which one it is. Once you've determined that, you're essentially decoding a different code - the "punctured" code created by dropping that given byte. Since the punctured code is still linear, you can use syndrome decoding to determine the error. You would have to calculate the parity-check matrix for each of the shortened codes, but you can do this ahead of time. The shortened code has distance 3, so it can correct any single-byte errors.
In the case of decimal digits, assuming one goes with first digit odd, second digit even, third digit odd, etc - with two digits, you get 00-99, which can be represented in 3 odd/even/odd digits (125 total combinations) - 00 = 101, 01 = 103, 20 = 181, 99 = 789, etc. So one encodes two sets of decimal digits into 6 total digits, then the last two digits signify things about the first sets of 2 digits or a checksum of some sort... The next to last digit, I suppose, could be some sort of odd/even indicator on each of the initial 2 digit initial messages (1 = even first 2 digits, 3 = odd first two digits) and follow the pattern of being odd. Then, the last digit could be the one's place of a sum of the individual digits, that way if a digit was missing, it would be immediately apparent and could be corrected assuming the last digit was correct. Although, it would throw things off if one of the last two digits were dropped...
It looks to be theoretically possible if we assume 1 bit error in wrong byte. We need 3 bits to identify dropped byte and 3 bits to identify wrong byte and 3 bits to identify wrong bit. We have 3 times that many extra bits.
But if we need to identify any number of bits error in wrong byte, it comes to 30 bits. Even that looks to be possible with 32 bits, although 32 is a bit too close for my comfort.
But I don't know hot to encode to get that. Try turbocode?
Actually, as Krystian said, when you correct a RS code, both the message AND the "parity" bytes will be corrected, as long as you have v+2e < (n-k) where v is the number of erasures (you know the position) and e is the number of errors. This means that if you only have errors, you can correct up to (n-k)/2 errors, or (n-k-1) erasures (about the double of the number of errors), or a mix of both (see Blahut's article: Transform techniques for error control codes and A universal Reed-Solomon decoder).
What's even nicer is that you can check that the correction was successful: by checking that the syndrome polynomial only contains 0 coefficients, you know that the message+parity bytes are both correct. You can do that before to check if the message needs any correction, and also you can do the check after the decoding to check that both the message and the parity bytes were completely repaired.
The bound v+2e < (n-k) is optimal, you cannot do better (that's why Reed-Solomon is called an optimal error correction code). In fact it's possible to go beyond this limit using bruteforce approaches, up to a certain point (you can gain 1 or 2 more symbols for each 8 symbols) using list decoding, but it's still a domain in its infancy, I don't know of any practical implementation that works.

Resources