How To Distinguish The Extra Bits In Huffman Decoding

I'm trying to decode a file using Huffman coding. Suppose I have the characters AAAAABBBC, and suppose the codes for the distinct characters are:
A: 1
B: 01
C: 00
and the encoded file looks like this: 11111010 10100000
Notice that I don't need the last 3 bits, which are 000. How do I know that I don't need these bits while decoding?

You don't know. You need a way to terminate the bit stream, since it is stored in whole bytes. Either you provide, ahead of the bit stream, the number of bits in it or the number of symbols to decode from it, or you add an end-of-stream symbol that, when encountered, ends the decoding.
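For concreteness, here is a minimal Ruby sketch of the "send the symbol count first" approach, using the code table from the question; the helper names and framing are purely illustrative, not a prescribed format:

CODES  = { "A" => "1", "B" => "01", "C" => "00" }   # table from the question
DECODE = CODES.invert

def encode(text)
  bits = text.chars.map { |c| CODES[c] }.join
  bits += "0" * ((8 - bits.length % 8) % 8)         # pad the last byte with zeros
  [text.length, [bits].pack("B*")]                  # the symbol count travels with the data
end

def decode(count, packed)
  bits = packed.unpack("B*")[0]
  out, buf = "", ""
  bits.each_char do |b|
    buf << b
    if DECODE.key?(buf)
      out << DECODE[buf]
      buf = ""
      break if out.length == count                  # stop before reading the padding
    end
  end
  out
end

count, packed = encode("AAAAABBBC")
decode(count, packed)   #=> "AAAAABBBC"

An end-of-stream symbol works the same way, except that the extra symbol gets its own code and the decoder stops when it sees it instead of counting.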

Related

How to detect that bit input ended?

I'm writing an LZW/Huffman encoder/decoder. The LZW coder sends a number of bits according to its table: if the table has fewer than 2^n elements, it sends n bits. The Huffman coder receives these bits byte by byte and encodes them into a specific number of bits according to its tree.
The problem is that the last byte can contain fewer than 8 bits. If I use an EOF value to detect the end of input when decoding, I can get that value before the actual end of the input. And if I send/receive 4 bytes at a time, treating the first bit as the sign, I lose 1 bit every 4 bytes.
Should I lose those bits, or is there a better solution I don't know about?

ruby base64 encode 128 bit number by starting with a 2 bit character to prevent padding at the end

This question is a follow up to my previous question here: How can I convert a UUID to a string using a custom character set in Ruby?
But I will try to formulate it as a separate and specific question.
I do have a Ruby 128 bit UUID as hex value:
SecureRandom.uuid #=> "2d931510-d99f-494a-8c67-87feb05e1594"
If I understand the IFC specification correctly (http://www.buildingsmart-tech.org/ifc/IFC2x3/TC1/html/ifcutilityresource/lexical/ifcgloballyuniqueid.htm), I want to Base64 encode this, but instead of getting padding at the end, I want the output to begin with a 2-bit character (4 options) instead of a 6-bit one (needed for 64 options).
This way I think I can end up with a string of 22 characters (1 character of 2 bits and 21 characters of 6 bits, for a total of 128 bits).
Is it possible to tweak the Ruby base64 in this way?
Short answer: no. Technically speaking, that is not standard Base64, so Ruby's standard lib will not deal with it.
Ruby's Base64 lib takes its input as bytes, so your input needs a bit length that is divisible by 8. But you want 4 zero bits in front of your UUID, and 4 + 128 = 132, so the next multiple of 8 is 136, i.e. 17 bytes. You can discard the extra randomness at the end:
require 'securerandom'
require 'base64'

x = SecureRandom.gen_random 17       # get a little extra randomness
x[0] = (x[0].ord & 0x0f).chr         # zero out the first four bits
Base64.strict_encode64(x)[0...22]    # encode, then discard the extra randomness
The one downside of this approach is that your 128-bit UUID is oddly aligned inside x and hard to see on its own. If you want to get the 128 bits out, you can do that with some pack/unpack:
# take bits 4..131 (the 128-bit UUID) and re-pack them into 16 bytes
[x.unpack("B*")[0][4...132]].pack("B*")

What would be the contents of canonical Huffman encoded bit stream?

Let's consider the following example, with symbol / code length / canonical code data:
A - 2 - 00
B - 2 - 01
D - 2 - 10
C - 3 - 110
E - 3 - 111
I was wondering what the contents of the encoded bit stream would be. Is it 00 01 10 110 111 (basically all the codes), or the binary equivalent of 2,2,2,3,3 (the corresponding code lengths)? I want to add that some resources say to just transmit the codes as the encoded bit stream, while a few other resources talk about throwing the codes away and transmitting only the code length data.
Encoded bitstream
The code is:
00 01 10 110 111
Note that if we sent the lengths 2,2,2,3,3 instead, it would be impossible to decide whether the input was AAACC or BBBEE (or many other equivalent choices).
Because Huffman codes are a prefix code, we can unambiguously decode the bitstream despite not knowing where the spaces are.
In other words, when given the output 000110110111, we can uniquely decode it as ABDCE.
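As a small illustrative Ruby sketch of that decoding walk, using the table from the question:

TABLE = { "00" => "A", "01" => "B", "10" => "D", "110" => "C", "111" => "E" }

out, buf = "", ""
"000110110111".each_char do |b|
  buf << b
  if TABLE.key?(buf)        # a complete code word has been read
    out << TABLE[buf]
    buf = ""
  end
end
out   #=> "ABDCE"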
Transmitting code table
I think the confusion may be because you need to possess two things to decode the bitstream:
The coded bitstream
The lookup table
These two things are often coded in very different ways.
In many cases the lookup table is fixed in advance so does not need to be transmitted.
However, if the probabilities can change, then we need to tell the recipient what code table to use. In this case we can transmit just the length of each code word, and this gives enough information for the receiver to construct the canonical Huffman code. Alternatives are also possible; for example, we can send the number of code words of each length followed by the symbol values. This alternative is used by JPEG and explained more below.
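As a hedged illustration, here is one way to rebuild the canonical codes from the lengths alone in Ruby. The usual canonical-Huffman convention, assumed here, is to sort symbols by (length, symbol value) and assign consecutive code values, shifting left whenever the length increases:

lengths = { "A" => 2, "B" => 2, "D" => 2, "C" => 3, "E" => 3 }

codes = {}
code = 0
prev_len = 0
lengths.sort_by { |sym, len| [len, sym] }.each do |sym, len|
  code <<= (len - prev_len)                 # append zeros when the length grows
  codes[sym] = code.to_s(2).rjust(len, "0")
  code += 1
  prev_len = len
end
codes   #=> {"A"=>"00", "B"=>"01", "D"=>"10", "C"=>"110", "E"=>"111"}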
Example
The JPEG image codec uses Huffman tables. Normally some default tables are used, but it is possible to optimize the size of images by transmitting a custom Huffman code. A tutorial about this is here.
Another description of the way of transmitting the Huffman table is here. The code lengths are sent (as bytes) followed by the code values (again as bytes).
Code to read it (taken from the link) is:
// Next sixteen bytes are the counts for each code length
u8 counts[16];
for (i = 0; i < 16; i++) {
    counts[i] = fgetc(fp);
    ctr++;
}

// Remaining bytes are the data values to be mapped
// Build the Huffman map of (length, code) -> value
for (i = 0; i < 16; i++) {
    for (j = 0; j < counts[i]; j++) {
        huffData[table][huffKey(i + 1, code)] = fgetc(fp);
        code++;
        ctr++;
    }
    code <<= 1;
}
What you are asking is how to send a description of the code to the receiver, so that the receiver knows how to decode the following code values.
There are many ways of varying levels of sophistication, depending on how much effort you want to put into compressing the description of the code. Peter de Rivaz describes a simple approach used by JPEG, which is to send 16 counts of the number of codes of each length, followed by the byte values of each of those symbols. So for your code that would be (in hex):
00 03 02 00 00 00 00 00 00 00 00 00 00 00 00 00 41 42 44 43 45
That's not terribly compact, and it can't represent one of the possible codes, the flat code consisting of 256 8-bit codes, since you are limited to a count of 255 for each length.
The first thing you can do is cut off the list of counts once you have a complete code. It is easy to calculate how many code patterns are left after each length, so you can simply end the list when there are none left (see the sketch below). Follow that with the symbols. You then have:
00 03 02 41 42 44 43 45
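As an illustrative sketch of that completeness check: track how many code patterns remain after each length, and stop sending counts when none are left.

counts = [0, 3, 2]            # codes of length 1, 2, 3 in the example
remaining = 2                 # two possible patterns of length 1 to start with
counts.each_with_index do |n, i|
  remaining = (remaining - n) * 2   # each unused pattern splits into two longer ones
  puts "after length #{i + 1}: #{remaining} patterns left"
end
# prints 4, 2, 0 -- the code is complete after length 3, so the counts stop there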
We don't need eight bits for each count, since the counts are constrained. For example, you can't have more than two one-bit codes. So we could code these counts in fewer bits, e.g. n+1 bits for the count of codes of length n. So two bits, three bits, and so on until the code is complete. For your code, now in binary:
00 011 0010
followed by the bytes 41 42 44 43 45, offset in the bit stream appropriately. Now the list of counts takes nine bits instead of 24. Since we know that there can only be 256 symbols, we can cap the number of bits for each count at nine, allowing for a count of 256 and solving the previous problem of not being able to represent the flat code. Then if the code is limited to 16 bits in length (as it is for JPEG), the largest number of bytes needed for the counts is 14.5, less than the original 16. Often the counts will end before 14.5 bytes.
You can get even more sophisticated, noting that at each code length, you have a limit on the possible count of codes of that length due to the shorter code lengths using up patterns. Then the number of bits for each count can be variable, based on how many possible values there are. Then the counts description would be:
00 011 10, then the eight-bit values 41 42 44 43 45
Since no preceding patterns are used up for lengths one and two, those counts still need two and three bits respectively. However, we now have only three possibilities left for length three: the counts 0, 1, or 2. A count of 3 would oversubscribe the code. So we can use two bits for that last one. It is now seven bits instead of nine, and this greatly reduces the number of bits in the counts for codes that use longer code lengths.
An entirely different scheme is the one used by the deflate format (used in zip, gzip, zlib, png, etc.). There the number of code lengths to follow is sent first, followed by the code length of each symbol in order, up to the last one. The symbols themselves are implied by the position of each code length. That results in lots of zeros, representing symbols that are not present. So for your code there would be a 70 to go up to symbol 69 ("E"), followed by 65 zeros, then 2 2 3 2 3. That seems awfully long, and it is. deflate then run-length codes and Huffman codes that list of lengths to compress it. The long strings of zeros get compressed to a few bits, and the short lengths are also just a few bits each. So then you have to first send a description of the code used for the code lengths (!) so that you can decode that.
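For concreteness, a small Ruby sketch of that per-symbol length list (before deflate's run-length and Huffman compression of it):

lengths = { "A" => 2, "B" => 2, "C" => 3, "D" => 2, "E" => 3 }
last = lengths.keys.map(&:ord).max          # 69, the value of "E"
list = (0..last).map { |v| lengths[v.chr] || 0 }
list.length    #=> 70 code lengths to send
list.last(5)   #=> [2, 2, 3, 2, 3] for "A".."E"; the first 65 entries are zeros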
You can read the deflate specification for more information on that scheme. brotli uses a similar scheme, with more sophistication still.

Handling 32-bit ifIndex inside of SNMP

In developing my own SNMP poller, I've come across the problem of being able to poll devices with 32-bit interface indexes. I can't find anything out there explaining how to convert the hex (5 bytes) into the 32-bit integer, or from the integer into hex, as it doesn't use simple hex conversion. For example, the interface index is 436219904. While doing a pcap with an snmpget, I see the hex for this is 81 d0 80 e0 00, which makes no sense. I cannot for the life of me figure out how that converts to an integer value. I've tried to find an RFC dealing with this and have had no luck. The 16-bit interface values convert as they should: 0001 = 1 and so on. Only the 32-bit ones seem to be giving me this problem. Any help is appreciated.
SNMP uses ASN.1 syntax to encode data. Thus, you need to learn the BER rules,
http://en.wikipedia.org/wiki/X.690
For your case, I can say you looked at the wrong data: if 436219904 were encoded as an Integer32 in SNMP, the bytes would be 1A 00 30 00.
I guess you have missed some details in the analysis, so you might want to do it once again and add more descriptions (a screenshot and so on) to enrich your question.
I suspect the key piece of info missing from your question is that the ifIndex value in question, as used in your polling, is an index into the table being polled (you don't mention which, but we could assume ifTable). That means it is encoded as a subidentifier of the OID being polled (give me [some property] for [this ifIndex]), rather than as a returned value (give me [the ifIndex] for [some other row of some other table]).
Per X.209 (the version of the ASN.1 Basic Encoding Rules used by SNMP), subidentifiers in OIDs (other than the first two) are encoded in one or more octets (8 bits), with the highest-order bit used as a continuation bit (i.e. "the next octet is part of this subidentifier too") and the remaining 7 bits for the actual value.
In other words, in your value "81 d0 80 e0 00", the highest bit is set in each of the first 4 octets and cleared in the last octet: this is how you know there are 5 octets in the subidentifier. The remaining 7 bits of each of those octets are concatenated to arrive at the integer value.
The converse of course is that to encode an integer value into a subidentifier of an OID, you have to build it 7 bits at a time.
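A minimal Ruby sketch of both directions, assuming only the plain base-128 rule described above (it ignores the special packing of the first two OID components, which does not apply to a table index like this):

def decode_subid(bytes)
  bytes.reduce(0) { |acc, b| (acc << 7) | (b & 0x7f) }   # 7 value bits per octet
end

def encode_subid(value)
  bytes = [value & 0x7f]                    # last octet: continuation bit clear
  value >>= 7
  while value > 0
    bytes.unshift((value & 0x7f) | 0x80)    # leading octets: continuation bit set
    value >>= 7
  end
  bytes
end

decode_subid([0x81, 0xd0, 0x80, 0xe0, 0x00])    #=> 436219904
encode_subid(436219904).map { |b| "%02x" % b }  #=> ["81", "d0", "80", "e0", "00"]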

Convert a byte to something human-readable

I've got a file consisting of a bunch of 'rows' of data (not actually on separate lines). As far as I can tell from the description (PDF), each 'row' has forty bytes precisely, as follows:
4 bytes I can ignore.
2 bytes that are the byte representations of the (decimal) integer 240 or 241 — that is, 00 F0 or 00 F1 — depending on 'row'.
4 bytes I can ignore.
11 bytes: the first few bytes are an ASCII string, which I need, and the rest is padding by 00 bytes.
1 byte I can ignore.
1 byte I need. This seems from the documentation to be ASCII 'B', 'S', or '0'.
1 byte that's the byte representation of a small integer — that is, 00, 01, or the like.
4 bytes that are the byte representation of an integer.
4 bytes that are the byte representation of an integer.
4 bytes that are the byte representation of an integer.
4 bytes that are the byte representation of an integer.
And there's nothing else in the file (no file header, for example). (I may well be wrong in my understanding of the linked-to documentation, and would appreciate any correction; more info may be available elsewhere (PDF).)
I wish to:
convert each byte representation of a number to a human-readable representation of the number and each 00 byte to (say) a human-readable 0, and
convert the file to comma-separated or the like.
Now, step (2) should be doable using sed; I mention it only because I want to make sure step (1) is done in such a manner as to allow step (2) (for example, keeping track of how many bytes are in each field when doing step (1)). But step (1) I have no idea how to do. Can anyone help?
As a caveat, please note that I'm comfortable with sed and bash and can handle perl, but have no experience with real programming; and that, alas, I'm doing this on a Windows machine I don't have installation-of-programs rights on, so (although I have a sed port) I don't have bash. So, basically, I need to do this in sed or a Windows (DOS) command-line script. (I should be able to download the files to another machine, work with them, and upload them, though, if that turns out to be necessary.)
Perl has an unpack function you could use.
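As an illustration, here is a sketch in Ruby, whose String#unpack takes essentially the same template language as Perl's unpack. The file name is made up, and the four trailing integers are assumed to be big-endian, as the 00 F0 example suggests:

# skip 4, uint16, skip 4, NUL-padded string, skip 1, one char, uint8, four uint32s
FORMAT = "x4 n x4 Z11 x a C N4"

File.open("data.bin", "rb") do |f|                 # hypothetical input file
  while (record = f.read(40))
    break if record.bytesize < 40                  # ignore a trailing partial record
    puts record.unpack(FORMAT).join(",")           # one comma-separated row per record
  end
end

The same template string should work with Perl's unpack on a 40-byte chunk read in binary mode.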
