Could a CRC32 key with a most or least significant bit of 0 be valid?

I have a server receiving UDP packets whose payload is a number of CRC32-checksummed 4-byte words. The header of each UDP packet has a 2-byte field holding the "repeating" key used for the words in the payload. The way I understand it, in CRC32 the key must start and end with a 1 in its binary representation; in other words, the least and most significant bits of the key must be 1, not 0.

My issue is that, for example, the first UDP packet received has the key field reading 0x11BC, which has the binary representation 00010001 10111100. The 1s are neither right- nor left-aligned within the key field; there are 0s on both ends. Is my understanding of valid CRC32 keys wrong, then? I ask because I'm trying to write code to check each word using the key as-is, and it always gives a non-zero remainder, meaning every word in the payload has an error, yet the instructions I've been given guarantee that the first packet in the sample has no errors.

Although it is true that CRC polynomials always have the top and bottom bit set, often this is dealt with implicitly; a 32-bit CRC is actually a 33-bit calculation and the specified polynomial ordinarily omits the top bit.
So e.g. the standard quoted polynomial for a CCITT CRC16 is 0x1021, which does not have its top bit set.
It is normal to include the LSB, so if you're certain you know which way around the polynomial has been specified then either the top or the bottom bit of your word should be set.
However, for UDP purposes, have you possibly also made a byte-ordering error on one side of the connection or the other? Network byte order is conventionally big endian, whereas most processors today are little endian. Is one side of the link switching byte order but not the other?
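If it helps, here is a minimal bit-at-a-time sketch of that kind of check. It is only a guess at the setup, not your actual protocol: it assumes the 2-byte key field is the polynomial quoted without its implicit top bit, a zero initial register, no final XOR, and that each 32-bit payload word is data followed by its CRC, so a zero remainder means the word is error free. The function names are made up, and ntohl/ntohs mark where the byte-order conversion mentioned above would go.

#include <stdint.h>
#include <arpa/inet.h>   /* ntohl/ntohs for the byte-order point above */

/* Bit-serial CRC over one received 32-bit word, MSB first.
 * "poly" is the divisor with its implicit top (x^16) term omitted,
 * the way polynomials are usually quoted (e.g. 0x1021 for CCITT CRC-16). */
static uint16_t crc16_remainder(uint32_t word, uint16_t poly)
{
    uint16_t crc = 0;                          /* assumed initial value */
    for (int i = 31; i >= 0; i--) {
        uint16_t inbit  = (uint16_t)((word >> i) & 1u);
        uint16_t topbit = (uint16_t)((crc >> 15) & 1u);
        crc = (uint16_t)(crc << 1);
        if (topbit ^ inbit)                    /* implicit top bit of the divisor */
            crc ^= poly;
    }
    return crc;                                /* 0 => word passes the check */
}

/* Example use: convert both fields out of network byte order first. */
static int word_ok(uint32_t raw_word_be, uint16_t raw_key_be)
{
    return crc16_remainder(ntohl(raw_word_be), ntohs(raw_key_be)) == 0;
}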

Related

maximum field number in protobuf message

The official document for protocol buffers, https://developers.google.com/protocol-buffers/docs/proto3, says the maximum field number for fields in a protobuf message is 2^29 - 1. But why this limit?
Can anyone please explain in some detail? I am new to this.
I read the answers to this question about why 2^29 - 1 is the biggest key in protocol buffers, but I am still not clear on it.
Each field in an encoded protocol buffer has a header (called key or tag) prefixed to the actual encoded value. The encoding spec defines this key:
Each key in the streamed message is a varint with the value (field_number << 3) | wire_type – in other words, the last three bits of the number store the wire type.
Here the spec says the tag is a varint whose lowest 3 bits encode the wire type. A varint can encode a 64-bit value, so going by this definition alone the limit would be 2^61 - 1.
In addition to this, the Language Guide narrows it down to a 32-bit value at most.
The smallest field number you can specify is 1, and the largest is 2^29 - 1, or 536,870,911.
The reasons for this are not given. I can only speculate about the reasons behind it:
An artificial limit, as no one expects a message to have that many fields. Just think about fitting a message with that many fields into memory.
As the key is a varint, it isn't simply the next 4 bytes in the raw buffer, but rather a variable-length run of bytes (Java code reading a varint32). Each byte holds 7 bits of actual data and 1 bit indicating whether the end has been reached. It could be that for performance reasons it was deemed better to limit the range.
Since proto3 is the third version of protocol buffers, it could be that either proto1 or proto2 defined the tag as a varint32. To keep backward compatibility, this limit still holds in proto3 today.
Because of this line:
#define GOOGLE_PROTOBUF_WIRE_FORMAT_MAKE_TAG(FIELD_NUMBER, TYPE) \
static_cast<uint32>((static_cast<uint32>(FIELD_NUMBER) << 3) | (TYPE))
This macro creates a "tag", which leaves only 29 (32 - 3) bits to store the field index.
I don't know why Google uses uint32 instead of uint64 here, since the field number is a varint; maybe they decided that 2^29 - 1 fields is large enough for a single message declaration.
I suspect this is simply so that a field-header (wire-type and tag-number) can be decoded and handled as a 32-bit value. The wire-type is always the 3 least significant bits, leaving 29 bits for the tag number. Technically "varint" should support 64 bits, but it makes sense to limit it to reasonable numbers, not least because "varint" encoding means that larger numbers take more bytes to encode.
Edit: I realise now that this is similar to the linked post, but... it remains true! Each field in protobuf is prefixed by a "varint" that expresses what field (tag-number) follows, and what data type it is (wire-type). The latter is important especially so that unexpected fields (version differences) can be stored or skipped correctly. It is convenient for that field-header to be trivially processed by most frameworks, and most frameworks are fine with 32-bit integers.
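To make the arithmetic concrete, here is a small stand-alone sketch (not protobuf library code; the helper names are made up) that mirrors the MAKE_TAG macro quoted above and counts how many bytes the varint encoding of a field header needs:

#include <stdint.h>
#include <stdio.h>

/* (field_number << 3) | wire_type, as in the macro quoted above:
 * the low 3 bits carry the wire type. */
static uint32_t make_tag(uint32_t field_number, uint32_t wire_type)
{
    return (field_number << 3) | wire_type;
}

/* Bytes needed by the varint encoding: 7 payload bits per byte. */
static int varint_size(uint32_t v)
{
    int n = 1;
    while (v >= 0x80) { v >>= 7; n++; }
    return n;
}

int main(void)
{
    /* A 32-bit tag minus 3 wire-type bits leaves 29 bits for the number. */
    printf("max field number: %u\n", (1u << 29) - 1);   /* 536870911 */

    /* Field numbers 1..15 keep the whole tag below 128 (one varint byte);
     * 16..2047 keep it below 16384 (two bytes). Wire type 2 is used as an example. */
    printf("field 15:   %d byte(s)\n", varint_size(make_tag(15, 2)));
    printf("field 16:   %d byte(s)\n", varint_size(make_tag(16, 2)));
    printf("field 2047: %d byte(s)\n", varint_size(make_tag(2047, 2)));
    printf("field 2048: %d byte(s)\n", varint_size(make_tag(2048, 2)));
    return 0;
}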
This is another question rather than an answer, more of a comment. In the document it says:
Field numbers in the range 16 through 2047 take two bytes. So you should reserve the numbers 1 through 15 for very frequently occurring message elements. Remember to leave some room for frequently occurring elements that might be added in the future.
Since, for the first byte, the top 5 bits are used for the field number and the bottom 3 bits for the field type, shouldn't field numbers from 31 (because zero is not used) to 2047 take two bytes? (And I also guess the second byte's lower 3 bits are used for the field type as well... I'm in the middle of reading it, so I'll fix this once I know.)

How to detect that bit input ended?

I am writing an LZW/Huffman encoder/decoder. The LZW coder sends a number of bits according to its table: if the table has fewer than 2^n elements, it sends n bits. The Huffman coder receives these bits byte by byte and encodes them into a specific number of bits according to its tree.
So the problem is that the last byte can contain fewer than 8 bits. If I use an EOF value to detect the end of input when decoding, I can get that value before the actual end of the input. And if I send/receive 4 bytes at a time with the first bit used for the sign, I lose 1 bit every 4 bytes.
Should I just lose those bits, or is there a better solution I don't know about?
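For reference, here is a minimal bit-writer sketch (the names are made up) showing where the ambiguity comes from: the final byte gets zero-padded, and nothing in the stream itself tells the reader how many of those last bits are real.

#include <stdint.h>
#include <stdio.h>

/* Packs bits MSB-first into bytes and writes full bytes to a stream. */
typedef struct {
    FILE   *out;
    uint8_t cur;    /* byte currently being filled */
    int     nbits;  /* bits already placed in cur */
} BitWriter;

static void put_bit(BitWriter *w, int bit)
{
    w->cur = (uint8_t)((w->cur << 1) | (bit & 1));
    if (++w->nbits == 8) {
        fputc(w->cur, w->out);
        w->cur = 0;
        w->nbits = 0;
    }
}

static void bw_flush(BitWriter *w)
{
    if (w->nbits > 0) {
        /* Zero-pad the final byte. On the decoding side these padding
         * bits look exactly like code bits, which is the problem the
         * question describes. */
        w->cur = (uint8_t)(w->cur << (8 - w->nbits));
        fputc(w->cur, w->out);
        w->nbits = 0;
    }
}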

Are both of these algorithms valid implementations of LZSS?

I am reverse engineering things and I often stumble upon various decompression algorithms. Most of the time it's LZSS, just like Wikipedia describes it:
Initialize dictionary of size 2^n
While output is less than known output size:
Read flag
If the flag is set, output literal byte (and append it at the end of dictionary)
If the flag is not set:
Read length and look behind position
Transcribe length bytes from the dictionary at the look-behind position to the output, and also append them to the end of the dictionary.
The thing is that implementations follow two schools of thought on how to encode the flag. The first one treats the input as a sequence of bits:
(...)
Read flag as one bit
If it's set, read literal byte as 8 unaligned bits
If it's not set, read length and position as n and m unaligned bits
This involves lots of bit shift operations.
The other one saves a little CPU time by using bitwise operations only for flag storage, whereas literal bytes, length and position are derived from aligned input bytes. To achieve this, it breaks the linearity by fetching a few flags in advance. So the algorithm is modified like this:
(...)
Read 8 flags at once by reading one byte. For each of these 8 flags:
If it's set, read literal as aligned byte
If it's not set, read length and position as aligned bytes (deriving the specific values from the fetched bytes involves some bit operations, but it's nowhere as expensive as the first version.)
My question is: are these both valid LZSS implementations, or did I identify these algorithms wrong? Are there any known names for them?
They are both effectively variants on LZSS, since each uses one bit to decide between a literal and a match. More generally, they are variants on LZ77.
Deflate is also a variant on LZ77, which does not use a whole bit for literal vs. match. Instead deflate has a single code for the combination of literals and lengths, so the code implicitly determines whether the next thing is a literal or a match. A length code is followed by a separate distance code.
lz4 (a specific algorithm, not a family) handles byte alignment in a different way, coding the number of literals, which is necessarily followed by a match. The first byte, which carries the number of literals, also carries part of the match length. The literals are byte aligned, as are the two-byte offset that follows the literals and any remaining match-length bytes.
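For concreteness, here is a rough decoder sketch of the second (prefetched flag byte) variant from the question. The LSB-first flag order, the packing of length and position into two bytes, the +3 length bias and the function name are all assumptions chosen for illustration; real formats differ in exactly these details, and the sketch skips most error checking.

#include <stdint.h>
#include <stddef.h>

static size_t lzss_decode(const uint8_t *in, size_t in_len,
                          uint8_t *out, size_t out_cap)
{
    size_t ip = 0, op = 0;

    while (ip < in_len && op < out_cap) {
        uint8_t flags = in[ip++];                 /* 8 flags fetched at once */

        for (int bit = 0; bit < 8 && ip < in_len && op < out_cap; bit++) {
            if (flags & (1u << bit)) {
                out[op++] = in[ip++];             /* flag set: literal byte */
            } else {
                if (ip + 2 > in_len)
                    return op;
                /* flag clear: copy from already-produced output, which
                 * doubles as the dictionary */
                uint8_t b0 = in[ip++];
                uint8_t b1 = in[ip++];
                size_t len  = (size_t)(b0 >> 4) + 3;                 /* assumed layout */
                size_t dist = (((size_t)(b0 & 0x0F) << 8) | b1) + 1; /* assumed layout */

                while (len-- && op < out_cap && dist <= op) {
                    out[op] = out[op - dist];
                    op++;
                }
            }
        }
    }
    return op;   /* number of bytes produced */
}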

What checksumming technique will let me calculate the checksum of the whole from the checksums of its parts?

I'd like to send several chunks of data over the network by simply concatenating them together, and I'd like to be able to use a checksum to verify that everything made it over ok (this is mainly intended as a defensive check against bugs, not because I've seen or expect actual low-level data corruption). So I'm looking for a checksumming algorithm that will let me calculate the checksum of the whole from the checksums of the parts.
One simple example of such a technique that I think would work is to treat each byte of each chunk as an integer and add all of those integers together (that, of course, wouldn't detect missing zero bytes). Another would be to just calculate the lengths of each chunk (that, of course, wouldn't detect data changes that don't cause net insertions or deletions). Another, I believe, would be to XOR all the bytes together, but this would only generate 1-byte checksums (I could take the bytes 4 at a time and XOR each of those units, but if chunk lengths aren't multiples of four, I'd have to get into messiness that would probably remove the ability to simply concatenate different chunks together).
So, I'm looking for a more serious checksumming or hashing algorithm that still would let me easily calculate the checksum for several concatenated chunks given the checksums of each chunk. Do any exist?
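As a concrete illustration of the first technique mentioned in the question, a byte-wise sum combines across chunks by plain addition (with the weaknesses already noted, e.g. missing zero bytes go undetected). A minimal sketch with made-up names:

#include <stdint.h>
#include <stddef.h>

/* Treat every byte as an integer and add them, modulo 2^32. */
static uint32_t byte_sum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}

/* Checksum of chunk A followed by chunk B, computed from the parts.
 * Addition is associative, so this equals byte_sum over the concatenation. */
static uint32_t byte_sum_combine(uint32_t sum_a, uint32_t sum_b)
{
    return sum_a + sum_b;   /* wraps modulo 2^32 */
}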
Correctly transferring messages over TCP requires a protocol. This protocol must define the start, the end, or preferably both, of each message. This means that you will always know the length of messages. If your protocol puts the length at the start and an integer checksum at the end of each message, missing zero bytes will be detected, because the transmitted checksum will then be recovered from the wrong bytes in the stream and so will be wrong, on average, 65535 times out of 65536.

What is the SHA-256 hash of a single "1" bit?

The definition of SHA-256 appears to be such that the input consisting of a single "1" bit has a well-defined hash value, distinct from that of the "01" byte (since the padding is based on the input's length in bits).
However, due to endianness issues and the fact that no implementations that I can find support feeding in single bits, I can't quite figure out what this correct value is.
So, what is the correct hash of the 1-bit long input consisting of the bit "1"? (not the 8-bit long byte[] { 1 } input).
OK, according to my own implementation:
1-bit string "1":
B9DEBF7D 52F36E64 68A54817 C1FA0711 66C3A63D 384850E1 575B42F7 02DC5AA1
1-bit string "0":
BD4F9E98 BEB68C6E AD3243B1 B4C7FED7 5FA4FEAA B1F84795 CBD8A986 76A2A375
I have tested this implementation on several standard multiples-of-8-bits inputs, including the 0-bit string, and the results were correct.
(of course the point of this question was to validate the above outputs in the first place, so use with care...)
Not sure if I understand your question correctly.
SHA-256 operates with block sizes of 64 bytes (=512bits). This means smaller inputs must be padded first. The result of the padding looks like this:
For Bit 1: 1100000000000...00000000001
For Bits 01: 0110000000000...00000000010
As these results are distinct, the results of the subsequent compression functions will be too, and therefore so are the hash values. The standard document explains the padding quite clearly: http://csrc.nist.gov/publications/fips/fips180-2/fips180-2.pdf
There is C code available in section 8 of RFC 4634 to compute the hash of data that is not necessarily a multiple of 8 bits. See the methods whose names are SHA*FinalBits(...).
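For example, with the RFC 4634 reference code the single "1" bit would be hashed roughly as in the sketch below. As far as I recall, SHA256FinalBits takes the trailing bits left-aligned in its byte argument, so a lone "1" bit is passed as 0x80 with a bit count of 1; double-check this against the RFC's own source.

#include <stdio.h>
#include <stdint.h>
#include "sha.h"                    /* from RFC 4634, section 8 */

int main(void)
{
    SHA256Context ctx;
    uint8_t digest[SHA256HashSize];

    SHA256Reset(&ctx);
    /* No whole bytes to feed; SHA256Input(&ctx, data, len) would go here. */
    SHA256FinalBits(&ctx, 0x80, 1); /* one final bit, value 1, left-aligned */
    SHA256Result(&ctx, digest);

    for (int i = 0; i < SHA256HashSize; i++)
        printf("%02X", digest[i]);
    printf("\n");
    return 0;
}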
