What is the SHA-256 hash of a single "1" bit?

The definition of SHA-256 appears to be such that the input consisting of a single "1" bit has a well-defined hash value, distinct from that of the "01" byte (since the padding is based on the input's length in bits).
However, due to endianness issues and the fact that no implementations that I can find support feeding in single bits, I can't quite figure out what this correct value is.
So, what is the correct hash of the 1-bit long input consisting of the bit "1"? (not the 8-bit long byte[] { 1 } input).

OK, according to my own implementation:
1-bit string "1":
B9DEBF7D 52F36E64 68A54817 C1FA0711 66C3A63D 384850E1 575B42F7 02DC5AA1
1-bit string "0":
BD4F9E98 BEB68C6E AD3243B1 B4C7FED7 5FA4FEAA B1F84795 CBD8A986 76A2A375
I have tested this implementation on several standard byte-aligned (multiple-of-8-bit) inputs, including the 0-bit string, and the results were correct.
(of course the point of this question was to validate the above outputs in the first place, so use with care...)

Not sure if I understand your question correctly.
SHA-256 operates on blocks of 64 bytes (= 512 bits). This means smaller inputs must be padded first. The result of the padding looks like this:
For the bit 1: 1100000000000...00000000001
For the bits 01: 0110000000000...00000000010
Since these padded blocks are distinct, the outputs of the subsequent compression function are distinct too, and therefore so are the hash values. The standard document explains the padding quite clearly: http://csrc.nist.gov/publications/fips/fips180-2/fips180-2.pdf
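To make the first case concrete, here is a small sketch (my own illustration, not taken from the standard) that builds the single 512-bit padded block for the 1-bit message "1": the message bit, the mandatory '1' padding bit, zeros, and the 64-bit big-endian bit length.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    uint8_t block[64];
    memset(block, 0, sizeof block);

    /* Message bit '1' in the most significant position, immediately
       followed by the mandatory '1' padding bit: 11000000 = 0xC0. */
    block[0] = 0xC0;

    /* Last 8 bytes: the message length in bits as a big-endian 64-bit
       integer; here the length is 1. */
    block[63] = 0x01;

    for (int i = 0; i < 64; i++)
        printf("%02X%c", block[i], (i % 16 == 15) ? '\n' : ' ');
    return 0;
}

This prints 0xC0 followed by 62 zero bytes and a final 0x01, matching the "1100000...0001" layout above.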

There is C code available in section 8 of RFC 4634 to compute the hash of data whose length is not necessarily a multiple of 8 bits. See the methods whose names are SHA*FinalBits(...).
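Using that sample code, hashing the single bit "1" looks roughly like this (a sketch; it assumes the sha.h header and function signatures from the RFC's section 8, where SHA256FinalBits expects the final bits left-aligned in the byte):

#include <stdint.h>
#include <stdio.h>
#include "sha.h"   /* header from the RFC 4634 sample code */

int main(void) {
    SHA256Context ctx;
    uint8_t digest[SHA256HashSize];

    SHA256Reset(&ctx);
    /* No whole bytes to feed; the single message bit "1" goes in as a
       final, left-aligned bit: 0x80 = 1000 0000, bit count = 1. */
    SHA256FinalBits(&ctx, 0x80, 1);
    SHA256Result(&ctx, digest);

    for (int i = 0; i < SHA256HashSize; i++)
        printf("%02X", digest[i]);
    printf("\n");
    return 0;
}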


maximum field number in protobuf message

The official document for protocol buffers https://developers.google.com/protocol-buffers/docs/proto3 says the maximum field number for fields in a protobuf message is 2^29-1. But why this limit?
Can anyone explain it in some detail? I am a newbie to this.
I have read the answers to the related question about why 2^29-1 is the biggest key in protocol buffers, but I am still not clear on the reason.
Each field in an encoded protocol buffer has a header (called key or tag) prefixed to the actual encoded value. The encoding spec defines this key:
Each key in the streamed message is a varint with the value (field_number << 3) | wire_type – in other words, the last three bits of the number store the wire type.
Here the spec says the tag is a varint whose lowest three bits encode the wire type. A varint can encode a 64-bit value, so going by this definition alone the limit would be 2^61-1.
In addition to this, the Language Guide narrows this down to a 32 bit value at max.
The smallest field number you can specify is 1, and the largest is 2^29 - 1, or 536,870,911.
The reasons for this are not given. I can only speculate about the reasons behind it:
Artificial limit as no one is expecting a message to have that many fields. Just think about fitting a message with that many fields into memory.
As the key is a varint, it isn't simply the next 4 bytes in the raw buffer, but rather a variable number of bytes (Java code reading a varint32). Each byte has 7 bits of actual data and 1 bit indicating whether the end is reached. It could be that for performance reasons it was deemed better to limit the range.
Since proto3 is the 3rd version of protocol buffers, it could be that either proto1 or proto2 defined the tag to be a varint32. To keep backwards compatibility this limit is still true in proto3 today.
Because of this line:
#define GOOGLE_PROTOBUF_WIRE_FORMAT_MAKE_TAG(FIELD_NUMBER, TYPE) \
static_cast<uint32>((static_cast<uint32>(FIELD_NUMBER) << 3) | (TYPE))
This line creates a "tag", which leaves only 29 (32 - 3) bits for the field number.
I don't know why Google uses uint32 instead of uint64 here; since the tag is a varint, maybe they thought 2^29-1 fields is large enough for a single message declaration.
I suspect this is simply so that a field-header (wire-type and tag-number) can be decoded and handled as a 32-bit value. The wire-type is always the 3 least significant bits, leaving 29 bits for the tag number. Technically "varint" should support 64 bits, but it makes sense to limit it to reasonable numbers, not least because "varint" encoding means that larger numbers take more bytes to encode.
Edit: I realise now that this is similar to the linked post, but... it remains true! Each field in protobuf is prefixed by a "varint" that expresses which field (tag-number) follows, and what data type it is (wire-type). The latter is important especially so that unexpected fields (version differences) can be stored or skipped correctly. It is convenient for that field-header to be trivially processed by most frameworks, and most frameworks are fine with 32-bit integers.
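As a concrete illustration of that field-header (a sketch of my own, mirroring the quoted macro rather than copying the protobuf source), this is how a tag is built and taken apart as a 32-bit value:

#include <stdint.h>
#include <stdio.h>

/* Same shape as the quoted macro: wire type in the low 3 bits,
   field number in the remaining 29 bits. */
static uint32_t make_tag(uint32_t field_number, uint32_t wire_type) {
    return (field_number << 3) | wire_type;
}

int main(void) {
    uint32_t max_field = (1u << 29) - 1;     /* 536,870,911 */
    uint32_t tag = make_tag(max_field, 2);   /* wire type 2 = length-delimited */

    printf("max field number: %u\n", max_field);
    printf("tag: 0x%08X -> field %u, wire type %u\n",
           tag, tag >> 3, tag & 0x7u);
    return 0;
}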
This is another question rather than a comment. In the documentation it says:
Field numbers in the range 16 through 2047 take two bytes. So you
should reserve the numbers 1 through 15 for very frequently occurring
message elements. Remember to leave some room for frequently occurring
elements that might be added in the future.
Because for the first byte the top 5 bits are used for the field number and the bottom 3 bits for the wire type, isn't it the case that field numbers from 31 (because zero is not used) to 2047 take two bytes? (I also guess the second byte's lower 3 bits are also used for the wire type. I'm in the middle of reading it, so I'll fix this when I know more.)
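To see where the 15/2047 boundaries actually come from, here is a hedged sketch (my own illustration, not the protobuf code): the tag value is (field_number << 3) | wire_type, and a varint stores 7 payload bits per byte, so the byte count can simply be computed.

#include <stdint.h>
#include <stdio.h>

/* Number of bytes a varint needs for a given unsigned value. */
static int varint_len(uint64_t v) {
    int n = 1;
    while (v >= 0x80) { v >>= 7; n++; }
    return n;
}

int main(void) {
    uint32_t fields[] = { 1, 15, 16, 2047, 2048 };
    for (int i = 0; i < 5; i++) {
        uint64_t tag = ((uint64_t)fields[i] << 3) | 2;  /* any wire type gives the same count here */
        printf("field %4u -> tag varint of %d byte(s)\n",
               fields[i], varint_len(tag));
    }
    return 0;
}

With 3 bits taken by the wire type, a one-byte tag leaves 7 - 3 = 4 payload bits for the field number (1..15), and a two-byte tag leaves 14 - 3 = 11 bits (up to 2047).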

Bitmasking--when to use hex vs binary

I'm working on a problem out of Cracking The Coding Interview which requires that I swap odd and even bits in an integer with as few instructions as possible (e.g. bits 0 and 1 are swapped, bits 2 and 3 are swapped, etc.)
The author's solution revolves around using one mask to grab, in one number, the odd bits, and another mask to grab, in another number, the even bits, and then shifting each of them by 1.
I get her solution, but I don't understand how she grabbed the even/odd bits. She creates two bit masks -- both in hex -- for a 32-bit integer. The two are: 0xaaaaaaaa and 0x55555555. I understand she's essentially creating the equivalent of 1010101010... for a 32-bit integer in hexadecimal and then ANDing it with the original number to grab the even/odd bits respectively.
What I don't understand is why she used hex? Why not just code in 10101010101010101010101010101010? Did she use hex to reduce verbosity? And when should you use one over the other?
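For context, the swap being discussed comes out to roughly this in C (a sketch of the idea, not the book's exact code):

#include <stdint.h>
#include <stdio.h>

/* Swap adjacent bit pairs: grab the odd-position bits and the
   even-position bits with the two masks, then shift each group by one. */
static uint32_t swap_odd_even_bits(uint32_t x) {
    return ((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1);
}

int main(void) {
    uint32_t x = 0xB2;   /* 1011 0010 */
    printf("%#010x -> %#010x\n", x, swap_odd_even_bits(x));   /* 0x000000b2 -> 0x00000071 */
    return 0;
}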
It's to reduce verbosity. Binary 10101010101010101010101010101010, hexadecimal 0xaaaaaaaa, and decimal 2863311530 all represent exactly the same value; they just use different bases to do so. The only reason to use one or another is for perceived readability.
Most people would clearly not want to use decimal here; it looks like an arbitrary value.
The binary is clear: alternating 1s and 0s, but with so many, it's not obvious that this is a 32-bit value, or that there isn't an adjacent pair of 1s or 0s hiding in the middle somewhere.
The hexadecimal version takes advantage of chunking. Assuming you recognize that 0x0a == 0b1010, you can mentally picture the 8 groups of 1010 that make up the 32-bit value.
Another possibility would be octal 25252525252, since... well, maybe not. You can see that something is alternating, but unless you use octal a lot, it's not clear what that alternating pattern in binary is.

Are both of these algorithms valid implementations of LZSS?

I am reverse engineering things and I often stumble upon various decompression algorithms. Most of the time, it's LZSS just like Wikipedia describes it:
Initialize dictionary of size 2^n
While output is less than known output size:
Read flag
If the flag is set, output literal byte (and append it at the end of dictionary)
If the flag is not set:
Read length and look-behind position
Transcribe length bytes from the dictionary at the look-behind position to the output, and also append them at the end of the dictionary.
The thing is that the implementations follow two schools of how to encode the flag. The first one treats the input as sequence of bits:
(...)
Read flag as one bit
If it's set, read literal byte as 8 unaligned bits
If it's not set, read length and position as n and m unaligned bits
This involves lots of bit shift operations.
The other one saves a little CPU time by using bitwise operations only for flag storage, whereas literal bytes, length and position are derived from aligned input bytes. To achieve this, it breaks the linearity by fetching a few flags in advance. So the algorithm is modified like this:
(...)
Read 8 flags at once by reading one byte. For each of these 8 flags:
If it's set, read literal as aligned byte
If it's not set, read length and position as aligned bytes (deriving the specific values from the fetched bytes involves some bit operations, but it's nowhere near as expensive as the first version).
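A decoder for this second, byte-aligned variant might look roughly like the sketch below. The overall flow (flag byte, literal bytes, two-byte match records) follows the description above, but the exact packing -- a 12-bit offset, a 4-bit length, and a minimum match length of 3 -- is my assumption; real formats differ, and the sketch assumes well-formed input with offsets that stay inside the already-decoded output.

#include <stddef.h>
#include <stdint.h>

size_t lzss_decode(const uint8_t *in, size_t in_len, uint8_t *out, size_t out_cap) {
    size_t ip = 0, op = 0;

    while (ip < in_len && op < out_cap) {
        uint8_t flags = in[ip++];                      /* 8 flags fetched at once */

        for (int bit = 0; bit < 8 && ip < in_len && op < out_cap; bit++) {
            if (flags & (1u << bit)) {
                out[op++] = in[ip++];                  /* flag set: literal byte */
            } else {
                if (ip + 1 >= in_len) return op;       /* truncated input */
                uint16_t pair = (uint16_t)(in[ip] | (in[ip + 1] << 8));
                ip += 2;
                size_t offset = pair & 0x0FFFu;        /* look-behind distance */
                size_t length = (pair >> 12) + 3;      /* lengths below 3 aren't stored */

                for (size_t i = 0; i < length && op < out_cap; i++) {
                    out[op] = out[op - offset];        /* copy from already-decoded output */
                    op++;
                }
            }
        }
    }
    return op;
}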
My question is: are these both valid LZSS implementations, or did I identify these algorithms wrong? Are there any known names for them?
They are effectively variants on LZSS, since both use one bit to decide between a literal and a match. More generally they are variants on LZ77.
Deflate is also a variant on LZ77, which does not use a whole bit for literal vs. match. Instead deflate has a single code for the combination of literals and lengths, so the code implicitly determines whether the next thing is a literal or a match. A length code is followed by a separate distance code.
lz4 (a specific algorithm, not a family) handles byte alignment in a different way, coding the number of literals, which is necessarily followed by a match. The first byte carries the number of literals in one nibble and part of the match length in the other. The literals are byte aligned, as are the two-byte offset that follows the literals and any additional match-length bytes.

What is the range of possible sha1hash results?

What are the lowest and highest possible results from SHA-1? (bearing in mind that SHA-1 results are actually 5 32-bit values rather than 1 true 160-bit value)
To create a secure hash, the output of the hash must be indistinguishable from random. Many pseudo-random number generators and key derivation methods actually use a hash as the final calculation.
So the "highest" result consists of all ones and the lowest consists of all zeros -- that is, if you interpret the result as an unsigned integer, of course. The chance of getting exactly those values is almost zero, as SHA-1 results should be evenly distributed. But the chance of a result starting with 8 ones is still 1/2^8 == 1/256, which is certainly not insignificant.
Note that the result of SHA-1 should be interpreted as a bit string. Most runtimes don't have a very useful bit-string representation and use an octet string (a.k.a. byte array) instead. I would consider it very annoying if a SHA-1 implementation returned 32-bit words instead of bytes. You don't want to annoy the user with differences between little-endian and big-endian representations, and most other primitives expect their input represented as bytes.
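So, viewed as 160-bit unsigned integers, the possible digests range from twenty 0x00 bytes up to twenty 0xFF bytes. A small sketch of the byte-array form (it uses OpenSSL's one-shot SHA1() purely for illustration):

#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>   /* link with -lcrypto */

int main(void) {
    /* Any SHA-1 digest is 20 bytes; as a big-endian unsigned integer it
       lies between 0 (twenty 0x00 bytes) and 2^160 - 1 (twenty 0xff bytes). */
    const unsigned char msg[] = "hello";
    unsigned char digest[SHA_DIGEST_LENGTH];    /* SHA_DIGEST_LENGTH == 20 */

    SHA1(msg, strlen((const char *)msg), digest);

    for (int i = 0; i < SHA_DIGEST_LENGTH; i++)
        printf("%02x", digest[i]);
    printf("\n");
    return 0;
}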

How does the md5 hashing algorithm compress data to a fixed length?

I know that MD5 produces a 128-bit digest. My question is, how does it produce this fixed-length output from a message that is longer than 128 bits?
EDIT:
I now have a greater understanding of hashing functions. After reading this article I have realized that hash functions are one-way, meaning that you can't convert the hash back to plaintext. I was under the misimpression that you could, due to all the online services "converting" them back to strings, but I have realised that that's just rainbow tables (collections of strings mapped to pre-computed hashes).
When you generate an MD5 hash, you're not compressing the input data. Compression implies that you'll be able to uncompress it back to its original state. MD5, on the other hand, is a one-way process. This is why it's used for password storage; you ideally have to know the original input string to be able to generate the same MD5 result again.
This page provides a nice graphic-equipped explanation of MD5 and similar hash functions, and how they're used: An Illustrated Guide to Cryptographic Hashes
Consider something like starting with a 128-bit value, and taking input 128 bits at a time, and XORing each of those input blocks with the existing value.
MD5 is considerably more complex than that, and it actually processes the input in 512-bit blocks, but the general idea is the same: each input block can change the value of the 128-bit result, but has no effect on its length.
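As a toy illustration of that idea (this is not MD5, just the XOR-folding scheme described above):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy "digest": a fixed 16-byte (128-bit) state, XORed with the input
   16 bytes at a time. The output length never depends on the input length.
   (Not cryptographically meaningful -- it only shows the folding idea.) */
static void toy_digest(const uint8_t *msg, size_t len, uint8_t out[16]) {
    memset(out, 0, 16);                        /* initial 128-bit value */
    for (size_t i = 0; i < len; i += 16) {
        uint8_t block[16] = {0};               /* zero-pad the final block */
        size_t n = (len - i < 16) ? (len - i) : 16;
        memcpy(block, msg + i, n);
        for (int j = 0; j < 16; j++)
            out[j] ^= block[j];                /* fold the block into the state */
    }
}

int main(void) {
    const char *msg = "a message of arbitrary length";
    uint8_t d[16];
    toy_digest((const uint8_t *)msg, strlen(msg), d);
    for (int j = 0; j < 16; j++) printf("%02x", d[j]);
    printf("\n");
    return 0;
}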
It has nothing (or, at best, little) to do with compression. There is an algorithm which produces, for every initial state and input byte, a new state. This state is more or less unique to this combination of inputs.
In short, the input is split into many parts and an operation is applied to each.
If you are wondering about collisions, consider that your message is typically readable text.
The space of all bit strings is much bigger than the space of readable characters.
