Padding the message in SHA256 - algorithm

I am trying to understand SHA256. On the Wikipedia page it says:
append the bit '1' to the message
append k bits '0', where k is the minimum number >= 0 such that the resulting message
length (modulo 512 in bits) is 448.
append length of message (without the '1' bit or padding), in bits, as 64-bit big-endian
integer
(this will make the entire post-processed length a multiple of 512 bits)
So if my message is 01100001 01100010 01100011 I would first add a 1 to get
01100001 01100010 01100011 1
Then you would fill in 0s so that the total length is congruent to 448 (mod 512):
01100001 01100010 01100011 10000000 0000 ... 0000
(So in this example, one would add 448 - 25 = 423 zero bits.)
My question is: What does the last part mean? I would like to see an example.

It means the message length in bits, written out as a 64-bit integer with its bytes ordered by significance. So if the message length is 37113 bits, that's 90 f9 in hex; two bytes. There are two basic(*) ways to represent this as a 64-bit integer,
00 00 00 00 00 00 90 f9 # big endian
and
f9 90 00 00 00 00 00 00 # little endian
The former convention follows the way numbers are usually written out in decimal: one hundred and two is written 102, with the most significant part (the "big end") being written first, the least significant ("little end") last. The reason that this is specified explicitly is that both conventions are used in practice; internet protocols use big endian, Intel-compatible processors use little endian, so if they were decimal machines, they'd write one hundred and two as 201.
(*) Actually there are 8! = 40320 ways to represent a 64-bit integer if 8-bit bytes are the smallest units to be permuted, but two are in actual use.
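To make this concrete, here is a minimal Python sketch of the whole padding procedure for the "abc" message from the question (sha256_pad is just an illustrative name, and the sketch assumes the message is a whole number of bytes); the last eight bytes of its output are the 64-bit big-endian length:
import struct

def sha256_pad(message):
    # Append the '1' bit, the '0' bits, and the 64-bit big-endian bit length.
    bit_len = len(message) * 8
    padded = message + b"\x80"                     # '1' bit followed by seven '0' bits
    padded += b"\x00" * ((56 - len(padded)) % 64)  # '0' bits until the length is 448 (mod 512)
    padded += struct.pack(">Q", bit_len)           # length as a 64-bit big-endian integer
    return padded

print(sha256_pad(b"abc").hex())
# 61626380, then 52 zero bytes, then 0000000000000018 (0x18 = 24 bits)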

Related

character encoding - how utf-8 handles characters

So, I know that when we type characters, each character maps to a number in a character set, and then that number is transformed into a binary format so a computer can understand it. The way that number is transformed into a binary format (how many bits get allocated) depends on the character encoding.
So, if I type L, it represents 76. Then 76 gets transformed into a 1-byte binary format because of, let's say, UTF-8.
Now, I've read the following somewhere:
The Devanagari character क, with code point 2325 (which is 915 in
hexadecimal notation), will be represented by two bytes when using the
UTF-16 encoding (09 15), three bytes with UTF-8 (E0 A4 95), or four
bytes with UTF-32 (00 00 09 15).
So, as you can see, it says three bytes with UTF-8 (E0 A4 95). How are E0 A4 95 bytes? I am asking because I have no idea where E0 A4 95 came from... Why do we need this? If we know that the code point is 2325, then in order to use UTF-8 all we have to do is know that UTF-8 needs 3 bytes to transform 2325 into binary... Why do we need E0 A4 95 and what is it?
E0 A4 95 is the 3-byte UTF-8 encoding of U+0915. In binary:
E   0    A   4    9   5    (hex)
11100000 10100100 10010101 (binary)
1110xxxx 10xxxxxx 10xxxxxx (3-byte UTF-8 encoding pattern)
    0000   100100   010101 (data bits)
00001001 00010101          (regrouped data bits to 8-bit bytes)
0   9    1   5             (hex)
U+0915                     (Unicode code point)
The first byte's binary pattern 1110xxxx is a lead byte indicating a 3-byte encoding and carrying 4 bits of data. Continuation bytes start with 10xxxxxx and each provides 6 more bits of data. A 3-byte lead byte is always followed by exactly two continuation bytes.
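To see where E0 A4 95 comes from, here is a small Python sketch that builds the three bytes by hand following that pattern (utf8_encode_3byte is a made-up helper name; Python's built-in codec produces the same bytes):
def utf8_encode_3byte(code_point):
    # Hand-rolled 3-byte UTF-8 encoding, valid for U+0800 through U+FFFF.
    b1 = 0b11100000 | (code_point >> 12)          # lead byte 1110xxxx: top 4 data bits
    b2 = 0b10000000 | ((code_point >> 6) & 0x3F)  # continuation 10xxxxxx: middle 6 bits
    b3 = 0b10000000 | (code_point & 0x3F)         # continuation 10xxxxxx: low 6 bits
    return bytes([b1, b2, b3])

print(utf8_encode_3byte(0x0915).hex())  # e0a495
print("क".encode("utf-8").hex())        # e0a495, same result from the built-in codec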
For more information read the Wikipedia article on UTF-8 and the standard RFC-3629.

What is the byte/bit order in this Microsoft document?

This is the documentation for the Windows .lnk shortcut format:
https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-shllink/16cb4ca1-9339-4d0c-a68d-bf1d6cc0f943
The ShellLinkHeader structure is described like this:
This is a file:
Looking at HeaderSize, the bytes are 4c 00 00 00 and it's supposed to mean 76 decimal. This is a little-endian integer, no surprise here.
Next is the LinkCLSID with the bytes 01 14 02 00 00 00 00 00 c0 00 00 00 00 00 00 46, representing the value "00021401-0000-0000-C000-000000000046". This answer seems to explain why the byte order changes: the last 8 bytes are a byte array, while the preceding fields are little-endian numbers.
My question is about the LinkFlags part.
The LinkFlags part is described like this:
And the bytes in my file are 9b 00 08 00, or in binary:
9    b    0    0    0    8    0    0
1001 1011 0000 0000 0000 1000 0000 0000
 ^
By comparing different files I found out that the bit marked with ^ is bit 6/G in the documentation (marked in red).
How to interpret this? The bytes are in the same order as in the documentation but each byte has its bits reversed?
The issue here springs from the fact that the list of bits shown in these specs is not meant to have a number fitted underneath it at all. It is meant to fit a list of bits underneath it, and that list runs from the lowest bit to the highest bit, which is the complete inverse of how we read numbers from left to right.
The list clearly shows bits numbered from 0 to 31, though, meaning this is indeed one 32-bit value, and not four separate bytes. Specifically, this means the bytes read from the file need to be interpreted as a single 32-bit integer before doing anything else. Like with all other values, that means reading it as a little-endian number, with its bytes reversed.
So your 9b 00 08 00 becomes 0008009b, or, in binary, 0000 0000 0000 1000 0000 0000 1001 1011.
But, as I said, that list in the specs shows the bits from lowest to highest. So to fit them under that, reverse the binary version:
0           1            2           3
0123 4567 8901 2345 6789 0123 4567 8901
ABCD EFGH IJKL MNOP QRST UVWX YZ#_ ____
---------------------------------------
1101 1001 0000 0000 0001 0000 0000 0000
       ^
So bit 6, indicated in the specs as 'G', is 0.
This whole thing makes a lot more sense if you invert the specs, though, and list the bits logically, from highest to lowest:
 3           2            1           0
1098 7654 3210 9876 5432 1098 7654 3210
____ _#ZY XWVU TSRQ PONM LKJI HGFE DCBA
---------------------------------------
0000 0000 0000 1000 0000 0000 1001 1011
                               ^
0    0    0    8    0    0    9    b
This makes the alphabetic references look a lot less intuitive, but it does perfectly fit the numeric version underneath. The marked bit matches your findings (it falls in the nibble you read as '9'), and you can also clearly see that the highest 5 bits are unused.
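Here is a short Python sketch of that interpretation: read the four LinkFlags bytes as one little-endian 32-bit integer and test individual bits (the flag names in the comments are the ones the MS-SHLLINK table gives for bits A and G):
import struct

raw = bytes.fromhex("9b000800")     # the LinkFlags bytes as they appear in the file
flags, = struct.unpack("<I", raw)   # read them as a little-endian 32-bit integer
print(hex(flags))                   # 0x8009b

def bit(value, n):
    return (value >> n) & 1

print(bit(flags, 0))  # 1 -> bit 0 / A (HasLinkTargetIDList) is set
print(bit(flags, 6))  # 0 -> bit 6 / G (HasIconLocation) is clear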

Hexdump: Convert between bytes and two-byte decimal

When I use hexdump on a file with no options, I get rows of hexadecimal bytes:
cf fa ed fe 07 00 00 01 03 00 00 80 02 00 00 00
When I used hexdump -d on the same file, that same data is shown in something called two-byte decimal groupings:
64207 65261 00007 00256 00003 32768 00002 00000
So what I'm trying to figure out here is how to convert between these two encodings. cf and fa in decimal are 207 and 250 respectively. How do those numbers get combined to make 64207?
Bonus question: What is the advantage of using these groupings? The octal display uses 16 groupings of three digits, why not use the same thing with the decimal display?
As commented by @georg:
0xfa * 256 + 0xcf == 0xfacf == 64207
The conversion works exactly like this.
So, if you look at man hexdump:
-d, --two-bytes-decimal
Two-byte decimal display. Display the input offset in hexadecimal, followed by eight
space-separated, five-column, zero-filled, two-byte units of input data, in unsigned
decimal, per line.
So, for example:
00000f0 64207 65261 00007 00256 00003 32768 00002 00000
Here, 00000f0 is a hexadecimal offset.
It is followed by the two-byte units of input data, e.g. 64207 in decimal (the first 16 bits, i.e. the first two bytes, of that line of input).
The conversion (in your case):
cf fa ----> two-byte unit (the byte ordering depends on your architecture; here it is little-endian).
fa * 256 + cf = facf ----> 0xfacf (after re-ordering the bytes)
And the decimal value of 0xfacf is 64207.
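The same grouping can be reproduced in Python; this sketch assumes a little-endian machine, which is what makes cf fa come out as 64207:
import struct

data = bytes.fromhex("cffaedfe070000010300008002000000")

# hexdump -d: eight space-separated, five-column, zero-filled,
# unsigned two-byte units per line
units = struct.unpack("<8H", data)  # eight little-endian unsigned 16-bit values
print(" ".join("%05d" % u for u in units))
# 64207 65261 00007 00256 00003 32768 00002 00000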
Bonus question: It is a convention to display octal numbers using three digits for each byte (a byte always fits in three octal digits), so the octal display uses one triplet per byte, while the decimal display groups the input into two-byte units.

Why strings are stored in the following way in PE file

I opened a .exe file and found that the string "Premium" was stored in the following way:
50 00 72 00 65 00 6D 00 69 00 75 00 6D 00
I just don't know why "00" is appended to each of the characters and what its purpose is.
Thanks,
It's probably a UTF-16 encoding of a Unicode string. Here's an example using Python:
>>> u"Premium".encode("utf16")
'\xff\xfeP\x00r\x00e\x00m\x00i\x00u\x00m\x00'
#        ^    ^    ^    ^    ^    ^    ^
After the byte-order mark that indicates endianness, you can see alternating letters and null bytes.
\xff\xfe is the byte-order mark (BOM); it indicates that the low-order byte of each 16-bit value comes first. (The BOM is the character U+FEFF encoded in the same byte order as the rest of the text, so if the high-order byte came first you would see \xfe\xff instead.)
Each character is then encoded as a 16-bit value. For characters in the Basic Multilingual Plane, the UTF-16 encoding is simply the unsigned 16-bit representation of the Unicode code point. In particular, ASCII characters use a null byte as the high-order byte and the ASCII value as the low-order byte.
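Going the other way, the bytes from the .exe decode straight back to the string. This assumes UTF-16-LE without a byte-order mark, which is how such strings typically appear inside a PE file:
raw = bytes.fromhex("5000720065006D00690075006D00")
print(raw.decode("utf-16-le"))              # Premium
print("Premium".encode("utf-16-le").hex())  # 5000720065006d00690075006d00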

Understanding .bmp file

I have a .bmp file
I sort of do understand and sort of do not understand. I understand that the first 14 bytes are my Bitmapfileheader. I furthermore do understand that my Bitmapinfoheader contains information about the bitmap as well and is about 40 bytes large (in version 3).
What I do not understand is, how the information is stored in there.
I have this image:
Why is all the color information stored as "FF"? I know that the "00" bytes are "junk bytes". What I do not understand is why everything else is "FF"?!
Furthermore, I do not understand what type of "encoding" that is. 42 4D equals "BM". What is that? How can I translate what I see there into colors or letters or numbers?!
What I can read in this case:
Bitmapfileheader:
First 2 bytes: BM if it is a .bmp file: 42 4D = BM (but how does 42 4D turn into BM?)
Next 4 Bytes: Size of the bitmap. BA 01 00 00 here. Don't know what size that should be.
Next 4 Bytes: Something reserved.
Next 4 Bytes: Offset (did not quite understand that)
Bitmapinfoheader
Next 4 Bytes: Size of the bitmapinfoheader. 6C 00 00 00 here.
Next 4 Bytes: Width of the .bmp. 0A 00 00 00. I know that that must be 10px since I created that file.
Next 4 Bytes: Height of the .bmp. 0A 00 00 00. I know that that must be 10px since I created that file.
Next 2 Bytes: Something from another file format.
Next 2 Bytes: Color depth. 18 00 00 00. I thought that can only be 1, 2, 4, 8, 16, 24 or 32?
The first 2 bytes of information that you see, "42 4D", are what we call the magic number. They are the signature of the file: 42 4d is the hex notation of 01000010 01001101 in binary, and those two bytes are simply the ASCII codes of the letters 'B' and 'M'.
Almost every file format has one: .jpg, .gif, you get it.
Here is an image that illustrates a complete 54-byte BMP header (24-bit BMP):
BMP Header
The total size of the BMP is calculated as the size of the headers + BMP.width x BMP.height x 3 (1 byte for red, 1 byte for green, 1 byte for blue, in the case of 8 bits of information per channel) + padding (if it exists).
The junk bytes you refer to are the padding; they are needed whenever the size of each scanline (row) in bytes is not a multiple of 4.
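As a worked example (assuming the 6C 00 00 00 info-header size means the 108-byte BITMAPV4HEADER variant), the numbers for your 10x10, 24-bit file add up exactly to the size field you saw, BA 01 00 00 = 0x1BA = 442:
width, height, bytes_per_pixel = 10, 10, 3
row = width * bytes_per_pixel           # 30 bytes of pixel data per scanline
padded_row = (row + 3) // 4 * 4         # rounded up to a multiple of 4 -> 32
total = 14 + 108 + padded_row * height  # file header + info header + pixel data
print(total, hex(total))                # 442 0x1ba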
White in hex notation is ff ff ff, one byte each for the red, green and blue channels, while in decimal notation each channel has the value 255, because 2^8 - 1 = 255.
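To tie it together, here is a minimal Python sketch of how the header fields discussed above can be pulled out of the raw bytes (read_bmp_headers is just an illustrative name; only the handful of fields mentioned in the question are decoded):
import struct

def read_bmp_headers(path):
    with open(path, "rb") as f:
        data = f.read(26)
    # BITMAPFILEHEADER: magic, file size, two reserved words, pixel-data offset
    magic, file_size, _r1, _r2, offset = struct.unpack_from("<2sIHHI", data, 0)
    # Start of the info header: header size, width, height (all little-endian)
    header_size, width, height = struct.unpack_from("<Iii", data, 14)
    return magic, file_size, offset, header_size, width, height

# For the file described above this would give something like
# (b'BM', 442, ..., 108, 10, 10).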
Hope this clears things up a bit (pun not intended) for you.

Resources