There is a (relatively) well known hack for dividing a 32-bit number by three. Instead of using actual expensive division, the number can be multiplied by the magic number 0x55555556, and the upper 32 bits of the result are what we're looking for. For example, the following C code:
int32_t div3(int32_t x)
{
return x / 3;
}
compiled with GCC and -O2, results in this:
08048460 <div3>:
8048460: 8b 4c 24 04 mov ecx,DWORD PTR [esp+0x4]
8048464: ba 56 55 55 55 mov edx,0x55555556
8048469: 89 c8 mov eax,ecx
804846b: c1 f9 1f sar ecx,0x1f
804846e: f7 ea imul edx
8048470: 89 d0 mov eax,edx
8048472: 29 c8 sub eax,ecx
8048474: c3 ret
I'm guessing the sub instruction is responsible for fixing negative numbers, because what it does is essentially add 1 if the argument is negative, and it's a NOP otherwise.
But why does this work? I've been trying to manually multiply smaller numbers by a 1-byte version of this mask, but I fail to see a pattern, and I can't really find any explanations anywhere. It seems to be a mystery magic number whose origin isn't clear to anyone, just like 0x5f3759df.
Can someone provide an explanation of the arithmetic behind this?
It's because 0x55555556 is really 0x100000000 / 3, rounded up.
The rounding is important. Since 0x100000000 doesn't divide evenly by 3, there will be an error in the full 64-bit result. If that error were negative, the result after truncation of the lower 32 bits would be too low. By rounding up, the error is positive, and it's all in the lower 32 bits so the truncation wipes it out.
movsbl (%rax, %rcx, 1),%eax
and $0xf, %eax
I have:
%rax=93824992274748
%eax=1431693628
%rcx=0
I really don't understand why I get these results:
How does the first instruction give me %eax=97?
Why does the and between the binary representation of 97 and 1111 give me 1?
Bitwise AND compares the bits of both operands. In this case, 97 AND 15:
0110 0001 ;97
0000 1111 ;15
For each column of bits, if both bits in the column are 1, the resulting bit in that column is 1. Otherwise, it's zero.
0110 0001 ;97
0000 1111 ;15
---------------
0000 0001 ;1
You might be wondering what purpose this serves. There are actually a lot of things you can do with AND, many of which aren't obvious at first glance. It's very helpful to think of one operand as your data and the other as a "filter."
For example, let's say we have a function called rand that returns a random 32-bit unsigned integer in %eax every time you call it. Assume that all possible values are equally likely. Now, let's say that we have some function called myEvent, and whether we call it or not shall be based on the outcome of rand. If we want this event to have a 1 in 16 chance of occurring, we can do this:
call rand
and $0xf, %eax
jnz skip
call myEvent
skip:
The reason this works is that every multiple of 16 has the bottom 4 bits clear. So those are the only bits we're interested in, and we can use and $0xf, %eax to ignore everything to the left of those 4 bits, since it all turns into zeroes after the and. Once we've done the and, we check whether %eax contains zero. If it does, %eax held a multiple of 16 prior to the and.
Here's another example. This tells you if %eax is odd or even:
and $1, %eax
jnz isOdd
This works because a number is odd if its rightmost binary digit is 1. Any time you write n % 2 on an unsigned value in a high-level language, the compiler replaces it with n & 1 rather than doing actual division (for signed values it emits a slightly longer sign-correcting sequence).
While developing my assembler I ran into a problem. In assembly language we can define data (e.g. msg db 'Hi') and use its address anywhere, even before the line that defines it. But while assembling, the assembler doesn't know the address of that data until it has processed the line that defines it.
Of course, the assembler can remember the locations in the machine code where the address of not-yet-defined data is used, and patch those locations once the data's address is known. But if the data ends up at a two-byte address (e.g. 01F1), the assembler has to replace a one-byte placeholder with the two bytes of the address (01 F1), so the immediate field grows (imm8 -> imm16) and the instruction has to be rewritten (changing the w and s bits in the opcode and maybe adding a 0x66 prefix). And if the instruction grows and our data is defined after it, the data's address shifts, so the patched immediate values have to be adjusted again (incrementing the address value).
An illustration of this algorithm:
The following code:
mov dh, 09h
add dx, msg
;...
msg db 'Hello$'
will be assembled roughly like this:
preparing the code:
Comment : |===> Remember address of this byte (0x0004)
Comment : | ADD DX,MSG |
Address : 0000 0001 |0002 0003 0004| ... 01F1 01F2 01F3 01F4 01F5 01F6
Code : B4 09 | 83 C2 00 | ... 48 65 6C 6C 6F 24
Comment : ---------------- H e l l o $
rewriting the code at the remembered addresses:
Comment : |=============|-This address (msg)
Comment : | ADD DX,01F1 | v
Address : 0000 0001 |0002 0003 0004 0005| ... 01F2 01F3 01F4 01F5 01F6 01F7
Code : B4 09 | 83 C2 F1 01 | ... 48 65 6C 6C 6F 24
Comment : --------------------- H e l l o $
rewriting the instruction's opcode 83h -> 81h (10000011b -> 10000001b: bit s=0):
Comment : |=============|-This address (msg)
Comment : | ADD DX,01F1 | v
Address : 0000 0001 |0002 0003 0004 0005| ... 01F2 01F3 01F4 01F5 01F6 01F7
Code : B4 09 | 81 C2 F1 01 | ... 48 65 6C 6C 6F 24
Comment : --------------------- H e l l o $
writing the new address of the data (0x01F2) into the immediate field:
Comment : |=============|-This address (msg)
Comment : | ADD DX,01F2 | v
Address : 0000 0001 |0002 0003 0004 0005| ... 01F2 01F3 01F4 01F5 01F6 01F7
Code : B4 09 | 81 C2 F2 01 | ... 48 65 6C 6C 6F 24
Comment : --------------------- H e l l o $
This algorithm seems complicated. Is it possible to simplify it?
If the assembler isn't emitting a flat binary (i.e. also being a linker), the assembler has to assume the symbol address might be 2 bytes, because the final absolute address won't be known until link time, after the assembler is done. (So it will just leave space for a 2-byte address and a relocation for the linker to fill it in).
But if you are assembling directly into a flat binary and want to do this optimization, presumably you'd treat it like branch displacements with a start-small algorithm and do multi-pass optimization until everything fits. In fact you'd do this as part of the same passes that look at jmp/jcc rel8 vs. jmp/jcc rel16. (Why is the "start small" algorithm for branch displacement not optimal? - it is optimal if you don't have stuff like align 8, otherwise there are corner cases where it does ok but not optimal.)
These optimization passes are just looping over internal data-structures that represent the code, not actually writing final machine-code at each step. There's no need to calculate or look up the actual opcodes and ModRM encodings until after the last optimization pass.
You just need your optimizer to know rules for instruction size, e.g. that add reg, imm8 is 3 bytes, add reg, imm16 is 4 bytes (except for AX where add ax, imm16 has a 3-byte special encoding, same as add ax, imm8, so add to AX doesn't need to be part of the multi-pass optimization at all, it can just choose an encoding when we reach it after all symbol addresses are known.)
Note that it's much more common to use addresses as immediates for mov, which doesn't allow narrow immediates at all (mov reg, imm16 is always 3 bytes). But this optimization is also relevant for disp8 vs. disp16 in addressing modes: e.g. xor cl, [di + msg] could use reg+disp8 for small addresses, so the optimization is worth having there too.
So again, your optimizer passes would know that [di + disp8] takes 1 byte after the ModRM, [di + disp16] takes 2 extra.
And [msg] always takes 2 bytes after ModRM, there is no [disp8] encoding. The asm source would need to have a zeroed register if it wanted to take advantage of disp8 for small addresses.
Of course a simplistic or one-pass assembler could always just assume addresses are 16-bit and encode the other parts of the machine code accordingly, only going back to fill in numeric addresses once unresolved symbols are seen. (Or emit relocation info at the end for a linker to do it.)
I can't find (or even formulate a Google query to find) an answer to this simple (noob) question.
I'm inspecting an application with objdump -d tool:
. . .
5212c0: 73 2e jae 5212f0 <rfb::SMsgReaderV3::readSetDesktopSize()+0x130>
5213e8: 73 2e jae 521418 <rfb::SMsgReaderV3::readSetDesktopSize()+0x258>
521462: 73 2c jae 521490 <rfb::SMsgReaderV3::readSetDesktopSize()+0x2d0>
. . .
What does it mean the +XXXX offset in the output? How can I relate it to the source code, if possible? (Postprocessed with c++filt)
It's the offset in bytes from the beginning of the function.
Here's an example from WinDbg, but it's the same everywhere:
This is the current call stack:
0:000> k L1
# Child-SP RetAddr Call Site
00 00000000`001afcb8 00000000`77b39ece USER32!NtUserGetMessage+0xa
This is what the function looks like:
0:000> uf USER32!NtUserGetMessage
USER32!NtUserGetMessage:
00000000`77b39e90 4c8bd1 mov r10,rcx
00000000`77b39e93 b806100000 mov eax,1006h
00000000`77b39e98 0f05 syscall
00000000`77b39e9a c3 ret
And this is what the current instruction is:
0:000> u USER32!NtUserGetMessage+a L1
USER32!NtUserGetMessage+0xa:
00000000`77b39e9a c3 ret
So, the offset 0x0A is 10 bytes from the function start. 3 bytes for the first mov, 5 bytes for the second mov and 2 bytes for the syscall.
If you want to relate it to your code, it heavily depends on whether or not it was optimized.
If the offset is very high, you might not have enough symbols. E.g. with export symbols only you may see offsets like +0x2AF4 and you can't tell anything about the real function any more.
I am trying to understand SHA256. On the Wikipedia page it says:
append the bit '1' to the message
append k bits '0', where k is the minimum number >= 0 such that the resulting message
length (modulo 512 in bits) is 448.
append length of message (without the '1' bit or padding), in bits, as 64-bit big-endian
integer
(this will make the entire post-processed length a multiple of 512 bits)
So if my message is 01100001 01100010 01100011 I would first add a 1 to get
01100001 01100010 01100011 1
Then you would fill in 0s so that the total length is 448 mod 512:
01100001 01100010 01100011 10000000 0000 ... 0000
(So in this example, one would add 448 - 25 = 423 zero bits)
My question is: What does the last part mean? I would like to see an example.
It means the message length, written out as a 64-bit integer with the bytes appearing in order of significance. So if the message length is 37113, that's 90 f9 in hex; two bytes. There are two basic(*) ways to represent this as a 64-bit integer,
00 00 00 00 00 00 90 f9 # big endian
and
f9 90 00 00 00 00 00 00 # little endian
The former convention follows the way numbers are usually written out in decimal: one hundred and two is written 102, with the most significant part (the "big end") being written first, the least significant ("little end") last. The reason that this is specified explicitly is that both conventions are used in practice; internet protocols use big endian, Intel-compatible processors use little endian, so if they were decimal machines, they'd write one hundred and two as 201.
(*) Actually there are 8! = 40320 ways to represent a 64-bit integer if 8-bit bytes are the smallest units to be permuted, but two are in actual use.
gcc compile binary has following assembly:
8049264: 8d 44 24 3e lea 0x3e(%esp),%eax
8049268: 89 c2 mov %eax,%edx
804926a: bb ff 00 00 00 mov $0xff,%ebx
804926f: b8 00 00 00 00 mov $0x0,%eax
8049274: 89 d1 mov %edx,%ecx
8049276: 83 e1 02 and $0x2,%ecx
8049279: 85 c9 test %ecx,%ecx
804927b: 74 09 je 0x8049286
At first glance, I had no idea what it was doing at all. My best guess is some sort of memory alignment and clearing of a local variable (because a later rep stos fills the local variable's location with 0). Looking at the first few lines, they load an address into EAX, copy it via EDX into ECX, and test whether it's an even address or not, but I'm lost as to why. I want to know exactly what is happening here.
It looks like it's initialising a local variable located at [ESP + 0x3E] to zeroes. First, EDX is initialised to hold the address and EBX is initialised to hold the size in bytes.

Then it checks whether EDX & 2 is nonzero; in other words, whether EDX as a pointer is wyde-aligned (2-byte) but not tetra-aligned (4-byte). (Assuming ESP is tetrabyte-aligned, as it generally should be, EDX, which was initialised to 0x3E bytes above ESP, would not be tetrabyte-aligned. But this is slightly beside the point.) If this is the case, the wyde from AX, which is zero, is stored at [EDX], EDX is incremented by two, and the counter EBX is decremented by two. Now, assuming ESP was at least wyde-aligned, EDX is guaranteed to be tetra-aligned.

ECX is calculated to hold the number of tetrabytes remaining by shifting EBX right two bits, EDI is loaded from EDX, and the REP STOS stores that many zero tetrabytes at [EDI], incrementing EDI in the process. Then EDX is loaded from EDI to get the pointer past the space initialised so far.

Finally, if there were at least two bytes remaining uninitialised, a zero wyde is stored at [EDX] and EDX is incremented by two, and if there was at least one byte remaining uninitialised, a zero byte is stored at [EDX] and EDX is incremented by one.

The point of this extra complexity is apparently to store most of the zeroes as four-byte values rather than single-byte values, which may, under certain circumstances and on certain CPU architectures, be slightly faster.