Windows DLL user space calling process

I'm trying to understand how Windows recursively loads DLLs in user space.
Tracing kernel32.ReadProcessMemory as an example:
The first step for ReadProcessMemory is a jump through the IAT of kernel32:
00007FF901F6AFA0 | 48:FF25 21D20500 | jmp qword ptr ds:[<&ReadProcessMemory>] |
Which jumps to kernelbase.ReadProcessMemory:
00007FF9002D22F0 | 48:83EC 48 | sub rsp,48 |
00007FF9002D22F4 | 48:8D4424 30 | lea rax,qword ptr ss:[rsp+30] |
00007FF9002D22F9 | 48:894424 20 | mov qword ptr ss:[rsp+20],rax |
00007FF9002D22FE | 48:FF15 C3521400 | call qword ptr ds:[<&ZwReadVirtualMemory>] |
<snip>
Which calls ntdll.ZwReadVirtualMemory:
00007FF902F5C840 | 4C:8BD1 | mov r10,rcx |
00007FF902F5C843 | B8 3F000000 | mov eax,3F | 3F:'?'
00007FF902F5C848 | F60425 0803FE7F 01 | test byte ptr ds:[7FFE0308],1 |
00007FF902F5C850 | 75 03 | jne ntdll.7FF902F5C855 |
00007FF902F5C852 | 0F05 | syscall |
00007FF902F5C854 | C3 | ret |
00007FF902F5C855 | CD 2E | int 2E |
00007FF902F5C857 | C3 | ret |
So the flow from user mode in this example is:
kernel32.ReadProcessMemory
kernelbase.ReadProcessMemory
ntdll.ZwReadVirtualMemory
The expectation would be that each of the above DLLs can 'find' the appropriate function through its IAT/imported functions from the other DLLs when it is loaded.
Using dumpbin and tracing the IMPORTS, this is true for kernel32.ReadProcessMemory (where api-ms-win-core-memory-l1-1-0.dll is an ApiSet for kernelbase.dll):
api-ms-win-core-memory-l1-1-0.dll
180078178 Import Address Table
18009E120 Import Name Table
0 time date stamp
0 Index of first forwarder reference
35 VirtualQueryEx
<snip>
1C ReadProcessMemory
However, this is not true for kernelbase.dll: NtReadVirtualMemory is imported, but ZwReadVirtualMemory is not:
ntdll.dll
1801A67C8 Import Address Table
180262A48 Import Name Table
0 time date stamp
0 Index of first forwarder reference
893 __C_specific_handler
<snip>
205 NtReadVirtualMemory
So, my question is: during the DLL load process, how does kernelbase.dll identify the 'location' of ZwReadVirtualMemory if it isn't imported?
The ZwReadVirtualMemory function is being called by kernelbase.dll, so it must have been resolved/stored in the IAT at some point, but how does this happen technically?
Is there some indirection where the loader maps NtReadVirtualMemory to ZwReadVirtualMemory as these functions resolve to the same address?
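One way to test that last hypothesis is to ask ntdll.dll for both export names and compare the returned addresses. Below is a minimal C sketch using documented Win32 APIs (not from the original question):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* ntdll.dll is mapped into every user-mode process, so GetModuleHandle suffices. */
    HMODULE ntdll = GetModuleHandleA("ntdll.dll");
    FARPROC nt = GetProcAddress(ntdll, "NtReadVirtualMemory");
    FARPROC zw = GetProcAddress(ntdll, "ZwReadVirtualMemory");
    printf("NtReadVirtualMemory: %p\n", (void *)nt);
    printf("ZwReadVirtualMemory: %p\n", (void *)zw);
    /* If both names print the same address, the two exports are aliases of the
       same stub, and a debugger labeling the call target is free to show
       either name. */
    return 0;
}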

Performance difference between two seemingly equivalent assembly codes

tl;dr: I have two functionally equivalent C codes that I compile with Clang (the fact that it's C code doesn't matter much; only the assembly is interesting I think), and IACA tells me that one should be faster, but I don't understand why, and my benchmarks show the same performance for the two codes.
I have the following C code (ignore #include "iacaMarks.h", IACA_START, IACA_END for now):
ref.c:
#include "iacaMarks.h"
#include <x86intrin.h>
#define AND(a,b) _mm_and_si128(a,b)
#define OR(a,b) _mm_or_si128(a,b)
#define XOR(a,b) _mm_xor_si128(a,b)
#define NOT(a) _mm_andnot_si128(a,_mm_set1_epi32(-1))
void sbox_ref (__m128i r0,__m128i r1,__m128i r2,__m128i r3,
__m128i* r5,__m128i* r6,__m128i* r7,__m128i* r8) {
__m128i r4;
IACA_START
r3 = XOR(r3,r0);
r4 = r1;
r1 = AND(r1,r3);
r4 = XOR(r4,r2);
r1 = XOR(r1,r0);
r0 = OR(r0,r3);
r0 = XOR(r0,r4);
r4 = XOR(r4,r3);
r3 = XOR(r3,r2);
r2 = OR(r2,r1);
r2 = XOR(r2,r4);
r4 = NOT(r4);
r4 = OR(r4,r1);
r1 = XOR(r1,r3);
r1 = XOR(r1,r4);
r3 = OR(r3,r0);
r1 = XOR(r1,r3);
r4 = XOR(r4,r3);
*r5 = r1;
*r6 = r4;
*r7 = r2;
*r8 = r0;
IACA_END
}
I was wondering if I could optimize it by manually rescheduling a few instructions (I am well aware that the C compiler should produce an efficient scheduling, but my experiments have shown that it's not always the case). At some point, I tried the following code (it's the same as above, except that no temporary variables are used to store the results of the XORs that are later assigned to *r5 and *r6):
resched.c:
#include "iacaMarks.h"
#include <x86intrin.h>
#define AND(a,b) _mm_and_si128(a,b)
#define OR(a,b) _mm_or_si128(a,b)
#define XOR(a,b) _mm_xor_si128(a,b)
#define NOT(a) _mm_andnot_si128(a,_mm_set1_epi32(-1))
void sbox_resched (__m128i r0,__m128i r1,__m128i r2,__m128i r3,
__m128i* r5,__m128i* r6,__m128i* r7,__m128i* r8) {
__m128i r4;
IACA_START
r3 = XOR(r3,r0);
r4 = r1;
r1 = AND(r1,r3);
r4 = XOR(r4,r2);
r1 = XOR(r1,r0);
r0 = OR(r0,r3);
r0 = XOR(r0,r4);
r4 = XOR(r4,r3);
r3 = XOR(r3,r2);
r2 = OR(r2,r1);
r2 = XOR(r2,r4);
r4 = NOT(r4);
r4 = OR(r4,r1);
r1 = XOR(r1,r3);
r1 = XOR(r1,r4);
r3 = OR(r3,r0);
*r7 = r2;
*r8 = r0;
*r5 = XOR(r1,r3); // These two lines are different
*r6 = XOR(r4,r3); // (no more temporary variables)
IACA_END
}
I'm compiling these codes using Clang 5.0.0 targeting my i5-6500 (Skylake), with the flags -O3 -march=native (I'm omitting the assembly code produced, as it can be found in the IACA outputs below, but if you'd prefer to have it directly here, ask me and I'll add it). I benchmarked those two codes and didn't find any performance difference between them. Out of curiosity, I ran IACA on them, and I was surprised to see that it said that the first version should take 6 cycles to run, and the second version 7 cycles.
Here are the outputs produced by IACA:
For the first version:
dada@dada-ubuntu ~/perf % clang -O3 -march=native -c ref.c && ./iaca -arch SKL ref.o
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-23;16:42:45
Analyzed File - ref_iaca.o
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 6.00 Cycles Throughput Bottleneck: FrontEnd
Loop Count: 23
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 6.0 0.0 | 6.0 | 1.3 0.0 | 1.4 0.0 | 4.0 | 6.0 | 0.0 | 1.4 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1 | 1.0 | | | | | | | | vpxor xmm4, xmm3, xmm0
| 1 | | 1.0 | | | | | | | vpand xmm5, xmm4, xmm1
| 1 | | | | | | 1.0 | | | vpxor xmm1, xmm2, xmm1
| 1 | 1.0 | | | | | | | | vpxor xmm5, xmm5, xmm0
| 1 | | 1.0 | | | | | | | vpor xmm0, xmm3, xmm0
| 1 | | | | | | 1.0 | | | vpxor xmm0, xmm0, xmm1
| 1 | 1.0 | | | | | | | | vpxor xmm1, xmm4, xmm1
| 1 | | 1.0 | | | | | | | vpxor xmm3, xmm4, xmm2
| 1 | | | | | | 1.0 | | | vpor xmm2, xmm5, xmm2
| 1 | 1.0 | | | | | | | | vpxor xmm2, xmm2, xmm1
| 1 | | 1.0 | | | | | | | vpcmpeqd xmm4, xmm4, xmm4
| 1 | | | | | | 1.0 | | | vpxor xmm1, xmm1, xmm4
| 1 | 1.0 | | | | | | | | vpor xmm1, xmm5, xmm1
| 1 | | 1.0 | | | | | | | vpxor xmm4, xmm5, xmm3
| 1 | | | | | | 1.0 | | | vpor xmm3, xmm0, xmm3
| 1 | 1.0 | | | | | | | | vpxor xmm4, xmm4, xmm3
| 1 | | 1.0 | | | | | | | vpxor xmm4, xmm4, xmm1
| 1 | | | | | | 1.0 | | | vpxor xmm1, xmm1, xmm3
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rdi], xmm4
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rsi], xmm1
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rdx], xmm2
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rcx], xmm0
Total Num Of Uops: 26
For the second version:
dada@dada-ubuntu ~/perf % clang -O3 -march=native -c resched.c && ./iaca -arch SKL resched.o
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-23;16:42:45
Analyzed File - resched_iaca.o
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 7.00 Cycles Throughput Bottleneck: Backend
Loop Count: 22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 6.0 0.0 | 6.0 | 1.3 0.0 | 1.4 0.0 | 4.0 | 6.0 | 0.0 | 1.3 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1 | 1.0 | | | | | | | | vpxor xmm4, xmm3, xmm0
| 1 | | 1.0 | | | | | | | vpand xmm5, xmm4, xmm1
| 1 | | | | | | 1.0 | | | vpxor xmm1, xmm2, xmm1
| 1 | 1.0 | | | | | | | | vpxor xmm5, xmm5, xmm0
| 1 | | 1.0 | | | | | | | vpor xmm0, xmm3, xmm0
| 1 | | | | | | 1.0 | | | vpxor xmm0, xmm0, xmm1
| 1 | 1.0 | | | | | | | | vpxor xmm1, xmm4, xmm1
| 1 | | 1.0 | | | | | | | vpxor xmm3, xmm4, xmm2
| 1 | | | | | | 1.0 | | | vpor xmm2, xmm5, xmm2
| 1 | 1.0 | | | | | | | | vpxor xmm2, xmm2, xmm1
| 1 | | 1.0 | | | | | | | vpcmpeqd xmm4, xmm4, xmm4
| 1 | | | | | | 1.0 | | | vpxor xmm1, xmm1, xmm4
| 1 | 1.0 | | | | | | | | vpor xmm1, xmm5, xmm1
| 1 | | 1.0 | | | | | | | vpxor xmm4, xmm5, xmm3
| 1 | | | | | | 1.0 | | | vpor xmm3, xmm0, xmm3
| 2^ | | | 0.3 | 0.4 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rdx], xmm2
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.4 | vmovdqa xmmword ptr [rcx], xmm0
| 1 | 1.0 | | | | | | | | vpxor xmm0, xmm4, xmm3
| 1 | | 1.0 | | | | | | | vpxor xmm0, xmm0, xmm1
| 2^ | | | 0.4 | 0.3 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rdi], xmm0
| 1 | | | | | | 1.0 | | | vpxor xmm0, xmm1, xmm3
| 2^ | | | 0.3 | 0.4 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rsi], xmm0
Total Num Of Uops: 26
Analysis Notes:
Backend allocation was stalled due to unavailable allocation resources.
As you can see, on the second version, IACA says that the bottleneck is the backend and that "Backend allocation was stalled due to unavailable allocation resources".
Both assembly codes contain the same instructions, and the only differences are the scheduling of the last 7 instructions, as well as the registers they use.
The only thing I can think of that would explain why the second code is slower is the fact that it writes xmm0 twice in the last 4 instructions, thus introducing a dependency. But since those writes are independent, I would expect the CPU to use different physical registers for them. However, I can't really prove that theory. Also, if using xmm0 twice like that were an issue, I would expect Clang to use a different register for one of the instructions (in particular since the register pressure here is low).
My question: is the second code supposed to be slower (based on the assembly code), and why?
Edit: IACA traces:
First version: https://pastebin.com/qGXHVW6a
Second version: https://pastebin.com/dbBNWsc2
Note: the C codes are implementations of Serpent cipher's first S-box, computed by Osvik here.
Figuring out why the second code is backend-bound requires some manual analysis, because the output emitted by IACA, although extremely rich in information, is too raw. Note that the traces emitted by IACA are particularly useful for analyzing loops. They can also be useful for understanding how straight-line sequences of instructions get executed, but the traces then need to be interpreted differently. Throughout the rest of this answer, I will present my analysis for the loop scenario, which is more difficult to do.
The fact that you emitted the traces without putting the code in a loop affects the following things:
the compiler couldn't inline and optimize away the stores to the output operands. They wouldn't appear at all in a real loop, or if chaining this to a different S-box.
the data dependencies from outputs to inputs happen by coincidence, as the compiler used xmm0..3 to prepare data to be stored, not as a consequence of choosing which output to feed back into which input of the same S-box.
the vpcmpeqd that creates an all-ones vector (for NOT) would be hoisted out of the loop after inlining.
There would be a dec/jnz or equivalent loop overhead (which can macro-fuse into a single uop for port 6).
But you've asked IACA to analyze this exact block of asm as if it were run in a loop. So to explain the results, that's how we'll think of it (even though it's not what you'd get from a C compiler if you used this function in a loop).
A jmp or dec/jnz at the bottom to make this a loop is not a problem in this case: it will always be executed on port 6, which is not used by any vector instruction. This means that the jump instruction will not contend for port 6 and will not consume scheduler uop bandwidth that would otherwise have been used by other instructions. However, it can impact the allocator bandwidth in the issue/rename stage (which is no more than 4 fused-domain uops per cycle), but this is not important in this particular case, as I will discuss.
Let's first examine the port pressure ASCII figure:
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1 | 1.0 | | | | | | | | vpxor xmm4, xmm3, xmm0
| 1 | | 1.0 | | | | | | | vpand xmm5, xmm4, xmm1
| 1 | | | | | | 1.0 | | | vpxor xmm1, xmm2, xmm1
| 1 | 1.0 | | | | | | | | vpxor xmm5, xmm5, xmm0
| 1 | | 1.0 | | | | | | | vpor xmm0, xmm3, xmm0
| 1 | | | | | | 1.0 | | | vpxor xmm0, xmm0, xmm1
| 1 | 1.0 | | | | | | | | vpxor xmm1, xmm4, xmm1
| 1 | | 1.0 | | | | | | | vpxor xmm3, xmm4, xmm2
| 1 | | | | | | 1.0 | | | vpor xmm2, xmm5, xmm2
| 1 | 1.0 | | | | | | | | vpxor xmm2, xmm2, xmm1
| 1 | | 1.0 | | | | | | | vpcmpeqd xmm4, xmm4, xmm4
| 1 | | | | | | 1.0 | | | vpxor xmm1, xmm1, xmm4
| 1 | 1.0 | | | | | | | | vpor xmm1, xmm5, xmm1
| 1 | | 1.0 | | | | | | | vpxor xmm4, xmm5, xmm3
| 1 | | | | | | 1.0 | | | vpor xmm3, xmm0, xmm3
| 2^ | | | 0.3 | 0.4 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rdx], xmm2
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.4 | vmovdqa xmmword ptr [rcx], xmm0
| 1 | 1.0 | | | | | | | | vpxor xmm0, xmm4, xmm3
| 1 | | 1.0 | | | | | | | vpxor xmm0, xmm0, xmm1
| 2^ | | | 0.4 | 0.3 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rdi], xmm0
| 1 | | | | | | 1.0 | | | vpxor xmm0, xmm1, xmm3
| 2^ | | | 0.3 | 0.4 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rsi], xmm0
The total number of fused-domain uops is 22. Six uops have been assigned to each of ports 0, 1, and 5. The other 4 uops each consist of an STA uop and an STD uop; the STD uop requires port 4. This assignment is reasonable. If we ignore all data dependencies, it appears that the scheduler should be able to dispatch at least 3 fused-domain uops every cycle. However, there can be serious contention at port 4, which may lead to filling up the reservation station. According to IACA, that is not the bottleneck in this code. Note that if the scheduler could somehow achieve a throughput equal to the maximum throughput of the allocator, then the code could only be frontend-bound. Obviously, this is not the case here.
The next step is to carefully examine the IACA trace. I made the following data flow graph based on the trace, which is easier to analyze. The horizontal yellow lines divide the graph according to which uops get allocated in the same cycle. Note that IACA always assumes perfect branch prediction. Also note that this division is about 99% accurate, but not 100%; this is not important, and you can just consider it 100% accurate. The nodes represent fused uops, and the arrows represent data dependences (each arrow points to the destination uop). Nodes are colored according to which loop iteration they belong to. The sources of the arrows at the top of the graph are omitted for clarity. The green boxes on the right contain the cycle number at which allocation is performed for the corresponding uops; so the previous cycle is X, and the current cycle is X + 1, whatever X is. The stop signs indicate that the associated uop suffers contention at one of the ports. All the red stop signs represent contention on port 1. There is only one stop sign of a different color, which represents contention on port 5. There are other cases of contention, but I've omitted them for clarity. Arrows come in two colors: blue and red; the red ones are the critical ones. Note that it takes 11 cycles to allocate 2 iterations' worth of instructions, and then the allocation pattern repeats. Keep in mind that Skylake has 97 RS entries.
The location of a node within each division (the "local" location) has a meaning. If two nodes are on the same row and if all of their operands are available, then it means that they can be dispatched in the same cycle. Otherwise, if the nodes are not on the same row, then they may not be dispatched in the same cycle. This only applies to dynamic uops that have been allocated together as a group and not to dynamic uops allocated as part of different groups even if they happen to be in the same division in the graph.
I'll use the notation (it, in) to identify a specific fused uop, where it is a zero-based loop iteration number and in is a zero-based uop number. The most important part of the IACA trace is the one that shows the pipeline stages for (11, 5):
11| 5|vpxor xmm0, xmm0, xmm1 : | | | | | | | | | | | | | |
11| 5| TYPE_OP (1 uops) : | | | | | |_A--------------------dw----R-------p | | | | |
This tells us that the allocation bandwidth is underutilized at this point due to unavailable resources (in this case, an entry in the reservation station). This means that the scheduler was not able to sustain a high enough throughput of unfused uops to keep up with the front-end's 4 fused uops per cycle. Since IACA has already told us that the code is backend-bound, the reason for this underutilization is obviously not some long dependency chain or contention at specific execution units, but rather something more complicated. So we need to do more work to figure out what's going on. We have to analyze past (11, 5).
The uops 1, 4, 7, 10, 13, and 18 of every iteration are all assigned to port 1. What happens during a period of 11 cycles? There are a total of 12 uops that require port 1, so it's impossible to dispatch all of them in 11 cycles; that would take at least 12 cycles. Unfortunately, data dependencies within the uops that require the same port and across uops that require other ports exacerbate the problem significantly. Consider the following pipeline flow during an 11-cycle period:
At cycle 0: (0, 0) and (0, 1) get allocated (along with other uops that we don't care about right now). (0, 1) is data-dependent on (0, 0).
1: (0, 4) and (0, 7) get allocated. Assuming that no older, ready uop is assigned to port 0 and that the operands of (0, 0) are ready, dispatch (0, 0) to port 0. Port 1 potentially remains idle because (0, 1) is not ready yet.
2: The result of (0, 0) is available through the bypass network. At this point, (0, 1) can and will be dispatched. However, even if (0, 4) or (0, 7) is ready, neither is the oldest uop assigned to port 1, so both get blocked. (0, 10) gets allocated.
3: (0, 4) is dispatched to port 1. (0, 7) and (0, 10) both get blocked even if their operands are ready. (0, 13) gets allocated.
4: (0, 7) is dispatched to port 1. (0, 10) gets blocked. (0, 13) has to wait for (0, 7). (0, 18) gets allocated.
5: (0, 10) is dispatched to port 1. (0, 13) gets blocked. (0, 18) has to wait for (0, 17) which depends on (0, 13). (1, 0) and (1, 1) get allocated.
6: (0, 13) is dispatched to port 1. (0, 18) has to wait for (0, 17) which depends on (0, 13). (1, 1) has to wait for (1, 0). (1, 0) cannot be dispatched because the distance between (1, 0) and (0, 7) is 3 uops, one of which may suffer a port conflict. (1, 4) gets allocated.
7: Nothing gets dispatched to port 1 because (0, 18), (1, 1), and (1, 4) are not ready. (1, 7) gets allocated.
8: Nothing gets dispatched to port 1 because (0, 18), (1, 1), (1, 4), and (1, 7) are not ready. (1, 10) and (1, 13) get allocated.
9: (0, 18) is dispatched to port 1. (1, 10) and (1, 4) are ready but get blocked due to port contention. (1, 1), (1, 7), and (1, 13) are not ready.
10: (1, 1) is dispatched to port 1. (1, 4), (1, 7), and (1, 10) are ready but get blocked due to port contention. (1, 13) is not ready. (1, 18) gets allocated.
Well, ideally, we'd like 11 of the 12 uops to be dispatched to port 1 in 11 cycles. But this analysis shows that the situation is far from ideal. Port 1 is idle for 4 out of the 11 cycles! If we assume that some (X, 18) from a previous iteration gets dispatched at cycle 0, then port 1 would be idle for 3 cycles, which is a lot of waste, considering that we have 12 uops that require it every 11 cycles. Out of the 12 uops, only up to 8 got dispatched. How bad can the situation get? We can continue analyzing the trace and record the number of p1-bound uops that are either ready to be dispatched but blocked due to a port conflict, or not ready due to data dependencies. I was able to determine that the number of p1-bound uops stalled due to port conflicts is never larger than 3. However, the number of p1-bound uops stalled due to data dependencies increases gradually over time. I did not see any pattern in the way it increases, so I decided to use linear regression on the first 24 cycles of the trace to predict at what point there would be 97 such uops. The following figure shows that.
The x-axis represents the zero-based cycles increasing from left to right. Note that the number of uops is zero for the first 4 cycles. The y-axis represents the number of such uops at the corresponding cycle. The linear regression equation is:
y = 0.3624x - 0.6925.
By setting y to 97 we get:
x = (97 + 0.6925) / 0.3624 = 269.57
That is, at about cycle 269, we expect there to be 97 uops in the RS, all p1-bound and waiting for their operands to become ready. It is at this point that the RS is full. However, there can be other uops waiting in the RS for other reasons. So we expect the allocator to underutilize its bandwidth at or before cycle 269. By looking at the IACA trace for instruction (11, 5), we can see that this happens at cycle 61, which is much earlier than 269. This means that either my predictor is very optimistic, or that the counts of uops bound to other ports exhibit similar behavior. My gut tells me it's the latter. But that is good enough to understand why IACA said that the code is backend-bound. You can perform a similar analysis on the first code to understand why it's frontend-bound. I guess I'll just leave that as an exercise for the reader.
This manual analysis can be followed when IACA does not support a particular piece of code, or when a tool like IACA does not exist for a particular microarchitecture. The linear regression model makes it possible to estimate after how many iterations the allocator underutilizes its bandwidth. For example, in this case, cycle 269 corresponds to iteration 269/11/2 = 269/22 = 12. So as long as the maximum number of iterations is not much larger than 12, the backend performance of the loop would be less of an issue.
There is a related post by @Bee: How are x86 uops scheduled, exactly?
I may post the details of what happens during the first 24 cycles later.
Side note: There are two errors in Wikichip's article on Skylake. First, Broadwell's scheduler has 60 entries, not 64. Second, the allocator's throughput is up to 4 fused uops only.
"I benchmarked those two codes and didn't find any performance difference between them."
I did the same thing on my Skylake i7-6700k, actually benchmarking what you told IACA to analyze, by taking that asm and slapping a dec ebp / jnz .loop around it.
I found sbox_ref runs at ~7.50 cycles per iteration, while sbox_resched runs at ~8.04 c/iter, tested in a static executable on Linux, with performance counters. (See Can x86's MOV really be "free"? Why can't I reproduce this at all? for details of my test methodology).
IACA's numbers are wrong, but it is correct that sbox_resched is slower.
Hadi's analysis appears correct: the dependency chains in the asm are long enough that any resource conflicts in uop scheduling will cause the back-end to lose throughput that it can never catch up from.
Presumably you benched by letting a C compiler inline that function into a loop, with local variables for the output operands. That will change the asm significantly (these are the reverse of the bullet points I edited into @Hadi's answer before writing my own):
Instead of happening by accident as the compiler uses xmm0..3 as scratch registers late in the function, the data dependencies from outputs to inputs are visible to the compiler so it can schedule appropriately. Your source code will choose which output to feed back into which input of the same S-box.
Or the deps don't exist (if you use constant inputs and avoid having the loop optimize away using volatile or an empty inline asm statement).
The stores to the output operands optimize away, like would happen for real if chaining this to a different S-box.
the vpcmpeqd that creates an all-ones vector (for NOT) would be hoisted out of the loop after inlining.
As Hadi says, the 1 uop macro-fused dec/jnz loop overhead doesn't compete for vector ALUs, so it itself isn't important. What is critically important is that slapping an asm loop around something the compiler didn't optimize as a loop body unsurprisingly gives silly results.
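For reference, here is a minimal C sketch of the kind of loop the answers describe, with the outputs explicitly fed back into the inputs so the compiler can see the loop-carried dependencies (the feedback choice, iteration count, and final reduction are illustrative; the IACA markers are assumed to be compiled out, and sbox_ref must be visible for inlining, e.g. defined in the same translation unit):

#include <x86intrin.h>

/* Prototype of the S-box from ref.c above. */
void sbox_ref(__m128i r0, __m128i r1, __m128i r2, __m128i r3,
              __m128i* r5, __m128i* r6, __m128i* r7, __m128i* r8);

__m128i bench_sbox(__m128i r0, __m128i r1, __m128i r2, __m128i r3, long iters)
{
    __m128i o0, o1, o2, o3;
    for (long i = 0; i < iters; i++) {
        sbox_ref(r0, r1, r2, r3, &o0, &o1, &o2, &o3);
        /* Feed outputs back into inputs: after inlining, the stores through
           o0..o3 optimize away and the compiler schedules across iterations. */
        r0 = o0; r1 = o1; r2 = o2; r3 = o3;
    }
    /* Combine the results so the whole loop isn't optimized away. */
    return _mm_xor_si128(_mm_xor_si128(r0, r1), _mm_xor_si128(r2, r3));
}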

Several int3 in a row

I'm using x64dbg to inspect the disassembly of a .DLL.
At several points in the assembly I see several Int3 instructions in a row.
00007FFA24BF1638 | CC | int3 |
00007FFA24BF1639 | CC | int3 |
00007FFA24BF163A | CC | int3 |
00007FFA24BF163B | CC | int3 |
00007FFA24BF163C | CC | int3 |
00007FFA24BF163D | CC | int3 |
00007FFA24BF163E | CC | int3 |
00007FFA24BF163F | CC | int3 |
This instruction is used for debugging/breakpoints, right? So why are there so many in a row, and why are there any at all, considering this DLL was compiled with a release-configuration VC++ build?
It's probably just padding; these bytes won't ever be executed. I assume the next function begins at 00007FFA24BF1640, which is 16-byte aligned, and the preceding function presumably ends just before these instructions.

Mifare 1k value block operations

I have a problem with a Mifare Standard 1k card. I made a value block (00000001FFFFFFFE0000000100FF00FF - valid?) on the data block with address 62. The value of the value block is supposed to be 1, and the address of the value block is 0.
I've changed the access bits for the data block 2 to be:
C1=1
C2=1
C3=0
The other 2 data blocks have factory access bits. Access bits for the sector trailer are also changed and are:
C1=0
C2=1
C3=1
So, access bits for the corresponding sector (16th sector) are 3B478C69 (valid?).
The problem is that I can't perform any of the value-block-specific operations on that block (increment, decrement, etc.); I always get 6A81 as the response -> "Card is blocked or command not supported".
The APDU I'm using is FFF5C13E0400000001.
OMNIKEY readers have extensions to the PC/SC API for contactless memory cards. The commands defined by these extensions for increment and decrement of MIFARE Classic value blocks are:
Increment:
+------+------+------+------+------+-------------+
| CLA  | INS  | P1   | P2   | Lc   | DATA        |
+------+------+------+------+------+-------------+
| 0xFF | 0xD4 | BLOCK#      | 0x04 | XX 00 00 00 |
+------+------+------+------+------+-------------+
or (depending on the firmware version???) the same command with a 1-byte data field:
+------+------+------+------+------+------+
| 0xFF | 0xD4 | BLOCK#      | 0x01 | XX   |
+------+------+------+------+------+------+
Decrement:
+------+------+------+------+------+-------------+
| CLA  | INS  | P1   | P2   | Lc   | DATA        |
+------+------+------+------+------+-------------+
| 0xFF | 0xD8 | BLOCK#      | 0x04 | XX 00 00 00 |
+------+------+------+------+------+-------------+
or (depending on the firmware version???) the same command with a 1-byte data field:
+------+------+------+------+------+------+
| 0xFF | 0xD8 | BLOCK#      | 0x01 | XX   |
+------+------+------+------+------+------+
BLOCK#:
P1 is the MSB of the block number (always zero) and P2 is the LSB of the block number.
XX:
The increment/decrement value.
The commands are documented in OMNIKEY Contactless Smart Card Readers Developer Guide.
It seems as if both commands implicitly issue a transfer command to commit the operation. A restore command is not documented for the PC/SC extensions, however, the restore command is available through the OMNIKEY synchronous API.
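As a concrete illustration, transmitting the 4-byte increment variant for the block in the question (block 62 = 0x3E) over plain PC/SC might look like the following C sketch (it assumes an already-connected T=1 card handle and the value encoding shown in the tables above; error handling is trimmed):

#include <stdio.h>
#include <windows.h>
#include <winscard.h>   /* link against winscard.lib */

LONG increment_block_62(SCARDHANDLE hCard)
{
    BYTE apdu[] = { 0xFF, 0xD4,              /* CLA, INS: increment          */
                    0x00, 0x3E,              /* P1 = MSB, P2 = LSB of block# */
                    0x04,                    /* Lc: 4-byte data field        */
                    0x01, 0x00, 0x00, 0x00}; /* value to add: XX 00 00 00    */
    BYTE resp[258];
    DWORD respLen = sizeof resp;
    LONG rc = SCardTransmit(hCard, SCARD_PCI_T1, apdu, sizeof apdu,
                            NULL, resp, &respLen);
    if (rc == SCARD_S_SUCCESS && respLen >= 2)
        printf("SW1 SW2: %02X %02X\n", resp[respLen - 2], resp[respLen - 1]);
    return rc;
}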

read a character from a string via inline assembler in visual studio

I'm having a problem with the Visual Studio inline assembler: it doesn't seem to be loading the right values. I declare a C string like so:
const char* str = "1235";
Then I access the string like so in asm:
movzx ebx, byte ptr str[esi]
The problem is that instead of loading 49 into ebx (the ASCII code for '1'), it loads 0. I'm sure that esi is the right index because of extensive debugging. Am I loading the string wrong?
The instruction
movzx ebx, byte ptr str[esi]
says to read the byte that is esi bytes after the start of str. It is the assembly language equivalent of
ebx = *((unsigned char*)&str + esi);
In your case, str is a pointer variable, so str[0] is the low 8 bits of the pointer, str[1] is bits 8-15, str[2] is bits 16-23, and str[3] is bits 24-31. Any value larger than 3 is a buffer overflow because you are reading memory that is not part of str.
What you actually want to do is load the str variable into a register, and then do indexed addressing from that register:
mov ebx, str
movzx ebx, byte ptr [ebx+esi]
In pictures:
+----+
1003 | 12 |
+----+
1002 | 34 |
+----+
1001 | 56 |
+----+
str 1000 | 78 |
+----+
+----+
1234567D | 00 |
+----+
1234567C | 45 |
+----+
1234567B | 44 |
+----+
1234567A | 43 |
+----+
12345679 | 42 |
+----+
12345678 | 41 |
+----+
You are trying to read byte str+esi, which reads part of the variable str. If esi is greater than 3, you are reading past the end of str.
What you actually want to do is read the thing that str points to. This means that you need to load str into a register (I chose ebx), then add esi to the result, then access the byte there.
Try:
mov ebx, str
movzx ebx, byte ptr [ebx+esi]
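Putting it together, a complete 32-bit MSVC sketch (variable names are illustrative) that prints 49:

#include <stdio.h>

int main(void)
{
    const char* str = "1235";
    int result;
    __asm {
        xor esi, esi                   ; index 0, i.e. the '1'
        mov eax, str                   ; load the pointer value itself
        movzx eax, byte ptr [eax+esi]  ; read the byte str points to, plus esi
        mov result, eax
    }
    printf("%d\n", result);  /* prints 49 */
    return 0;
}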

Debugging assembly to find a static pointer for reference to a value in a game

I previously asked a question on here, but I was unregistered and wasn't able to edit my entry (not sure if you can) or add any information about the issue. I'm going to attempt to be more thorough this time so I can hopefully get an answer...
I'm trying to find a static pointer and a list of offsets so that I can easily find information in a game every time it's restarted. I've been successful with every piece of information but one...
Currently, I'm using CheatEngine to help me debug and find the pointer paths needed.
The address of the value I want (which changes with each game start) is currently 849576A. For reference, this is the first inventory slot of my first character. I know that each slot is offset by 20h and each character by 550h, so character two's first inventory slot is 849576A+550h. Again, these addresses change on each restart, but the offsets do not.
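In C terms, the addressing scheme described above amounts to the following (the helper name and types are illustrative):

#include <stdint.h>

/* Address of inventory slot `slot` of character `chr`, given the base
   address found for this particular run of the game. */
uintptr_t slot_addr(uintptr_t base, int chr, int slot)
{
    return base + (uintptr_t)chr * 0x550 + (uintptr_t)slot * 0x20;
}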
Using CE, I can see what accesses this address... it returns the following opcodes:
These two are returned before doing anything in-game:
004b7ef9 - 0f bf 08 - movsx ecx,word ptr [eax]
004b542b - 0f bf 04 0a - movsx eax,word ptr [edx+ecx]
Then when shifting items in my inventory I get these:
74be5008 - 72 2a - jb memcpy+84
004bfc3a - 0f bf 4c 02 60 - movsx ecx,word ptr [edx+eax+60]
004bf43f - 8d 7d 9c - lea edi,[ebp-64]
I'm unsure of which to use, so I just picked one and set a breakpoint on it. I chose 004b542b; here's the complete code section:
004B53F0 | 55 | PUSH EBP |
004B53F1 | 8BEC | MOV EBP, ESP |
004B53F3 | 83EC 0C | SUB ESP, C |
004B53F6 | 894D F4 | MOV DWORD PTR [EBP-C], ECX |
004B53F9 | C745 FC 00000000 | MOV DWORD PTR [EBP-4], 0 |
004B5400 | 837D 08 00 | CMP DWORD PTR [EBP+8], 0 |
004B5404 | 7F 04 | JG 004B540A |
004B5406 | 33C0 | XOR EAX, EAX |
004B5408 | EB 43 | JMP 004B544D |
004B540A | C745 F8 0F000000 | MOV DWORD PTR [EBP-8], F |
004B5411 | EB 09 | JMP 004B541C |
004B5413 | 8B45 F8 | MOV EAX, DWORD PTR [EBP-8] |
004B5416 | 83C0 01 | ADD EAX, 1 |
004B5419 | 8945 F8 | MOV DWORD PTR [EBP-8], EAX |
004B541C | 837D F8 19 | CMP DWORD PTR [EBP-8], 19 |
004B5420 | 7D 28 | JGE 004B544A |
004B5422 | 8B4D F8 | MOV ECX, DWORD PTR [EBP-8] |
004B5425 | C1E1 05 | SHL ECX, 5 |
004B5428 | 8B55 F4 | MOV EDX, DWORD PTR [EBP-C] |
004B542B | 0FBF040A | MOVSX EAX, WORD PTR [EDX+ECX] |
004B542F | 3B45 08 | CMP EAX, DWORD PTR [EBP+8] |
004B5432 | 75 14 | JNZ 004B5448 |
004B5434 | 8B4D F8 | MOV ECX, DWORD PTR [EBP-8] |
004B5437 | C1E1 05 | SHL ECX, 5 |
004B543A | 8B55 F4 | MOV EDX, DWORD PTR [EBP-C] |
004B543D | 0FBF440A 02 | MOVSX EAX, WORD PTR [EDX+ECX+2] |
004B5442 | 0345 FC | ADD EAX, DWORD PTR [EBP-4] |
004B5445 | 8945 FC | MOV DWORD PTR [EBP-4], EAX |
004B5448 | EB C9 | JMP 004B5413 |
004B544A | 8B45 FC | MOV EAX, DWORD PTR [EBP-4] |
004B544D | 8BE5 | MOV ESP, EBP |
004B544F | 5D | POP EBP |
004B5450 | C2 0400 | RETN 4 |
I decide to set a breakpoint so I can see the register values before and after the line that supposedly accesses my value (004B542B | 0FBF040A | MOVSX EAX, WORD PTR [EDX+ECX]).
BEFORE:
EAX: 00000000
EBX: 00000000
ECX: 000001E0
EDX: 0849558C
ESI: 000000D0
EDI: 013A38A8
EBP: 00189CE0
ESP: 00189CD4
EIP: 004B542B
AFTER:
EAX: 00000DAD
EBX: 00000000
ECX: 000001E0
EDX: 0849558C
ESI: 000000D0
EDI: 013A38A8
EBP: 00189CE0
ESP: 00189CD4
EIP: 004B542F
To me, this means EDX = 0849558C should be the base address I'm looking for, to which an offset of 1E0 is applied. However, when searching memory for hex values matching EDX, I get no results, which means there aren't any pointers to that address.
I've used the same methods I'm attempting to use here, to successfully gather each static address then apply the offsets. For example, here's the static address + offsets to find my health: 01263FC8 +284 +C +30 +90
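For reference, following a pointer path like that from another process is just a repeated "read a pointer, add an offset"; here is a minimal C sketch using ReadProcessMemory (assuming a 32-bit target; names are illustrative and error handling is trimmed):

#include <windows.h>

/* Follow a CheatEngine-style pointer path: start at a static address,
   read the pointer stored there, add the next offset, and repeat.
   Returns the final address, e.g. where the health value lives. */
DWORD follow_path(HANDLE hProc, DWORD base, const DWORD* offsets, int n)
{
    DWORD addr = base;
    for (int i = 0; i < n; i++) {
        DWORD val = 0;
        ReadProcessMemory(hProc, (LPCVOID)(UINT_PTR)addr, &val, sizeof val, NULL);
        addr = val + offsets[i];
    }
    return addr;
}

/* Usage for the health example above:
   DWORD offs[] = { 0x284, 0xC, 0x30, 0x90 };
   DWORD healthAddr = follow_path(hProc, 0x01263FC8, offs, 4); */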
I've finally figured it out. Unfortunately, the debugging was leading me nowhere, so I started looking at the pointers I had found previously for my characters, particularly health and mana, as these were closest to the addresses I was getting for my inventory. I did some math based on the address I was trying to find and the pointer closest to my health, and I found an offset. Using that offset and the same static pointer I had found for my health, I was able to find my inventory each time.
