Performance difference between two seemingly equivalent assembly codes


tl;dr: I have two functionally equivalent C codes that I compile with Clang (the fact that it's C code doesn't matter much; only the assembly is interesting I think), and IACA tells me that one should be faster, but I don't understand why, and my benchmarks show the same performance for the two codes.
I have the following C code (ignore #include "iacaMarks.h", IACA_START, IACA_END for now):
ref.c:
#include "iacaMarks.h"
#include <x86intrin.h>
#define AND(a,b) _mm_and_si128(a,b)
#define OR(a,b) _mm_or_si128(a,b)
#define XOR(a,b) _mm_xor_si128(a,b)
#define NOT(a) _mm_andnot_si128(a,_mm_set1_epi32(-1))
void sbox_ref (__m128i r0,__m128i r1,__m128i r2,__m128i r3,
__m128i* r5,__m128i* r6,__m128i* r7,__m128i* r8) {
__m128i r4;
IACA_START
r3 = XOR(r3,r0);
r4 = r1;
r1 = AND(r1,r3);
r4 = XOR(r4,r2);
r1 = XOR(r1,r0);
r0 = OR(r0,r3);
r0 = XOR(r0,r4);
r4 = XOR(r4,r3);
r3 = XOR(r3,r2);
r2 = OR(r2,r1);
r2 = XOR(r2,r4);
r4 = NOT(r4);
r4 = OR(r4,r1);
r1 = XOR(r1,r3);
r1 = XOR(r1,r4);
r3 = OR(r3,r0);
r1 = XOR(r1,r3);
r4 = XOR(r4,r3);
*r5 = r1;
*r6 = r4;
*r7 = r2;
*r8 = r0;
IACA_END
}
I was wondering if I could optimize it by manually rescheduling a few instructions (I am well aware that the C compiler should produce an efficient scheduling, but my experiments have shown that it's not always the case). At some point, I tried the following code (it's the same as above, except that no temporary variables are used to store the results of the XORs that are later assigned to *r5 and *r6):
resched.c:
#include "iacaMarks.h"
#include <x86intrin.h>
#define AND(a,b) _mm_and_si128(a,b)
#define OR(a,b) _mm_or_si128(a,b)
#define XOR(a,b) _mm_xor_si128(a,b)
#define NOT(a) _mm_andnot_si128(a,_mm_set1_epi32(-1))
void sbox_resched (__m128i r0,__m128i r1,__m128i r2,__m128i r3,
__m128i* r5,__m128i* r6,__m128i* r7,__m128i* r8) {
__m128i r4;
IACA_START
r3 = XOR(r3,r0);
r4 = r1;
r1 = AND(r1,r3);
r4 = XOR(r4,r2);
r1 = XOR(r1,r0);
r0 = OR(r0,r3);
r0 = XOR(r0,r4);
r4 = XOR(r4,r3);
r3 = XOR(r3,r2);
r2 = OR(r2,r1);
r2 = XOR(r2,r4);
r4 = NOT(r4);
r4 = OR(r4,r1);
r1 = XOR(r1,r3);
r1 = XOR(r1,r4);
r3 = OR(r3,r0);
*r7 = r2;
*r8 = r0;
*r5 = XOR(r1,r3); // These two lines are different
*r6 = XOR(r4,r3); // (no more temporary variables)
IACA_END
}
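The claim that the two versions are functionally equivalent can be sanity-checked outside of C. The following is only a sketch: it models each __m128i as a 128-bit Python integer, and the helper name `equivalent` and the random-testing approach are my own assumptions, not part of the original code.

```python
import random

MASK = (1 << 128) - 1  # one __m128i modelled as a 128-bit integer


def NOT(a):
    # Same effect as _mm_andnot_si128(a, all_ones): bitwise complement.
    return ~a & MASK


def _common(r0, r1, r2, r3):
    # The part of the S-box that is identical in ref.c and resched.c.
    r3 ^= r0
    r4 = r1
    r1 &= r3
    r4 ^= r2
    r1 ^= r0
    r0 |= r3
    r0 ^= r4
    r4 ^= r3
    r3 ^= r2
    r2 |= r1
    r2 ^= r4
    r4 = NOT(r4)
    r4 |= r1
    r1 ^= r3
    r1 ^= r4
    r3 |= r0
    return r0, r1, r2, r3, r4


def sbox_ref(r0, r1, r2, r3):
    r0, r1, r2, r3, r4 = _common(r0, r1, r2, r3)
    r1 ^= r3  # temporaries updated first ...
    r4 ^= r3
    return r1, r4, r2, r0  # ... then stored to (*r5, *r6, *r7, *r8)


def sbox_resched(r0, r1, r2, r3):
    r0, r1, r2, r3, r4 = _common(r0, r1, r2, r3)
    out7, out8 = r2, r0  # stores to *r7 and *r8 come first
    out5 = r1 ^ r3       # XOR results go straight to the outputs
    out6 = r4 ^ r3
    return out5, out6, out7, out8


def equivalent(trials=1000, seed=0):
    # Hypothetical helper: compare both versions on random 128-bit inputs.
    rng = random.Random(seed)
    for _ in range(trials):
        args = [rng.getrandbits(128) for _ in range(4)]
        if sbox_ref(*args) != sbox_resched(*args):
            return False
    return True


print(equivalent())  # prints True
```

Random testing is not a proof, but here the two functions differ only in the order of the final assignments, so agreement on random inputs is what one would expect.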
I'm compiling these codes using Clang 5.0.0 targeting my i5-6500 (Skylake), with the flags -O3 -march=native (I'm omitting the generated assembly, as it can be found in the IACA outputs below, but if you'd prefer to have it directly here, ask me and I'll add it). I benchmarked those two codes and didn't find any performance difference between them. Out of curiosity, I ran IACA on them, and I was surprised to see that it said that the first version should take 6 cycles to run, and the second version 7 cycles.
Here is the output produced by IACA:
For the first version:
dada@dada-ubuntu ~/perf % clang -O3 -march=native -c ref.c && ./iaca -arch SKL ref.o
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-23;16:42:45
Analyzed File - ref_iaca.o
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 6.00 Cycles Throughput Bottleneck: FrontEnd
Loop Count: 23
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 6.0 0.0 | 6.0 | 1.3 0.0 | 1.4 0.0 | 4.0 | 6.0 | 0.0 | 1.4 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1 | 1.0 | | | | | | | | vpxor xmm4, xmm3, xmm0
| 1 | | 1.0 | | | | | | | vpand xmm5, xmm4, xmm1
| 1 | | | | | | 1.0 | | | vpxor xmm1, xmm2, xmm1
| 1 | 1.0 | | | | | | | | vpxor xmm5, xmm5, xmm0
| 1 | | 1.0 | | | | | | | vpor xmm0, xmm3, xmm0
| 1 | | | | | | 1.0 | | | vpxor xmm0, xmm0, xmm1
| 1 | 1.0 | | | | | | | | vpxor xmm1, xmm4, xmm1
| 1 | | 1.0 | | | | | | | vpxor xmm3, xmm4, xmm2
| 1 | | | | | | 1.0 | | | vpor xmm2, xmm5, xmm2
| 1 | 1.0 | | | | | | | | vpxor xmm2, xmm2, xmm1
| 1 | | 1.0 | | | | | | | vpcmpeqd xmm4, xmm4, xmm4
| 1 | | | | | | 1.0 | | | vpxor xmm1, xmm1, xmm4
| 1 | 1.0 | | | | | | | | vpor xmm1, xmm5, xmm1
| 1 | | 1.0 | | | | | | | vpxor xmm4, xmm5, xmm3
| 1 | | | | | | 1.0 | | | vpor xmm3, xmm0, xmm3
| 1 | 1.0 | | | | | | | | vpxor xmm4, xmm4, xmm3
| 1 | | 1.0 | | | | | | | vpxor xmm4, xmm4, xmm1
| 1 | | | | | | 1.0 | | | vpxor xmm1, xmm1, xmm3
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rdi], xmm4
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rsi], xmm1
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rdx], xmm2
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rcx], xmm0
Total Num Of Uops: 26
For the second version:
dada@dada-ubuntu ~/perf % clang -O3 -march=native -c resched.c && ./iaca -arch SKL resched.o
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-23;16:42:45
Analyzed File - resched_iaca.o
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 7.00 Cycles Throughput Bottleneck: Backend
Loop Count: 22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 6.0 0.0 | 6.0 | 1.3 0.0 | 1.4 0.0 | 4.0 | 6.0 | 0.0 | 1.3 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1 | 1.0 | | | | | | | | vpxor xmm4, xmm3, xmm0
| 1 | | 1.0 | | | | | | | vpand xmm5, xmm4, xmm1
| 1 | | | | | | 1.0 | | | vpxor xmm1, xmm2, xmm1
| 1 | 1.0 | | | | | | | | vpxor xmm5, xmm5, xmm0
| 1 | | 1.0 | | | | | | | vpor xmm0, xmm3, xmm0
| 1 | | | | | | 1.0 | | | vpxor xmm0, xmm0, xmm1
| 1 | 1.0 | | | | | | | | vpxor xmm1, xmm4, xmm1
| 1 | | 1.0 | | | | | | | vpxor xmm3, xmm4, xmm2
| 1 | | | | | | 1.0 | | | vpor xmm2, xmm5, xmm2
| 1 | 1.0 | | | | | | | | vpxor xmm2, xmm2, xmm1
| 1 | | 1.0 | | | | | | | vpcmpeqd xmm4, xmm4, xmm4
| 1 | | | | | | 1.0 | | | vpxor xmm1, xmm1, xmm4
| 1 | 1.0 | | | | | | | | vpor xmm1, xmm5, xmm1
| 1 | | 1.0 | | | | | | | vpxor xmm4, xmm5, xmm3
| 1 | | | | | | 1.0 | | | vpor xmm3, xmm0, xmm3
| 2^ | | | 0.3 | 0.4 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rdx], xmm2
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.4 | vmovdqa xmmword ptr [rcx], xmm0
| 1 | 1.0 | | | | | | | | vpxor xmm0, xmm4, xmm3
| 1 | | 1.0 | | | | | | | vpxor xmm0, xmm0, xmm1
| 2^ | | | 0.4 | 0.3 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rdi], xmm0
| 1 | | | | | | 1.0 | | | vpxor xmm0, xmm1, xmm3
| 2^ | | | 0.3 | 0.4 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rsi], xmm0
Total Num Of Uops: 26
Analysis Notes:
Backend allocation was stalled due to unavailable allocation resources.
As you can see, on the second version, IACA says that the bottleneck is the backend and that "Backend allocation was stalled due to unavailable allocation resources".
Both assembly codes contain the same instructions, and the only differences are the scheduling of the last 7 instructions, as well as the registers they use.
The only thing I can think of that would explain why the second code is slower is the fact that it writes xmm0 twice in the last 4 instructions, thus introducing a dependency. But since those writes are independent, I would expect the CPU to use different physical registers for them. However, I can't really prove that theory. Also, if using xmm0 twice like that were an issue, I would expect Clang to use a different register for one of the instructions (in particular since the register pressure here is low).
My question: is the second code supposed to be slower (based on the assembly code), and why?
Edit: IACA traces:
First version: https://pastebin.com/qGXHVW6a
Second version: https://pastebin.com/dbBNWsc2
Note: the C codes are implementations of Serpent cipher's first S-box, computed by Osvik here.

Figuring out why the second code is backend-bound requires some manual analysis, because the output emitted by IACA is very raw, although extremely rich in information. Note that the traces emitted by IACA are particularly useful for analyzing loops. They can also be useful for understanding how straight-line sequences of instructions get executed (which is not as useful), but the emitted traces need to be interpreted differently. Throughout the rest of this answer, I will present my analysis for the loop scenario, which is more difficult to do.
The fact that you emitted the traces without putting the code in a loop affects the following things:
The compiler couldn't inline and optimize away the stores to the output operands. They wouldn't appear at all in a real loop, or if chaining this to a different S-box.
The data dependencies from outputs to inputs happen by coincidence, as the compiler used xmm0..3 to prepare data to be stored, not as a consequence of choosing which output to feed back into which input of the same S-box.
The vpcmpeqd that creates an all-ones vector (for NOT) would be hoisted out of the loop after inlining.
There would be a dec/jnz or equivalent loop overhead (which can be macro-fused into a single uop for port 6).
But you've asked IACA to analyze this exact block of asm as if it were run in a loop. So to explain the results, that's how we'll think of it (even though it's not what you'd get from a C compiler if you used this function in a loop).
A jmp or dec/jnz at the bottom to make this a loop is not a problem in this case: It will always get executed on port 6, which is not used by any vector instruction. This means that the jump instruction will not contend on port 6 and will not consume scheduler uop bandwidth that would otherwise have been used by other instructions. However, this can impact the resource allocator bandwidth in the issue/rename stage (which is no more than 4 fused domain uops per cycle), but this is not important in this particular case as I will discuss.
Let's first examine the port pressure ASCII figure:
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1 | 1.0 | | | | | | | | vpxor xmm4, xmm3, xmm0
| 1 | | 1.0 | | | | | | | vpand xmm5, xmm4, xmm1
| 1 | | | | | | 1.0 | | | vpxor xmm1, xmm2, xmm1
| 1 | 1.0 | | | | | | | | vpxor xmm5, xmm5, xmm0
| 1 | | 1.0 | | | | | | | vpor xmm0, xmm3, xmm0
| 1 | | | | | | 1.0 | | | vpxor xmm0, xmm0, xmm1
| 1 | 1.0 | | | | | | | | vpxor xmm1, xmm4, xmm1
| 1 | | 1.0 | | | | | | | vpxor xmm3, xmm4, xmm2
| 1 | | | | | | 1.0 | | | vpor xmm2, xmm5, xmm2
| 1 | 1.0 | | | | | | | | vpxor xmm2, xmm2, xmm1
| 1 | | 1.0 | | | | | | | vpcmpeqd xmm4, xmm4, xmm4
| 1 | | | | | | 1.0 | | | vpxor xmm1, xmm1, xmm4
| 1 | 1.0 | | | | | | | | vpor xmm1, xmm5, xmm1
| 1 | | 1.0 | | | | | | | vpxor xmm4, xmm5, xmm3
| 1 | | | | | | 1.0 | | | vpor xmm3, xmm0, xmm3
| 2^ | | | 0.3 | 0.4 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rdx], xmm2
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.4 | vmovdqa xmmword ptr [rcx], xmm0
| 1 | 1.0 | | | | | | | | vpxor xmm0, xmm4, xmm3
| 1 | | 1.0 | | | | | | | vpxor xmm0, xmm0, xmm1
| 2^ | | | 0.4 | 0.3 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rdi], xmm0
| 1 | | | | | | 1.0 | | | vpxor xmm0, xmm1, xmm3
| 2^ | | | 0.3 | 0.4 | 1.0 | | | 0.3 | vmovdqa xmmword ptr [rsi], xmm0
The total number of fused domain uops is 22. Six uops have been assigned to each of ports 0, 1, and 5. Each of the other 4 uops consists of an STD and an STA uop. STD requires port 4. This assignment is reasonable. If we ignore all data dependencies, it appears the scheduler should be able to dispatch at least 3 fused domain uops every cycle. However, there can be serious contention at port 4, which may lead to filling up the reservation station. According to IACA, that is not the bottleneck in this code. Note that if the scheduler could somehow achieve a throughput equal to the maximum throughput of the allocator, then the code could only be frontend-bound. Obviously, this is not the case here.
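The uop counts above can be tallied mechanically from the listing. The sketch below is only illustrative: the list `port_of_uop` is my transcription of the port column (each store is counted once, on port 4, for its STD part), and the two bounds ignore data dependencies and the store-address uops on ports 2, 3 and 7.

```python
from collections import Counter

ISSUE_WIDTH = 4  # fused-domain uops per cycle at issue/rename, as stated in the text

# Port assigned to each fused uop, in listing order.
# The first 15 ALU uops rotate over p0, p1, p5; the tail mixes stores and ALU uops.
port_of_uop = ["p0", "p1", "p5"] * 5 + ["p4", "p4", "p0", "p1", "p4", "p5", "p4"]

pressure = Counter(port_of_uop)
fused_uops = len(port_of_uop)

# Cycles per iteration if the busiest port were the only limit.
port_bound = max(pressure.values())
# Cycles per iteration if only the issue width were the limit.
issue_bound = fused_uops / ISSUE_WIDTH

print(fused_uops)    # 22
print(port_bound)    # 6
print(issue_bound)   # 5.5
```

Neither bound accounts for the second version's extra cycle, which is why the rest of the analysis has to look at the trace itself.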
The next step is to carefully examine the IACA trace. I made the following data flow graph based on the trace, which is easier to analyze. The horizontal yellow lines divide the graph according to which uops get allocated in the same cycle. Note that IACA always assumes perfect branch prediction. Also note that this division is about 99% accurate, but not 100%. This is not important and you can just consider it 100% accurate. The nodes represent fused uops and the arrows represent data dependence (where the arrow points to the destination uop). Nodes are colored depending on which loop iteration they belong to. The sources of the arrows at the top of the graph are omitted for clarity. The green boxes on the right contain the cycle number at which allocation is performed for the corresponding uops. So the previous cycle is X, and the current cycle is X + 1, whatever X is. The stop signs indicate that the associated uop suffers contention at one of the ports. All the red stop signs represent contention on port 1. There is only one other stop sign, of a different color, which represents contention on port 5. There are more cases of contention, but I omitted them for clarity. Arrows come in two colors: blue and red. The ones are the critical ones. Note that it takes 11 cycles to allocate 2 iterations worth of instructions, and then the allocation pattern repeats. Keep in mind that Skylake has 97 RS entries.
The location of a node within each division (the "local" location) has a meaning. If two nodes are on the same row and if all of their operands are available, then it means that they can be dispatched in the same cycle. Otherwise, if the nodes are not on the same row, then they may not be dispatched in the same cycle. This only applies to dynamic uops that have been allocated together as a group and not to dynamic uops allocated as part of different groups even if they happen to be in the same division in the graph.
I'll use the notation (it, in) to identify a specific fused uop, where it is a zero-based loop iteration number and in is a zero-based uop number. The most important part of the IACA trace is the one that shows the pipeline stages for (11, 5):
11| 5|vpxor xmm0, xmm0, xmm1 : | | | | | | | | | | | | | |
11| 5| TYPE_OP (1 uops) : | | | | | |_A--------------------dw----R-------p | | | | |
This tells us that the allocation bandwidth is underutilized at this point due to unavailable resources (in this case, an entry in the reservation station). This means that the scheduler was not able to sustain a high enough throughput of unfused uops to keep up with the front-end 4 fused uops per cycle. Since IACA has already told us that the code is backend-bound, then obviously the reason for this underutilization is not because of some long dependency chain or contention at specific execution units, but rather something more complicated. So we need to do more work to figure out what's going on. We have to analyze past (11, 5).
The uops 1, 4, 7, 10, 13, 18 of every iteration are all assigned to port 1. What happens during a period of 11 cycles? There are a total of 12 uops that require port 1, so it's impossible to dispatch all of them in 11 cycles because it will take at least 12 cycles. Unfortunately, data dependencies within the uops that require the same port and across uops that require other ports exacerbate the problem significantly. Consider the following pipeline flow during an 11-cycle period:
At cycle 0: (0, 0) and (0, 1) get allocated (along with other uops that we don't care about right now). (0, 1) is data-dependent on (0, 0).
1: (0, 4) and (0, 7) get allocated. Assuming that no older ready uop is assigned to port 0 and that the operands of (0, 0) are ready, (0, 0) is dispatched to port 0. Port 1 potentially remains idle because (0, 1) is not ready yet.
2: The result of (0, 0) is available through the bypass network. At this point, (0, 1) can and will be dispatched. However, even if (0, 4) or (0, 7) are ready, neither is the oldest uop assigned to port 1, so both get blocked. (0, 10) gets allocated.
3: (0, 4) is dispatched to port 1. (0, 7) and (0, 10) both get blocked even if their operands are ready. (0, 13) gets allocated.
4: (0, 7) is dispatched to port 1. (0, 10) gets blocked. (0, 13) has to wait for (0, 7). (0, 18) gets allocated.
5: (0, 10) is dispatched to port 1. (0, 13) gets blocked. (0, 18) has to wait for (0, 17) which depends on (0, 13). (1, 0) and (1, 1) get allocated.
6: (0, 13) is dispatched to port 1. (0, 18) has to wait for (0, 17) which depends on (0, 13). (1, 1) has to wait for (1, 0). (1, 0) cannot be dispatched because the distance between (1, 0) and (0, 7) is 3 uops, one of which may suffer a port conflict. (1, 4) gets allocated.
7: Nothing gets dispatched to port 1 because (0, 18), (1, 1), and (1, 4) are not ready. (1, 7) gets allocated.
8: Nothing gets dispatched to port 1 because (0, 18), (1, 1), (1, 4), and (1, 7) are not ready. (1, 10) and (1, 13) get allocated.
9: (0, 18) is dispatched to port 1. (1, 10) and (1, 4) are ready but get blocked due to port contention. (1, 1), (1, 7), and (1, 13) are not ready.
10: (1, 1) is dispatched to port 1. (1, 4), (1, 7), and (1, 10) are ready but get blocked due to port contention. (1, 13) is not ready. (1, 18) gets allocated.
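The port 1 activity in the walk-through above can be tallied in a few lines. This is just bookkeeping over the cycles listed, with the dictionary transcribed by hand from that list; it is not a scheduler simulation.

```python
PERIOD = 11            # cycles needed to allocate 2 iterations
P1_UOPS_PER_ITER = 6   # uops 1, 4, 7, 10, 13, 18 of each iteration

# cycle -> (iteration, uop) dispatched to port 1, as listed in the walk-through
dispatched_on_p1 = {
    2: (0, 1),
    3: (0, 4),
    4: (0, 7),
    5: (0, 10),
    6: (0, 13),
    9: (0, 18),
    10: (1, 1),
}

idle_cycles = [c for c in range(PERIOD) if c not in dispatched_on_p1]
needed = 2 * P1_UOPS_PER_ITER  # p1-bound uops allocated per 11-cycle period

print(idle_cycles)                     # [0, 1, 7, 8]
print(len(dispatched_on_p1), needed)   # 7 12
```

If a p1-bound uop from an earlier iteration happens to use cycle 0, the count rises to 8 dispatched and 3 idle cycles, still short of the 12 uops that arrive in each period.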
Well, ideally, we'd like 11 of the 12 uops to be dispatched to port 1 in 11 cycles. But this analysis shows that the situation is far from ideal. Port 1 is idle for 4 out of the 11 cycles! If we assume that some (X, 18) from a previous iteration gets dispatched at cycle 0, then port 1 would be idle for 3 cycles, which is a lot of waste, considering that we have 12 uops that require it every 11 cycles. Out of the 12 uops, only up to 8 got dispatched. How bad can the situation get? We can continue analyzing the trace and record how the number of p1-bound uops that are either ready to be dispatched but blocked due to conflict, or not ready due to data dependencies, changes over time. I was able to determine that the number of p1-bound uops stalled due to port conflict is never larger than 3. However, the number of p1-bound uops stalled due to data dependencies is gradually increasing overall. I did not see any pattern in the way it increases, so I decided to use linear regression on the first 24 cycles of the trace to predict at what point there would be 97 such uops. The following figure shows that.
The x-axis represents the zero-based cycles increasing from left to right. Note that the number of uops is zero for the first 4 cycles. The y-axis represents the number of such uops at the corresponding cycle. The linear regression equation is:
y = 0.3624x - 0.6925.
By setting y to 97 we get:
x = (97 + 0.6925) / 0.3624 = 269.57
That is, at about cycle 269, we expect there to be 97 uops in the RS, all p1-bound and waiting for their operands to become ready. It is at this point that the RS is full. However, there can be other uops that are waiting in the RS for other reasons. So we expect the allocator to underutilize its bandwidth at or before cycle 269. By looking at the IACA trace for instruction (11, 5), we can see that this happens at cycle 61, which is much earlier than 269. This means that either my predictor is very optimistic or that the counts of uops bound to other ports also exhibit a similar behavior. My gut tells me it's the latter. But that is good enough to understand why IACA has said that the code is backend-bound. You can perform a similar analysis on the first code to understand why it's frontend-bound. I guess I'll just leave that as an exercise for the reader.
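The extrapolation above is easy to reproduce. The sketch below only inverts the fitted line quoted in the answer. The underlying per-cycle counts are not given, so nothing is refitted, and the function name is my own.

```python
SLOPE = 0.3624       # fitted slope from the answer
INTERCEPT = -0.6925  # fitted intercept from the answer
RS_ENTRIES = 97      # Skylake reservation station size, as stated above


def cycle_when_count_reaches(target):
    """Solve target = SLOPE * x + INTERCEPT for x."""
    return (target - INTERCEPT) / SLOPE


predicted = cycle_when_count_reaches(RS_ENTRIES)
observed = 61  # cycle at which the trace shows the stall for (11, 5)

print(round(predicted, 2))             # 269.57
print(round(predicted / observed, 1))  # 4.4
```

The model overshoots the observed stall by a factor of roughly 4.4. That fits the suspicion that p1-bound uops are not the only ones piling up in the RS.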
This manual analysis can be followed when IACA does not support a particular piece of code or when no tool like IACA exists for a particular microarchitecture. The linear regression model makes it possible to estimate after how many iterations the allocator underutilizes its bandwidth. For example, in this case, since 2 iterations are allocated every 11 cycles, cycle 269 corresponds to iteration 269 * 2 / 11 ≈ 49. So as long as the maximum number of iterations is not much larger than that, the backend performance of the loop would be less of an issue.
There is a related post by @Bee: How are x86 uops scheduled, exactly?.
I may post the details of what happens during the first 24 cycles later.
Side note: There are two errors in Wikichip's article on Skylake. First, Broadwell's scheduler has 60 entries, not 64. Second, the allocator's throughput is up to 4 fused uops only.

I benchmarked those two codes and didn't find any performance difference between them.
I did the same thing on my Skylake i7-6700k, actually benchmarking what you told IACA to analyze, by taking that asm and slapping a dec ebp / jnz .loop around it.
I found sbox_ref runs at ~7.50 cycles per iteration, while sbox_resched runs at ~8.04 c/iter, tested in a static executable on Linux, with performance counters. (See Can x86's MOV really be "free"? Why can't I reproduce this at all? for details of my test methodology).
IACA's numbers are wrong, but it is correct that sbox_resched is slower.
Hadi's analysis appears correct: the dependency chains in the asm are long enough that any resource conflicts in uop scheduling will cause the back-end to lose throughput that it can never catch up from.
Presumably you benchmarked by letting a C compiler inline that function into a loop, with local variables for the output operands. That will change the asm significantly (these are the reverse of the bullet points I edited into @Hadi's answer before writing my own):
Instead of happening by accident as the compiler uses xmm0..3 as scratch registers late in the function, the data dependencies from outputs to inputs are visible to the compiler so it can schedule appropriately. Your source code will choose which output to feed back into which input of the same S-box.
Or the deps don't exist (if you use constant inputs and avoid having the loop optimize away using volatile or an empty inline asm statement).
The stores to the output operands optimize away, as would happen in real use when chaining this to a different S-box.
The vpcmpeqd that creates an all-ones vector (for NOT) would be hoisted out of the loop after inlining.
As Hadi says, the 1 uop macro-fused dec/jnz loop overhead doesn't compete for vector ALUs, so it itself isn't important. What is critically important is that slapping an asm loop around something the compiler didn't optimize as a loop body unsurprisingly gives silly results.

Related

Windows DLL user space calling process

I'm trying to understand how Windows recursively loads DLLs in user space.
Tracing kernel32.ReadProcessMemory as an example:
The first step for ReadProcessMemory is the IAT of kernel32:
00007FF901F6AFA0 | 48:FF25 21D20500 | jmp qword ptr ds:[<&ReadProcessMemory>] |
Which jmp's to kernelbase.ReadProcessMemory:
00007FF9002D22F0 | 48:83EC 48 | sub rsp,48 |
00007FF9002D22F4 | 48:8D4424 30 | lea rax,qword ptr ss:[rsp+30] |
00007FF9002D22F9 | 48:894424 20 | mov qword ptr ss:[rsp+20],rax |
00007FF9002D22FE | 48:FF15 C3521400 | call qword ptr ds:[<&ZwReadVirtualMemory>] |
<snip>
Which call's ntdll.ZwReadVirtualMemory:
00007FF902F5C840 | 4C:8BD1 | mov r10,rcx |
00007FF902F5C843 | B8 3F000000 | mov eax,3F | 3F:'?'
00007FF902F5C848 | F60425 0803FE7F 01 | test byte ptr ds:[7FFE0308],1 |
00007FF902F5C850 | 75 03 | jne ntdll.7FF902F5C855 |
00007FF902F5C852 | 0F05 | syscall |
00007FF902F5C854 | C3 | ret |
00007FF902F5C855 | CD 2E | int 2E |
00007FF902F5C857 | C3 | ret |
So the flow from user mode in this example is:
kernel32.ReadProcessMemory
kernelbase.ReadProcessMemory
ntdll.ZwReadVirtualMemory
The expectation would be that each of the above DLLs can 'find' the appropriate function based on their IAT/imported functions from other DLLs when they are loaded.
Using dumpbin and tracing the IMPORTS this is true for kernel32.ReadProcessMemory (where api-ms-win-core-memory-l1-1-0.dll is an ApiSet for kernelbase.dll):
api-ms-win-core-memory-l1-1-0.dll
180078178 Import Address Table
18009E120 Import Name Table
0 time date stamp
0 Index of first forwarder reference
35 VirtualQueryEx
<snip>
1C ReadProcessMemory
However, this is not true for kernelbase.dll - NtReadVirtualMemory is imported, however ZwReadVirtualMemory is not imported:
ntdll.dll
1801A67C8 Import Address Table
180262A48 Import Name Table
0 time date stamp
0 Index of first forwarder reference
893 __C_specific_handler
<snip>
205 NtReadVirtualMemory
So, my question is: during the DLL load process, how does kernelbase.dll identify the 'location' of ZwReadVirtualMemory if it isn't imported?
The ZwReadVirtualMemory function is being called by kernelbase.dll, so it must have been resolved/stored in the IAT at some point, but how does this happen technically?
Is there some indirection where the loader maps NtReadVirtualMemory to ZwReadVirtualMemory as these functions resolve to the same address?

Tensorflow: Multi-GPU training cannot make all GPU running at the same time

I have a machine with 3x GTX 1080 GPUs. Below is the training code:
dynamic_learning_rate = tf.placeholder(tf.float32, shape=[])
model_version = tf.constant(1, tf.int32)

with tf.device('/cpu:0'):
    with tf.name_scope('Input'):
        # Input images and labels.
        batch_images,\
            batch_input_vectors,\
            batch_one_hot_labels,\
            batch_file_paths,\
            batch_labels = self.get_batch()

grads = []
pred = []
cost = []

# Define optimizer
optimizer = tf.train.MomentumOptimizer(learning_rate=dynamic_learning_rate / self.batch_size,
                                       momentum=0.9,
                                       use_nesterov=True)

split_input_image = tf.split(batch_images, self.num_gpus)
split_input_vector = tf.split(batch_input_vectors, self.num_gpus)
split_input_one_hot_label = tf.split(batch_one_hot_labels, self.num_gpus)

for i in range(self.num_gpus):
    with tf.device(tf.DeviceSpec(device_type="GPU", device_index=i)):
        with tf.variable_scope(tf.get_variable_scope(), reuse=i > 0):
            with tf.name_scope('Model'):
                # Construct model
                with tf.variable_scope("inference"):
                    tower_pred = self.model(split_input_image[i], split_input_vector[i], is_training=True)
                pred.append(tower_pred)

            with tf.name_scope('Loss'):
                # Define loss and optimizer
                softmax_cross_entropy_cost = tf.reduce_mean(
                    tf.nn.softmax_cross_entropy_with_logits(logits=tower_pred, labels=split_input_one_hot_label[i]))
                cost.append(softmax_cross_entropy_cost)

# Concat variables
pred = tf.concat(pred, 0)
cost = tf.reduce_mean(cost)

# L2 regularization
trainable_vars = tf.trainable_variables()
l2_regularization = tf.add_n(
    [tf.nn.l2_loss(v) for v in trainable_vars if any(x in v.name for x in ['weights', 'biases'])])

for v in trainable_vars:
    if any(x in v.name for x in ['weights', 'biases']):
        print(v.name + ' - included for L2 regularization!')
    else:
        print(v.name)

cost = cost + self.l2_regularization_strength*l2_regularization

with tf.name_scope('Accuracy'):
    # Evaluate model
    correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(batch_one_hot_labels, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
    prediction = tf.nn.softmax(pred, name='softmax')

# Creates a variable to hold the global_step.
global_step = tf.Variable(0, trainable=False, name='global_step')

# Minimization
update = optimizer.minimize(cost, global_step=global_step, colocate_gradients_with_ops=True)
After I run the training:
Fri Nov 10 12:28:00 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:03:00.0 Off | N/A |
| 42% 65C P2 62W / 198W | 7993MiB / 8114MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1080 Off | 00000000:04:00.0 Off | N/A |
| 33% 53C P2 150W / 198W | 7886MiB / 8114MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 1080 Off | 00000000:05:00.0 On | N/A |
| 26% 54C P2 170W / 198W | 7883MiB / 8108MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 23228 C python 7982MiB |
| 1 23228 C python 7875MiB |
| 2 4793 G /usr/lib/xorg/Xorg 40MiB |
| 2 23228 C python 7831MiB |
+-----------------------------------------------------------------------------+
Fri Nov 10 12:28:36 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:03:00.0 Off | N/A |
| 42% 59C P2 54W / 198W | 7993MiB / 8114MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1080 Off | 00000000:04:00.0 Off | N/A |
| 33% 57C P2 154W / 198W | 7886MiB / 8114MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 1080 Off | 00000000:05:00.0 On | N/A |
| 27% 55C P2 155W / 198W | 7883MiB / 8108MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 23228 C python 7982MiB |
| 1 23228 C python 7875MiB |
| 2 4793 G /usr/lib/xorg/Xorg 40MiB |
| 2 23228 C python 7831MiB |
+-----------------------------------------------------------------------------+
You can see that whenever the first GPU is running, the other two GPUs are idle, and vice versa. They alternate roughly every 0.5 seconds.
For a single GPU, the training speed is around 650 images/second; with all 3 GPUs I get only 1050 images/second.
Any idea what the problem is?

You need to make sure that all the trainable variables are on the controller device (usually the CPU) and all the other worker devices (usually GPUs) are using the variables from the CPU in parallel.

Manually calculating time complexity of recursive Fibonacci algorithm

I am trying to understand the time complexity of the recursive Fibonacci algorithm.
fib(n)
    if (n < 2)
        return n
    return fib(n-1)+fib(n-2)
Having not much mathematical background, I tried computing it by hand. That is, I manually count the number of steps as n increases. I ignore all things that I think are constant time. Here is how I did it. Say I want to compute fib(5).
n = 0 - just a comparison on an if statement. This is constant.
n = 1 - just a comparison on an if statement. This is constant.
n = 2 - ignoring anything else, this should be 2 steps, fib(1) takes 1 step and fib(0) takes 1 step.
n = 3 - 3 steps now, fib(2) takes two steps and fib(1) takes 1 step.
n = 4 - 5 steps now, fib(3) takes 3 steps and fib(2) takes 2 steps.
n = 5 - 8 steps now, fib(4) takes 5 steps and fib(3) takes 3 steps.
Judging from these, I believe the running time might be fib(n+1). I am not sure whether the +1 can be treated as a constant factor, because the difference between fib(n) and fib(n+1) might be very large.
I've read the following on SICP:
In general, the number of steps required by a tree-recursive process
will be proportional to the number of nodes in the tree, while the
space required will be proportional to the maximum depth of the tree.
In this case, I believe the number of nodes in the tree is fib(n+1). So I am confident I am correct. However, this video confuses me:
So this is a thing whose time complexity is order of actually, it
turns out to be Fibonacci of n. There's a thing that grows exactly as
Fibonacci numbers. 
...
That every one of these nodes in this tree has to be examined.
I am absolutely shocked. I've examined all nodes in the tree and there are always fib(n+1) nodes and thus number of steps when computing fib(n). I can't figure out why some people say it is fib(n) number of steps and not fib(n+1).
What am I doing wrong?

In your program, you have these time-consuming actions (sorted by time used per action, quickest first):
Addition
IF (conditional jump)
Return from subroutine
Function call
Let's look at how many of these actions are executed, and compare this with n and fib(n):
n | fib | #ADD | #IF | #RET | #CALL
---+-----+------+-----+------+-------
0 | 0 | 0 | 1 | 1 | 0
1 | 1 | 0 | 1 | 1 | 0
For n≥2 you can calculate the numbers this way:
fib(n) = fib(n-1) + fib(n-2)
ADD(n) = 1 + ADD(n-1) + ADD(n-2)
IF(n) = 1 + IF(n-1) + IF(n-2)
RET(n) = 1 + RET(n-1) + RET(n-2)
CALL(n) = 2 + CALL(n-1) + CALL(n-2)
Why?
ADD: One addition is executed directly in the top-level instance of the program, but the two subroutines that you call also contain additions that need to be executed.
IF and RET: Same argument as before.
CALL: Also the same, but you execute two calls in the top-level instance.
So, this is your list for other values of n:
n | fib | #ADD | #IF | #RET | #CALL
---+--------+--------+--------+--------+--------
0 | 0 | 0 | 1 | 1 | 0
1 | 1 | 0 | 1 | 1 | 0
2 | 1 | 1 | 3 | 3 | 2
3 | 2 | 2 | 5 | 5 | 4
4 | 3 | 4 | 9 | 9 | 8
5 | 5 | 7 | 15 | 15 | 14
6 | 8 | 12 | 25 | 25 | 24
7 | 13 | 20 | 41 | 41 | 40
8 | 21 | 33 | 67 | 67 | 66
9 | 34 | 54 | 109 | 109 | 108
10 | 55 | 88 | 177 | 177 | 176
11 | 89 | 143 | 287 | 287 | 286
12 | 144 | 232 | 465 | 465 | 464
13 | 233 | 376 | 753 | 753 | 752
14 | 377 | 609 | 1219 | 1219 | 1218
15 | 610 | 986 | 1973 | 1973 | 1972
16 | 987 | 1596 | 3193 | 3193 | 3192
17 | 1597 | 2583 | 5167 | 5167 | 5166
18 | 2584 | 4180 | 8361 | 8361 | 8360
19 | 4181 | 6764 | 13529 | 13529 | 13528
20 | 6765 | 10945 | 21891 | 21891 | 21890
21 | 10946 | 17710 | 35421 | 35421 | 35420
22 | 17711 | 28656 | 57313 | 57313 | 57312
23 | 28657 | 46367 | 92735 | 92735 | 92734
24 | 46368 | 75024 | 150049 | 150049 | 150048
25 | 75025 | 121392 | 242785 | 242785 | 242784
26 | 121393 | 196417 | 392835 | 392835 | 392834
27 | 196418 | 317810 | 635621 | 635621 | 635620
You can see that the number of additions is exactly half the number of function calls (you could also have read this directly from the code). And if you count the initial program call as the very first function call, then you have exactly the same number of IFs, returns and calls.
So you can combine 1 ADD, 2 IFs, 2 RETs and 2 CALLs into one super-action that needs a constant amount of time.
You can also read from the list that the number of additions is 1 less than fib(n+1), a difference that can be ignored.
So, the running time is of order fib(n+1).
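The table can be reproduced with a small instrumented version of the function. This is only a sketch in Python (the question uses pseudocode); the names fib_counted and tally are made up for illustration, and the initial call is deliberately not counted as a CALL, matching the table.

```python
from collections import Counter


def fib_counted(n, counts):
    """Recursive Fibonacci that tallies the four actions from the table."""
    counts["IF"] += 1              # the n < 2 test runs in every invocation
    if n < 2:
        counts["RET"] += 1         # return from a leaf invocation
        return n
    counts["CALL"] += 2            # two recursive calls per inner invocation
    result = fib_counted(n - 1, counts) + fib_counted(n - 2, counts)
    counts["ADD"] += 1             # one addition per inner invocation
    counts["RET"] += 1
    return result


def tally(n):
    """Return (fib(n), Counter with the ADD/IF/RET/CALL totals)."""
    counts = Counter()
    value = fib_counted(n, counts)
    return value, counts


if __name__ == "__main__":
    print(" n | fib | #ADD | #IF | #RET | #CALL")
    for n in range(11):
        value, c = tally(n)
        print(f"{n:2} | {value:3} | {c['ADD']:4} | {c['IF']:3} | {c['RET']:4} | {c['CALL']:5}")
```

Running it for n = 0..10 should print the same rows as the table, including the relation #ADD = fib(n+1) - 1.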
The ratio fib(n+1) / fib(n) gets closer and closer to Φ as n grows. Φ is the golden ratio, approximately 1.6180339887, which is a constant. Constant factors are ignored in orders of growth, so O(fib(n+1)) is exactly the same as O(fib(n)).
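The convergence of the ratio is easy to check numerically. A minimal sketch, again in Python, using an iterative helper (fib_iter and ratio are illustrative names, not part of the question's code):

```python
import math


def fib_iter(n):
    """Iterative Fibonacci, used only to get exact values quickly."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


PHI = (1 + math.sqrt(5)) / 2   # the golden ratio


def ratio(n):
    """fib(n+1) / fib(n); only defined for n >= 1 because fib(0) == 0."""
    return fib_iter(n + 1) / fib_iter(n)


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 20, 40):
        print(n, ratio(n), abs(ratio(n) - PHI))
```

The printed difference to PHI shrinks quickly as n grows, which is why fib(n+1) and fib(n) differ only by a bounded constant factor.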
Now let's look at the space:
It is true that the maximum space needed to process the tree is proportional to the maximum distance between the root and the most distant leaf. This holds because fib(n-2) is called only after fib(n-1) has returned.
So the space needed by your program is of order n.
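The depth claim can be checked the same way. The sketch below (Python, with the hypothetical helper fib_with_depth) tracks the deepest level of nested calls, counting the initial call as level 1:

```python
def fib_with_depth(n, depth=1):
    """Return (fib(n), deepest call level reached below this invocation)."""
    if n < 2:
        return n, depth
    # The second call starts only after the first has returned, so both
    # subtrees reuse the same stack space; only the deeper one matters.
    a, depth_a = fib_with_depth(n - 1, depth + 1)
    b, depth_b = fib_with_depth(n - 2, depth + 1)
    return a + b, max(depth_a, depth_b)


if __name__ == "__main__":
    for n in range(1, 11):
        print(n, fib_with_depth(n))
```

For n >= 1 the deepest level equals n, so the stack grows linearly while the number of calls grows like fib(n+1).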

Several int3 in a row

I'm using x64dbg to inspect the disassembly of a .DLL.
At several points in the assembly I see several Int3 instructions in a row.
00007FFA24BF1638 | CC | int3 |
00007FFA24BF1639 | CC | int3 |
00007FFA24BF163A | CC | int3 |
00007FFA24BF163B | CC | int3 |
00007FFA24BF163C | CC | int3 |
00007FFA24BF163D | CC | int3 |
00007FFA24BF163E | CC | int3 |
00007FFA24BF163F | CC | int3 |
This instruction is used for debugging / breakpoints, right? So why are there so many in a row, and why are there any at all, considering this DLL was compiled with VC++ in a release configuration?

It's probably just padding; these instructions won't ever be executed. I assume the next function begins at 00007FFA24BF1640, which is 16-byte aligned, and the preceding function presumably ends before these instructions.
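The arithmetic behind that guess can be checked quickly. A small sketch in Python, assuming a 16-byte function alignment as the answer does (the constant and helper names are made up for illustration):

```python
FIRST_INT3 = 0x00007FFA24BF1638   # first int3 in the listing
LAST_INT3 = 0x00007FFA24BF163F    # last int3 in the listing
ALIGNMENT = 16                    # assumed function alignment


def next_aligned(addr, alignment=ALIGNMENT):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


next_start = next_aligned(FIRST_INT3)
padding = next_start - FIRST_INT3

if __name__ == "__main__":
    print(hex(next_start), padding)
```

Under that assumption the next aligned address is 0x7FFA24BF1640 and the gap is 8 bytes, which matches the eight int3 bytes shown in the question.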

Mifare 1k value block operations

I have a problem with a Mifare Standard 1k card. I made a value block (00000001FFFFFFFE0000000100FF00FF - valid?) on the data block with address 62. The value of the value block is supposed to be 1, and the address stored in the value block is 0.
I've changed the access bits for data block 2 to be:
C1=1
C2=1
C3=0
The other 2 data blocks have factory access bits. Access bits for the sector trailer are also changed and are:
C1=0
C2=1
C3=1
So, access bits for the corresponding sector (16th sector) are 3B478C69 (valid?).
The problem is that I can't do any of the value block specific functions on that block (increment, decrement, etc), I always get 6A81 as response -> "Card is blocked or command not supported".
The APDU I'm using is FFF5C13E0400000001.

OMNIKEY readers have extensions to the PC/SC API for contactless memory cards. The commands defined by these extensions for increment and decrement of MIFARE Classic value blocks are:
Increment:
+------+------+------+------+------+-------------+
| CLA | INS | P1 | P2 | Lc | DATA |
+------+------+------+------+------+-------------+
| 0xFF | 0xD4 | BLOCK# | 0x04 | XX 00 00 00 |
+------+------+------+------+------+-------------+
or (depending on the firmware version???) the same command with a 1-byte data field:
+------+------+------+------+------+----+
| 0xFF | 0xD4 | BLOCK# | 0x01 | XX |
+------+------+------+------+------+----+
Decrement:
+------+------+------+------+------+-------------+
| CLA | INS | P1 | P2 | Lc | DATA |
+------+------+------+------+------+-------------+
| 0xFF | 0xD8 | BLOCK# | 0x04 | XX 00 00 00 |
+------+------+------+------+------+-------------+
or (depending on the firmware version???) the same command with a 1-byte data field:
+------+------+------+------+------+----+
| 0xFF | 0xD8 | BLOCK# | 0x01 | XX |
+------+------+------+------+------+----+
BLOCK#:
P1 is the MSB of the block number (always zero) and P2 is the LSB of the block number.
XX:
The increment/decrement value.
The commands are documented in OMNIKEY Contactless Smart Card Readers Developer Guide.
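The command layouts above can be assembled as follows. This is only a sketch in Python: the function names are made up, the amount is limited to a single byte because the tables only show "XX 00 00 00", and nothing here talks to a reader.

```python
INS_INCREMENT = 0xD4   # INS byte from the increment table above
INS_DECREMENT = 0xD8   # INS byte from the decrement table above


def value_apdu(ins, block, amount, short_form=False):
    """Build CLA INS P1 P2 Lc DATA for an increment/decrement command.

    short_form=True produces the 1-byte data field variant.
    """
    if not 0 <= block <= 0xFF:
        raise ValueError("block number must fit in one byte (P1 is always zero)")
    if not 0 <= amount <= 0xFF:
        raise ValueError("this sketch only handles one-byte amounts (XX)")
    data = bytes([amount]) if short_form else bytes([amount, 0x00, 0x00, 0x00])
    return bytes([0xFF, ins, 0x00, block, len(data)]) + data


def increment_apdu(block, amount, short_form=False):
    return value_apdu(INS_INCREMENT, block, amount, short_form)


def decrement_apdu(block, amount, short_form=False):
    return value_apdu(INS_DECREMENT, block, amount, short_form)


if __name__ == "__main__":
    print(increment_apdu(62, 1).hex().upper())
    print(decrement_apdu(62, 1).hex().upper())
    print(increment_apdu(62, 1, short_form=True).hex().upper())
```

For block 62 (0x3E) and an amount of 1, the 4-byte form of the increment command comes out as FFD4003E0401000000.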
It seems as if both commands implicitly issue a transfer command to commit the operation. A restore command is not documented for the PC/SC extensions; however, the restore command is available through the OMNIKEY synchronous API.
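On the "valid?" remark about the value block in the question, the block can be checked offline. The sketch below is in Python and assumes the usual MIFARE Classic value block layout, which the answer above does not spell out: the 4-byte value stored three times (the second copy inverted, least significant byte first) followed by the address byte stored four times (alternating plain and inverted). The helper names are hypothetical.

```python
def parse_value_block(hex_block):
    """Return (value, address) or raise ValueError if the layout check fails."""
    raw = bytes.fromhex(hex_block)
    if len(raw) != 16:
        raise ValueError("a block is 16 bytes")
    value, inverted, value_copy = raw[0:4], raw[4:8], raw[8:12]
    addr, addr_inv, addr_copy, addr_inv_copy = raw[12:16]
    well_formed = (
        value == value_copy
        and inverted == bytes(b ^ 0xFF for b in value)
        and addr == addr_copy
        and addr_inv == addr_inv_copy
        and addr_inv == addr ^ 0xFF
    )
    if not well_formed:
        raise ValueError("not a well-formed value block")
    # Assumption: the value is a signed 32-bit integer, LSB first.
    return int.from_bytes(value, "little", signed=True), addr


def make_value_block(value, addr=0):
    """Build the 16-byte block (as hex) for a value under the same assumption."""
    v = value.to_bytes(4, "little", signed=True)
    inv = bytes(b ^ 0xFF for b in v)
    a = bytes([addr, addr ^ 0xFF, addr, addr ^ 0xFF])
    return (v + inv + v + a).hex().upper()


if __name__ == "__main__":
    print(parse_value_block("00000001FFFFFFFE0000000100FF00FF"))
    print(make_value_block(1))
```

Under that assumption the block from the question is well formed, but it decodes to 16777216 rather than 1; a value of 1 would be written as 01000000FEFFFFFF0100000000FF00FF.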
