Significant FMA performance anomaly experienced in the Intel Broadwell processor
Code1:
vzeroall
mov rcx, 1000000
startLabel1:
vfmadd231ps ymm0, ymm0, ymm0
vfmadd231ps ymm1, ymm1, ymm1
vfmadd231ps ymm2, ymm2, ymm2
vfmadd231ps ymm3, ymm3, ymm3
vfmadd231ps ymm4, ymm4, ymm4
vfmadd231ps ymm5, ymm5, ymm5
vfmadd231ps ymm6, ymm6, ymm6
vfmadd231ps ymm7, ymm7, ymm7
vfmadd231ps ymm8, ymm8, ymm8
vfmadd231ps ymm9, ymm9, ymm9
vpaddd ymm10, ymm10, ymm10
vpaddd ymm11, ymm11, ymm11
vpaddd ymm12, ymm12, ymm12
vpaddd ymm13, ymm13, ymm13
vpaddd ymm14, ymm14, ymm14
dec rcx
jnz startLabel1
Code2:
vzeroall
mov rcx, 1000000
startLabel2:
vmulps ymm0, ymm0, ymm0
vmulps ymm1, ymm1, ymm1
vmulps ymm2, ymm2, ymm2
vmulps ymm3, ymm3, ymm3
vmulps ymm4, ymm4, ymm4
vmulps ymm5, ymm5, ymm5
vmulps ymm6, ymm6, ymm6
vmulps ymm7, ymm7, ymm7
vmulps ymm8, ymm8, ymm8
vmulps ymm9, ymm9, ymm9
vpaddd ymm10, ymm10, ymm10
vpaddd ymm11, ymm11, ymm11
vpaddd ymm12, ymm12, ymm12
vpaddd ymm13, ymm13, ymm13
vpaddd ymm14, ymm14, ymm14
dec rcx
jnz startLabel2
Code3 (same as Code2 but with long VEX prefix):
vzeroall
mov rcx, 1000000
startLabel3:
byte 0c4h, 0c1h, 07ch, 059h, 0c0h ;long VEX form vmulps ymm0, ymm0, ymm0
byte 0c4h, 0c1h, 074h, 059h, 0c9h ;long VEX form vmulps ymm1, ymm1, ymm1
byte 0c4h, 0c1h, 06ch, 059h, 0d2h ;long VEX form vmulps ymm2, ymm2, ymm2
byte 0c4h, 0c1h, 06ch, 059h, 0dbh ;long VEX form vmulps ymm3, ymm3, ymm3
byte 0c4h, 0c1h, 05ch, 059h, 0e4h ;long VEX form vmulps ymm4, ymm4, ymm4
byte 0c4h, 0c1h, 054h, 059h, 0edh ;long VEX form vmulps ymm5, ymm5, ymm5
byte 0c4h, 0c1h, 04ch, 059h, 0f6h ;long VEX form vmulps ymm6, ymm6, ymm6
byte 0c4h, 0c1h, 044h, 059h, 0ffh ;long VEX form vmulps ymm7, ymm7, ymm7
vmulps ymm8, ymm8, ymm8
vmulps ymm9, ymm9, ymm9
vpaddd ymm10, ymm10, ymm10
vpaddd ymm11, ymm11, ymm11
vpaddd ymm12, ymm12, ymm12
vpaddd ymm13, ymm13, ymm13
vpaddd ymm14, ymm14, ymm14
dec rcx
jnz startLabel3
Code4 (same as Code1 but with xmm registers):
vzeroall
mov rcx, 1000000
startLabel4:
vfmadd231ps xmm0, xmm0, xmm0
vfmadd231ps xmm1, xmm1, xmm1
vfmadd231ps xmm2, xmm2, xmm2
vfmadd231ps xmm3, xmm3, xmm3
vfmadd231ps xmm4, xmm4, xmm4
vfmadd231ps xmm5, xmm5, xmm5
vfmadd231ps xmm6, xmm6, xmm6
vfmadd231ps xmm7, xmm7, xmm7
vfmadd231ps xmm8, xmm8, xmm8
vfmadd231ps xmm9, xmm9, xmm9
vpaddd xmm10, xmm10, xmm10
vpaddd xmm11, xmm11, xmm11
vpaddd xmm12, xmm12, xmm12
vpaddd xmm13, xmm13, xmm13
vpaddd xmm14, xmm14, xmm14
dec rcx
jnz startLabel4
Code5 (same as Code1 but with non-zeroing vpsubds):
vzeroall
mov rcx, 1000000
startLabel5:
vfmadd231ps ymm0, ymm0, ymm0
vfmadd231ps ymm1, ymm1, ymm1
vfmadd231ps ymm2, ymm2, ymm2
vfmadd231ps ymm3, ymm3, ymm3
vfmadd231ps ymm4, ymm4, ymm4
vfmadd231ps ymm5, ymm5, ymm5
vfmadd231ps ymm6, ymm6, ymm6
vfmadd231ps ymm7, ymm7, ymm7
vfmadd231ps ymm8, ymm8, ymm8
vfmadd231ps ymm9, ymm9, ymm9
vpsubd ymm10, ymm10, ymm11
vpsubd ymm11, ymm11, ymm12
vpsubd ymm12, ymm12, ymm13
vpsubd ymm13, ymm13, ymm14
vpsubd ymm14, ymm14, ymm10
dec rcx
jnz startLabel5
Code6b: (revised, memory operands for vpaddds only)
vzeroall
mov rcx, 1000000
startLabel6:
vfmadd231ps ymm0, ymm0, ymm0
vfmadd231ps ymm1, ymm1, ymm1
vfmadd231ps ymm2, ymm2, ymm2
vfmadd231ps ymm3, ymm3, ymm3
vfmadd231ps ymm4, ymm4, ymm4
vfmadd231ps ymm5, ymm5, ymm5
vfmadd231ps ymm6, ymm6, ymm6
vfmadd231ps ymm7, ymm7, ymm7
vfmadd231ps ymm8, ymm8, ymm8
vfmadd231ps ymm9, ymm9, ymm9
vpaddd ymm10, ymm10, [mem]
vpaddd ymm11, ymm11, [mem]
vpaddd ymm12, ymm12, [mem]
vpaddd ymm13, ymm13, [mem]
vpaddd ymm14, ymm14, [mem]
dec rcx
jnz startLabel6
Code7: (same as Code1 but vpaddds use ymm15)
vzeroall
mov rcx, 1000000
startLabel7:
vfmadd231ps ymm0, ymm0, ymm0
vfmadd231ps ymm1, ymm1, ymm1
vfmadd231ps ymm2, ymm2, ymm2
vfmadd231ps ymm3, ymm3, ymm3
vfmadd231ps ymm4, ymm4, ymm4
vfmadd231ps ymm5, ymm5, ymm5
vfmadd231ps ymm6, ymm6, ymm6
vfmadd231ps ymm7, ymm7, ymm7
vfmadd231ps ymm8, ymm8, ymm8
vfmadd231ps ymm9, ymm9, ymm9
vpaddd ymm10, ymm15, ymm15
vpaddd ymm11, ymm15, ymm15
vpaddd ymm12, ymm15, ymm15
vpaddd ymm13, ymm15, ymm15
vpaddd ymm14, ymm15, ymm15
dec rcx
jnz startLabel7
Code8: (same as Code7 but uses xmm instead of ymm)
vzeroall
mov rcx, 1000000
startLabel8:
vfmadd231ps xmm0, xmm0, xmm0
vfmadd231ps xmm1, xmm1, xmm1
vfmadd231ps xmm2, xmm2, xmm2
vfmadd231ps xmm3, xmm3, xmm3
vfmadd231ps xmm4, xmm4, xmm4
vfmadd231ps xmm5, xmm5, xmm5
vfmadd231ps xmm6, xmm6, xmm6
vfmadd231ps xmm7, xmm7, xmm7
vfmadd231ps xmm8, xmm8, xmm8
vfmadd231ps xmm9, xmm9, xmm9
vpaddd xmm10, xmm15, xmm15
vpaddd xmm11, xmm15, xmm15
vpaddd xmm12, xmm15, xmm15
vpaddd xmm13, xmm15, xmm15
vpaddd xmm14, xmm15, xmm15
dec rcx
jnz startLabel8
Measured TSC clocks with Turbo and C1E disabled:

        Haswell       Broadwell                Skylake
CPUID   306C3, 40661  306D4, 40671             506E3
Code1   ~5000000      ~7730000 -> ~54% slower  ~5500000 -> ~10% slower
Code2   ~5000000      ~5000000                 ~5000000
Code3   ~6000000      ~5000000                 ~5000000
Code4   ~5000000      ~7730000                 ~5500000
Code5   ~5000000      ~7730000                 ~5500000
Code6b  ~5000000      ~8380000                 ~5500000
Code7   ~5000000      ~5000000                 ~5000000
Code8   ~5000000      ~5000000                 ~5000000
Can somebody explain what happens with Code1 on Broadwell? My guess is that Broadwell somehow contaminates Port 1 with the vpaddds in the Code1 case, whereas Haswell uses Port 5 only when Port 0 and Port 1 are full.
Do you have any idea how to reach ~5000000 clks on Broadwell with FMA instructions? I tried reordering. Similar behavior occurs with double and qword.
I used Windows 8.1 and Windows 10.
Update:
Added Code3 per Marat Dukhan's idea (long VEX prefix);
Extended the result table with Skylake measurements;
Uploaded a VS2015 Community + MASM sample code here
Update2:
I tried xmm registers instead of ymm (Code4). Same result on Broadwell.
Update3:
I added Code5 following Peter Cordes's idea (substituting the vpaddds with other instructions: vpxor, vpor, vpand, vpandn, vpsubd). If the new instruction is not a zeroing idiom (vpxor or vpsubd with the same register), the result is the same on BDW. The sample project is updated with Code4 and Code5.
Update4:
I added Code6 following Stephen Canon's idea (memory operands). The result is ~8200000 clks.
The sample project is updated with Code6.
I checked the CPU frequency and possible throttling with the System Stability Test of AIDA64. The frequency is stable and there is no sign of throttling.
Intel IACA 2.1 Haswell throughput analysis:
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - Assembly.obj
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 5.10 Cycles Throughput Bottleneck: Port0, Port1, Port5
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 5.0 0.0 | 5.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 5.0 | 1.0 | 0.0 |
---------------------------------------------------------------------------------------
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | 1.0 | | | | | | | | CP | vfmadd231ps ymm0, ymm0, ymm0
| 1 | | 1.0 | | | | | | | CP | vfmadd231ps ymm1, ymm1, ymm1
| 1 | 1.0 | | | | | | | | CP | vfmadd231ps ymm2, ymm2, ymm2
| 1 | | 1.0 | | | | | | | CP | vfmadd231ps ymm3, ymm3, ymm3
| 1 | 1.0 | | | | | | | | CP | vfmadd231ps ymm4, ymm4, ymm4
| 1 | | 1.0 | | | | | | | CP | vfmadd231ps ymm5, ymm5, ymm5
| 1 | 1.0 | | | | | | | | CP | vfmadd231ps ymm6, ymm6, ymm6
| 1 | | 1.0 | | | | | | | CP | vfmadd231ps ymm7, ymm7, ymm7
| 1 | 1.0 | | | | | | | | CP | vfmadd231ps ymm8, ymm8, ymm8
| 1 | | 1.0 | | | | | | | CP | vfmadd231ps ymm9, ymm9, ymm9
| 1 | | | | | | 1.0 | | | CP | vpaddd ymm10, ymm10, ymm10
| 1 | | | | | | 1.0 | | | CP | vpaddd ymm11, ymm11, ymm11
| 1 | | | | | | 1.0 | | | CP | vpaddd ymm12, ymm12, ymm12
| 1 | | | | | | 1.0 | | | CP | vpaddd ymm13, ymm13, ymm13
| 1 | | | | | | 1.0 | | | CP | vpaddd ymm14, ymm14, ymm14
| 1 | | | | | | | 1.0 | | | dec rcx
| 0F | | | | | | | | | | jnz 0xffffffffffffffaa
Total Num Of Uops: 16
I followed jcomeau_ictx's idea and modified Agner Fog's testp.zip (published 2015-12-22).
The port usage on the BDW 306D4:
Clock Core cyc Instruct uop p0 uop p1 uop p5 uop p6
Code1: 7734720 7734727 17000001 4983410 5016592 5000001 1000001
Code2: 5000072 5000072 17000001 5000010 5000014 4999978 1000002
The port distribution is nearly as perfect as on Haswell. Then I checked the
resource stall counters (event 0xa2):
Clock Core cyc Instruct res.stl. RS stl. SB stl. ROB stl.
Code1: 7736212 7736213 17000001 3736191 3736143 0 0
Code2: 5000068 5000072 17000001 1000050 999957 0 0
It seems to me that the Code1/Code2 difference comes from the RS stalls.
Remark from the Intel SDM: "Cycles stalled due to no eligible RS entry
available."
How can I avoid this stall with FMA?
Update5:
Code6 changed: as Peter Cordes drew my attention to, only the vpaddds use memory operands now. No effect on HSW and SKL; BDW gets worse.
As Marat Dukhan measured, not just vpadd/vpsub/vpand/vpandn/vpxor are affected, but other Port5-bound instructions as well, like vmovaps, vblendps, vpermps, vshufps, vbroadcastss.
As IwillnotexistIdonotexist suggested, I tried other operands. A successful modification is Code7, where all vpaddds use ymm15. This version produces ~5000000 clks on BDW, but only for a while: after ~6 million FMA pairs it reaches the usual ~7730000 clks:
Clock Core cyc Instruct res.stl. RS stl. SB stl. ROB stl.
5133724 5110723 17000001 1107998 946376 0 0
6545476 6545482 17000001 2545453 1 0 0
6545468 6545471 17000001 2545437 90910 0 0
5000016 5000019 17000001 999992 999992 0 0
7671620 7617127 17000003 3614464 3363363 0 0
7737340 7737345 17000001 3737321 3737259 0 0
7802916 7747108 17000003 3737478 3735919 0 0
7928784 7796057 17000007 3767962 3676744 0 0
7941072 7847463 17000003 3781103 3651595 0 0
7787812 7779151 17000005 3765109 3685600 0 0
7792524 7738029 17000002 3736858 3736764 0 0
7736000 7736007 17000001 3735983 3735945 0 0
I tried the xmm version of Code7 as Code8. The effect is similar, but the faster runtime is sustained longer. I haven't found a significant difference between a 1.6 GHz i5-5250U and a 3.7 GHz i7-5775C.
16 and 17 were made with Hyper-Threading disabled. With HTT enabled the effect is less pronounced.
Updated
I've got no explanation for you, since I'm on Haswell, but I do have code to share that might help you or someone else with Broadwell or Skylake hardware isolate the problem. If you could run it on your machine and share the results, we could gain insight into what's happening.
Intro
Recent Intel Core i7 processors have 7 performance monitor counters (PMCs), 3 fixed-function and 4 general-purpose, that may be used to profile code. The fixed-function PMCs are:
Instructions retired
Unhalted core cycles (Clock ticks including the effects of TurboBoost)
Unhalted Reference cycles (Fixed-frequency clock ticks)
The ratio of core:reference clock cycles determines the relative speedup or slowdown from dynamic frequency scaling.
Although software exists (see the comments below) that accesses these counters, I did not know of it, and I still find it insufficiently fine-grained.
I therefore wrote myself a Linux kernel module, perfcount, over the past few days to grant me access to the Intel performance counters, plus a userspace testbench and library that wraps your FMA code in calls to my LKM. Instructions for reproducing my setup follow below.
My testbench source code is below. It warms up, then runs your code several times, testing it over a long list of metrics. I changed your loop count to 1 billion. Because only 4 general-purpose PMCs can be programmed at once, I do the measurements 4 at a time.
perfcountdemo.c
/* Includes */
#include "libperfcount.h"
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* Function prototypes */
void code1(void);
void code2(void);
void code3(void);
void code4(void);
void code5(void);
/* Global variables */
void ((*FN_TABLE[])(void)) = {
code1,
code2,
code3,
code4,
code5
};
/**
* Code snippets to bench
*/
void code1(void){
asm volatile(
".intel_syntax noprefix\n\t"
"vzeroall\n\t"
"mov rcx, 1000000000\n\t"
"LstartLabel1:\n\t"
"vfmadd231ps ymm0, ymm0, ymm0\n\t"
"vfmadd231ps ymm1, ymm1, ymm1\n\t"
"vfmadd231ps ymm2, ymm2, ymm2\n\t"
"vfmadd231ps ymm3, ymm3, ymm3\n\t"
"vfmadd231ps ymm4, ymm4, ymm4\n\t"
"vfmadd231ps ymm5, ymm5, ymm5\n\t"
"vfmadd231ps ymm6, ymm6, ymm6\n\t"
"vfmadd231ps ymm7, ymm7, ymm7\n\t"
"vfmadd231ps ymm8, ymm8, ymm8\n\t"
"vfmadd231ps ymm9, ymm9, ymm9\n\t"
"vpaddd ymm10, ymm10, ymm10\n\t"
"vpaddd ymm11, ymm11, ymm11\n\t"
"vpaddd ymm12, ymm12, ymm12\n\t"
"vpaddd ymm13, ymm13, ymm13\n\t"
"vpaddd ymm14, ymm14, ymm14\n\t"
"dec rcx\n\t"
"jnz LstartLabel1\n\t"
".att_syntax prefix\n\t"
: /* No outputs we care about */
: /* No inputs we care about */
: "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7",
"xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15",
"rcx",
"memory"
);
}
void code2(void){
}
void code3(void){
}
void code4(void){
}
void code5(void){
}
/* Test Schedule */
const char* const SCHEDULE[] = {
/* Batch */
"uops_issued.any",
"uops_issued.any<1",
"uops_issued.any>=1",
"uops_issued.any>=2",
/* Batch */
"uops_issued.any>=3",
"uops_issued.any>=4",
"uops_issued.any>=5",
"uops_issued.any>=6",
/* Batch */
"uops_executed_port.port_0",
"uops_executed_port.port_1",
"uops_executed_port.port_2",
"uops_executed_port.port_3",
/* Batch */
"uops_executed_port.port_4",
"uops_executed_port.port_5",
"uops_executed_port.port_6",
"uops_executed_port.port_7",
/* Batch */
"resource_stalls.any",
"resource_stalls.rs",
"resource_stalls.sb",
"resource_stalls.rob",
/* Batch */
"uops_retired.all",
"uops_retired.all<1",
"uops_retired.all>=1",
"uops_retired.all>=2",
/* Batch */
"uops_retired.all>=3",
"uops_retired.all>=4",
"uops_retired.all>=5",
"uops_retired.all>=6",
/* Batch */
"inst_retired.any_p",
"inst_retired.any_p<1",
"inst_retired.any_p>=1",
"inst_retired.any_p>=2",
/* Batch */
"inst_retired.any_p>=3",
"inst_retired.any_p>=4",
"inst_retired.any_p>=5",
"inst_retired.any_p>=6",
/* Batch */
"idq_uops_not_delivered.core",
"idq_uops_not_delivered.core<1",
"idq_uops_not_delivered.core>=1",
"idq_uops_not_delivered.core>=2",
/* Batch */
"idq_uops_not_delivered.core>=3",
"idq_uops_not_delivered.core>=4",
"rs_events.empty",
"idq.empty",
/* Batch */
"idq.mite_all_uops",
"idq.mite_all_uops<1",
"idq.mite_all_uops>=1",
"idq.mite_all_uops>=2",
/* Batch */
"idq.mite_all_uops>=3",
"idq.mite_all_uops>=4",
"move_elimination.int_not_eliminated",
"move_elimination.simd_not_eliminated",
/* Batch */
"lsd.uops",
"lsd.uops<1",
"lsd.uops>=1",
"lsd.uops>=2",
/* Batch */
"lsd.uops>=3",
"lsd.uops>=4",
"ild_stall.lcp",
"ild_stall.iq_full",
/* Batch */
"br_inst_exec.all_branches",
"br_inst_exec.0x81",
"br_inst_exec.0x82",
"icache.misses",
/* Batch */
"br_misp_exec.all_branches",
"br_misp_exec.0x81",
"br_misp_exec.0x82",
"fp_assist.any",
/* Batch */
"cpu_clk_unhalted.core_clk",
"cpu_clk_unhalted.ref_xclk",
"baclears.any"
};
const int NUMCOUNTS = sizeof(SCHEDULE)/sizeof(*SCHEDULE);
/**
* Main
*/
int main(int argc, char* argv[]){
int i;
/**
* Initialize
*/
pfcInit();
if(argc <= 1){
pfcDumpEvents();
exit(1);
}
pfcPinThread(3);
/**
* Arguments are:
*
* perfcountdemo #codesnippet
*
* There is a schedule of configuration that is followed.
*/
void (*fn)(void) = FN_TABLE[strtoull(argv[1], NULL, 0)];
static const uint64_t ZERO_CNT[7] = {0,0,0,0,0,0,0};
static const uint64_t ZERO_CFG[7] = {0,0,0,0,0,0,0};
uint64_t cnt[7] = {0,0,0,0,0,0,0};
uint64_t cfg[7] = {2,2,2,0,0,0,0};
/* Warmup */
for(i=0;i<10;i++){
fn();
}
/* Run master loop */
for(i=0;i<NUMCOUNTS;i+=4){
/* Configure counters */
const char* sched0 = i+0 < NUMCOUNTS ? SCHEDULE[i+0] : "";
const char* sched1 = i+1 < NUMCOUNTS ? SCHEDULE[i+1] : "";
const char* sched2 = i+2 < NUMCOUNTS ? SCHEDULE[i+2] : "";
const char* sched3 = i+3 < NUMCOUNTS ? SCHEDULE[i+3] : "";
cfg[3] = pfcParseConfig(sched0);
cfg[4] = pfcParseConfig(sched1);
cfg[5] = pfcParseConfig(sched2);
cfg[6] = pfcParseConfig(sched3);
pfcWrConfigCnts(0, 7, cfg);
pfcWrCountsCnts(0, 7, ZERO_CNT);
pfcRdCountsCnts(0, 7, cnt);
/* ^ Should report 0s, and launch the counters. */
/************** Hot section **************/
fn();
/************ End Hot section ************/
pfcRdCountsCnts(0, 7, cnt);
pfcWrConfigCnts(0, 7, ZERO_CFG);
/* ^ Should clear the counter config and disable them. */
/**
* Print the lovely results
*/
printf("Instructions Issued : %20llu\n", cnt[0]);
printf("Unhalted core cycles : %20llu\n", cnt[1]);
printf("Unhalted reference cycles : %20llu\n", cnt[2]);
printf("%-35s: %20llu\n", sched0, cnt[3]);
printf("%-35s: %20llu\n", sched1, cnt[4]);
printf("%-35s: %20llu\n", sched2, cnt[5]);
printf("%-35s: %20llu\n", sched3, cnt[6]);
}
/**
* Close up shop
*/
pfcFini();
}
On my machine, I got the following results:
Haswell Core i7-4700MQ
> ./perfcountdemo 0
Instructions Issued : 17000001807
Unhalted core cycles : 5305920785
Unhalted reference cycles : 4245764952
uops_issued.any : 16000811079
uops_issued.any<1 : 1311417889
uops_issued.any>=1 : 4000292290
uops_issued.any>=2 : 4000229358
Instructions Issued : 17000001806
Unhalted core cycles : 5303822082
Unhalted reference cycles : 4243345896
uops_issued.any>=3 : 4000156998
uops_issued.any>=4 : 4000110067
uops_issued.any>=5 : 0
uops_issued.any>=6 : 0
Instructions Issued : 17000001811
Unhalted core cycles : 5314227923
Unhalted reference cycles : 4252020624
uops_executed_port.port_0 : 5016261477
uops_executed_port.port_1 : 5036728509
uops_executed_port.port_2 : 5282
uops_executed_port.port_3 : 12481
Instructions Issued : 17000001816
Unhalted core cycles : 5329351248
Unhalted reference cycles : 4265809728
uops_executed_port.port_4 : 7087
uops_executed_port.port_5 : 4946019835
uops_executed_port.port_6 : 1000228324
uops_executed_port.port_7 : 1372
Instructions Issued : 17000001816
Unhalted core cycles : 5325153463
Unhalted reference cycles : 4261060248
resource_stalls.any : 1322734589
resource_stalls.rs : 844250210
resource_stalls.sb : 0
resource_stalls.rob : 0
Instructions Issued : 17000001814
Unhalted core cycles : 5327823817
Unhalted reference cycles : 4262914728
uops_retired.all : 16000445793
uops_retired.all<1 : 687284798
uops_retired.all>=1 : 4646263984
uops_retired.all>=2 : 4452324050
Instructions Issued : 17000001809
Unhalted core cycles : 5311736558
Unhalted reference cycles : 4250015688
uops_retired.all>=3 : 3545695253
uops_retired.all>=4 : 3341664653
uops_retired.all>=5 : 1016
uops_retired.all>=6 : 1
Instructions Issued : 17000001871
Unhalted core cycles : 5477215269
Unhalted reference cycles : 4383891984
inst_retired.any_p : 17000001871
inst_retired.any_p<1 : 891904306
inst_retired.any_p>=1 : 4593972062
inst_retired.any_p>=2 : 4441024510
Instructions Issued : 17000001835
Unhalted core cycles : 5377202052
Unhalted reference cycles : 4302895152
inst_retired.any_p>=3 : 3555852364
inst_retired.any_p>=4 : 3369559466
inst_retired.any_p>=5 : 999980244
inst_retired.any_p>=6 : 0
Instructions Issued : 17000001826
Unhalted core cycles : 5349373678
Unhalted reference cycles : 4280991912
idq_uops_not_delivered.core : 1580573
idq_uops_not_delivered.core<1 : 5354931839
idq_uops_not_delivered.core>=1 : 471248
idq_uops_not_delivered.core>=2 : 418625
Instructions Issued : 17000001808
Unhalted core cycles : 5309687640
Unhalted reference cycles : 4248083976
idq_uops_not_delivered.core>=3 : 280800
idq_uops_not_delivered.core>=4 : 247923
rs_events.empty : 0
idq.empty : 649944
Instructions Issued : 17000001838
Unhalted core cycles : 5392229041
Unhalted reference cycles : 4315704216
idq.mite_all_uops : 2496139
idq.mite_all_uops<1 : 5397877484
idq.mite_all_uops>=1 : 971582
idq.mite_all_uops>=2 : 595973
Instructions Issued : 17000001822
Unhalted core cycles : 5347205506
Unhalted reference cycles : 4278845208
idq.mite_all_uops>=3 : 394011
idq.mite_all_uops>=4 : 335205
move_elimination.int_not_eliminated: 0
move_elimination.simd_not_eliminated: 0
Instructions Issued : 17000001812
Unhalted core cycles : 5320621549
Unhalted reference cycles : 4257095280
lsd.uops : 15999287982
lsd.uops<1 : 1326629729
lsd.uops>=1 : 3999821996
lsd.uops>=2 : 3999821996
Instructions Issued : 17000001813
Unhalted core cycles : 5320533147
Unhalted reference cycles : 4257105096
lsd.uops>=3 : 3999823498
lsd.uops>=4 : 3999823498
ild_stall.lcp : 0
ild_stall.iq_full : 3468
Instructions Issued : 17000001813
Unhalted core cycles : 5323278281
Unhalted reference cycles : 4258969200
br_inst_exec.all_branches : 1000016626
br_inst_exec.0x81 : 1000016616
br_inst_exec.0x82 : 0
icache.misses : 294
Instructions Issued : 17000001812
Unhalted core cycles : 5315098728
Unhalted reference cycles : 4253082504
br_misp_exec.all_branches : 5
br_misp_exec.0x81 : 2
br_misp_exec.0x82 : 0
fp_assist.any : 0
Instructions Issued : 17000001819
Unhalted core cycles : 5338484610
Unhalted reference cycles : 4271432976
cpu_clk_unhalted.core_clk : 5338494250
cpu_clk_unhalted.ref_xclk : 177976806
baclears.any : 1
: 0
We may see that on Haswell, everything is well-oiled. I'll make a few notes from the above stats:
Instructions issued is incredibly consistent for me. It's always around 17000001800, which is a good sign: It means we can make a very good estimate of our overhead. Idem for the other fixed-function counters. The fact that they all match reasonably well means that the tests in batches of 4 are apples-to-apples comparisons.
With a core:reference cycle ratio of around 5305920785/4245764952, we get an average frequency scaling of ~1.25; this jibes well with my observation that my core clocked up from 2.4 GHz to 3.0 GHz. cpu_clk_unhalted.core_clk/(10.0*cpu_clk_unhalted.ref_xclk) gives just under 3 GHz too.
The ratio of instructions issued to core cycles gives the IPC, 17000001807/5305920785 ~ 3.20, which is also about right: 2 FMA+1 VPADDD every clock cycle for 4 clock cycles, and 2 extra loop control instructions every 5th clock cycle that go in parallel.
uops_issued.any: The number of instructions issued is ~17B, but the number of uops issued is ~16B. That's because the two loop-control instructions are fusing together; a good sign. Moreover, during around 1.3B of the 5.3B clock cycles (25% of the time) no uops were issued, while for nearly all of the remaining time (~4B clock cycles) 4 uops were issued at a time.
uops_executed_port.port_[0-7]: Port saturation. We're in good health. Of the 16B post-fusion uops, Ports 0, 1 and 5 ate 5B uops each over 5.3B cycles (Which means they were distributed optimally: Float, float, int respectively), Port 6 ate 1B (the fused dec-branch op), and ports 2, 3, 4 and 7 ate negligible amounts by comparison.
resource_stalls: 1.3B of them occurred, 2/3 of which were due to the reservation station (RS) and the other third to unknown causes.
From the cumulative distribution we built with our comparisons on uops_retired.all and inst_retired.any_p, we know that we retire 4 uops 60% of the time, 0 uops 13% of the time, and 2 uops most of the rest of the time, with negligible amounts otherwise.
(Numerous *idq* counts): The IDQ only rarely holds us up.
lsd: The Loop Stream Detector is working; Nearly 16B fused uops were supplied to the frontend from it.
ild: Instruction length decoding is not the bottleneck, and not a single length-changing prefix is encountered.
br_inst_exec/br_misp_exec: Branch misprediction is a negligible problem.
icache.misses: Negligible.
fp_assist: Negligible. Denormals not encountered. (I believe that without DAZ denormals-are-zero flushing, they'd require an assist, which should register here)
So on Intel Haswell it's smooth sailing. If you could run my suite on your machines, that would be great.
Instructions for Reproduction
Rule #1: Inspect all my code before doing anything with it. Never blindly trust strangers on the Internet.
Grab perfcountdemo.c, libperfcount.c and libperfcount.h, put them in the same directory and compile them together.
Grab perfcount.c and Makefile, put them in the same directory, and make the kernel module.
Reboot your machine with the GRUB boot flags nmi_watchdog=0 modprobe.blacklist=iTCO_wdt,iTCO_vendor_support. The NMI watchdog will tamper with the unhalted-core-cycle counter otherwise.
Load the module with insmod perfcount.ko. dmesg | tail -n 10 should say it loaded successfully and report 3 Ff counters and 4 Gp counters, or else give a reason for failing to do so.
Run my application, preferably while the rest of the system is not under load. Try also changing the core to which perfcountdemo.c restricts its affinity by changing the argument to pfcPinThread().
Then edit the results into your question here.
Update: a previous version contained 6 VPADDD instructions (vs 5 in the question), and the extra VPADDD caused imbalance on Broadwell. After it was fixed, Haswell, Broadwell and Skylake issue almost the same number of uops to Ports 0, 1 and 5.
There is no port contamination, but uops are scheduled suboptimally: on Broadwell the majority of uops go to Port 5, making it the bottleneck before Ports 0 and 1 are saturated.
To demonstrate what is going on, I suggest (ab)using the demo on PeachPy.IO:
Open www.peachpy.io in Google Chrome (it wouldn't work in other browsers).
Replace the default code (which implements the SDOT function) with the code below, which is literally your example ported to PeachPy syntax:
n = Argument(size_t)
x = Argument(ptr(const_float_))
incx = Argument(size_t)
y = Argument(ptr(const_float_))
incy = Argument(size_t)
with Function("sdot", (n, x, incx, y, incy)) as function:
    reg_n = GeneralPurposeRegister64()
    LOAD.ARGUMENT(reg_n, n)
    VZEROALL()
    with Loop() as loop:
        for i in range(15):
            ymm_i = YMMRegister(i)
            if i < 10:
                VFMADD231PS(ymm_i, ymm_i, ymm_i)
            else:
                VPADDD(ymm_i, ymm_i, ymm_i)
        DEC(reg_n)
        JNZ(loop.begin)
    RETURN()
I have a number of machines on different microarchitectures as a backend for PeachPy.io. Choose Intel Haswell, Intel Broadwell, or Intel Skylake and press "Quick Run". The system will compile your code, upload it to the server, and visualize the performance counters collected during execution.
Here is the uops distribution over execution ports on Intel Haswell:
And here is the same plot from Intel Broadwell:
Apparently, whatever the flaw in the uop scheduler was, it was fixed in Intel Skylake, because the port pressure on that machine is the same as on Haswell.
Related
GCC for Aarch64: what generated NOPs are used for?
I built CoreMark for Aarch64 using aarch64-none-elf-gcc with the following options: -mcpu=cortex-a57 -Wall -Wextra -g -O2 In disassembled code I see many NOPs. A few examples: 0000000040001540 <matrix_mul_const>: 40001540: 13003c63 sxth w3, w3 40001544: 34000240 cbz w0, 4000158c <matrix_mul_const+0x4c> 40001548: 2a0003e6 mov w6, w0 4000154c: 52800007 mov w7, #0x0 // #0 40001550: 52800008 mov w8, #0x0 // #0 40001554: d503201f nop 40001558: 2a0703e4 mov w4, w7 4000155c: d503201f nop 40001560: 78e45845 ldrsh w5, [x2, w4, uxtw #1] ... 00000000400013a0 <core_init_matrix>: 400013a0: 7100005f cmp w2, #0x0 400013a4: 2a0003e6 mov w6, w0 400013a8: 1a9f1442 csinc w2, w2, wzr, ne // ne = any 400013ac: 52800004 mov w4, #0x0 // #0 400013b0: 34000620 cbz w0, 40001474 <core_init_matrix+0xd4> 400013b4: d503201f nop 400013b8: 2a0403e0 mov w0, w4 400013bc: 11000484 add w4, w4, #0x1 A simple question: what these NOPs are used for? UPD. Yes, it is related to alignment. Here is the corresponding generated assembly code: matrix_mul_const: .LVL41: .LFB4: .loc 1 270 1 is_stmt 1 view -0 .cfi_startproc .loc 1 271 5 view .LVU127 .loc 1 272 5 view .LVU128 .loc 1 272 19 view .LVU129 .loc 1 270 1 is_stmt 0 view .LVU130 sxth w3, w3 .loc 1 272 19 view .LVU131 cbz w0, .L25 .loc 1 276 51 view .LVU132 mov w6, w0 mov w7, 0 .loc 1 272 12 view .LVU133 mov w8, 0 .LVL42: .p2align 3,,7 .L27: .loc 1 274 23 is_stmt 1 view .LVU134 .loc 1 270 1 is_stmt 0 view .LVU135 mov w4, w7 .LVL43: .p2align 3,,7 .L28: .loc 1 276 13 is_stmt 1 discriminator 3 view .LVU136 .loc 1 276 28 is_stmt 0 discriminator 3 view .LVU137 ldrsh w5, [x2, w4, uxtw 1] Here we see .p2align 3,,7. These .p2align xxx are result of -O2: $ aarch64-none-elf-gcc -Wall -Wextra -g -O1 -ffreestanding -c core_matrix.c -S ;\ grep '.p2align' core_matrix.s | sort | uniq <nothing> $ aarch64-none-elf-gcc -Wall -Wextra -g -O2 -ffreestanding -c core_matrix.c -S ;\ grep '.p2align' core_matrix.s | sort | uniq .p2align 2,,3 .p2align 3,,7 .p2align 4,,11
Access violation while requesting memory in getmem.inc
I'm trying to understand an access violation (c0000005) in my C++ program built with "Codegear C++ Builder 2009". I caught the access violation with procdump and analysed it with WinDbg. Here is the information I gathered with WinDbg.

Callstack:

     # ChildEBP RetAddr  Args to Child
    >00 0d98f960 004b3728 0001e000 108c6af0 0001e000 MMIServer!SystemSysGetMem$qqri+0x316 [GETMEM.INC @ 2015]
     01 0d98f978 004b443f 0d98f988 004890b8 108d5b20 MMIServer!SysReallocMem+0x2dc [GETMEM.INC @ 3404]
     02 0d98f998 0048910c 0d98f9f4 108c6af0 00482730 MMIServer!ReallocMem+0x13 [System.pas @ 3521]
     03 0d98f9ac 00488a40 09b60b58 108c6af0 0000f000 MMIServer!TMemoryStreamWrite+0x30 [Classes.pas @ 6181]
     04 0d98f9bc 00488b0e 0d98fa04 00488b3c 0d98f9f4 MMIServer!ClassesTStreamWriteBuffer+0x18 [Classes.pas @ 5789]
     05 0d98f9f4 0046d910 0000fc40 00000000 0d98fa10 MMIServer!ClassesTStreamCopyFrom+0xae [Classes.pas @ 5814]
     06 0d98fa38 0046dc85 0d98faec 0046dcec 0d98fad8 MMIServer!TOPToSoapDomConvertDOMToStream+0x80 [..\..\Patches\BDS2009\TByteDynArrayThroughSOAP\OPToSOAPDomConv.pas @ 895]
     07 0d98fad8 0052c272 01ca3930 10680b90 10d4daa8 MMIServer!TOPToSoapDomConvertMakeResponse+0x31d [..\..\Patches\BDS2009\TByteDynArrayThroughSOAP\OPToSOAPDomConv.pas @ 987]
     08 0d98fb94 0052eff9 0d98fcfb 10680b90 10680870 MMIServer!TSoapPascalInvokerInvoke+0x34e [SOAPPasInv.pas @ 230]
     09 0d98fc24 00526edb 0d98fcfb 10680b90 10680870 MMIServer!THTTPSoapPascalInvokerDispatchSOAP+0x20d [soaphttppasinv.pas @ 82]
     0a 0d98fd2c 00566d5b 10d1fee0 0d98fd6c 00566d83 MMIServer!THTTPSoapDispatcherDispatchRequest+0x28b [WebBrokerSOAP.pas @ 223]
     0b 0d98fd5c 00566e88 00000000 10d1fee0 0d98fda4 MMIServer!DispatchHandler+0x8b [HTTPApp.pas @ 1511]
     0c 0d98fd98 00567063 0046b308 0d98fdb0 0046b321 MMIServer!TCustomWebDispatcherDispatchAction+0xf0 [HTTPApp.pas @ 1546]
     0d 0d98fe10 0046b205 08dae450 02e3f360 09b21b70 MMIServer!TCustomWebDispatcherHandleRequest+0xb [HTTPApp.pas @ 1594]
     0e 0d98fe28 005b6e1f 08dae450 0d98fe40 005b6e43 MMIServer!TIdHTTPWebBrokerBridgeDoCommandGet+0x2d [..\..\Patches\Indy10\IdHTTPWebBrokerBridge.pas @ 964]
     0f 0d98fefc 005d86be 01c23370 005c7939 005c7783 MMIServer!IdcustomhttpserverTIdCustomHTTPServerDoExecute$qqrp20IdcontextTIdContext+0x683
     10 0d98ff70 0048da71 0d98ff84 0048da7b 0d98ffa0 MMIServer!IdcontextTIdContextRun$qqrv+0x12
     11 0d98ffa0 004b63f2 0d98ffdc 004b5e74 0d98ffb4 MMIServer!ThreadProc+0x45 [Classes.pas @ 10892]
     12 0d98ffb4 7c80b729 08b3e090 030afabc 00000000 MMIServer!ThreadWrapper+0x2a [System.pas @ 13819]

CodeGear\RAD Studio\6.0\source\Win32\rtl\sys\getmem.inc:

     2004 @GotBinAndGroup:
     2005   {ebx = block size, ecx = bin number, edx = group number}
     2006   push esi
     2007   push edi
     2008   {Get a pointer to the bin in edi}
     2009   lea edi, [MediumBlockBins + ecx * 8]
     2010   {Get the free block in esi}
     2011   mov esi, TMediumFreeBlock[edi].NextFreeBlock
     2012   {Remove the first block from the linked list (LIFO)}
     2013   mov eax, TMediumFreeBlock[esi].NextFreeBlock
     2014   mov TMediumFreeBlock[edi].NextFreeBlock, eax
    >2015   mov TMediumFreeBlock[eax].PreviousFreeBlock, edi
     2016   {Is this bin now empty?}
     2017   cmp edi, eax
     2018   jne @MediumBinNotEmptyForMedium
     2019   {eax = bin group number, ecx = bin number, edi = @bin, esi = free block, ebx = block size}
     2020   {Flag this bin as empty}
     2021   mov eax, -2
     2022   rol eax, cl
     2023   and dword ptr [MediumBlockBinBitmaps + edx * 4], eax
     2024   jnz @MediumBinNotEmptyForMedium
     2025   {Flag the group as empty}
     2026   btr MediumBlockBinGroupBitmap, edx
     2027 @MediumBinNotEmptyForMedium:

Disassembly:

     004b31d4 56              push esi
     004b31d5 57              push edi
     004b31d6 8d3ccd00be3201  lea edi,MMIServer!NeverSleepOnMMThreadContention+0x1ef (0132be00)[ecx*8]
     004b31dd 8b7704          mov esi,dword ptr [edi+4]
     004b31e0 8b4604          mov eax,dword ptr [esi+4]
     004b31e3 894704          mov dword ptr [edi+4],eax
    >004b31e6 8938            mov dword ptr [eax],edi  ds:0023:00000000=????????
     004b31e8 39c7            cmp edi,eax
     004b31ea 7517            jne MMIServer!SystemSysGetMem$qqri+0x333 (004b3203)
     004b31ec b8feffffff      mov eax,0FFFFFFFEh
     004b31f1 d3c0            rol eax,cl
     004b31f3 21049580bd3201  and dword ptr MMIServer!NeverSleepOnMMThreadContention+0x16f (0132bd80)[edx*4],eax
     004b31fa 7507            jne MMIServer!SystemSysGetMem$qqri+0x333 (004b3203)
     004b31fc 0fb3157cbd3201  btr dword ptr [MMIServer!NeverSleepOnMMThreadContention+0x16b (0132bd7c)],edx

My understanding of this piece of code of getmem.inc is: there is a doubly-linked list of free memory blocks, one block is taken from this list, and the doubly-linked list is reconnected. In a drawing:

    |----------|--------------------->|----------|--------------------->|----------|
    |          |    NextFreeBlock     |          |    NextFreeBlock     |          |
    | 132d408  |                      | 11290132 |                      | 00000000 |
    |          |<---------------------|          |<---------------------|          |
    |----------| PreviousFreeBlock    |----------| PreviousFreeBlock    |----------|
         ^                                 ^                                 ^
         |                                 |                                 |
        edi                               esi                               eax

    |----------|--------------------->|----------|                      |----------|
    |          |    NextFreeBlock     |          |                      |          |
    | 132d408  |                      | 00000000 |                      | 11290132 |
    |          |<---------------------|          |                      |          |
    |----------| PreviousFreeBlock    |----------|                      |----------|
         ^                                 ^                                 ^
         |                                 |                                 |
        edi                               eax                               esi

Some registers:

    eax 0
    ebx 1e030
    ecx 2c1
    edx 16
    edi 132d408
    esi 11290132
    ebp 1e000
    eip 4b31e6
    esp d98f958

There is a NULL pointer in the doubly-linked list of free memory blocks. The access violation occurred when writing PreviousFreeBlock through this NULL address. How can there be a NULL pointer in the doubly-linked list of free memory blocks? Was the memory already corrupted? Has anyone experienced the same problem in getmem.inc? What can I do to investigate this crash further?
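The removal at line 2015 can be mirrored in C to show exactly where the faulting write happens. The struct and function names below are hypothetical, chosen only to match the field offsets visible in the disassembly (prev at offset 0, next at offset 4):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C mirror of FastMM's TMediumFreeBlock: each medium-block
   bin header is a sentinel node in a circular doubly-linked list of
   free blocks. */
typedef struct FreeBlock {
    struct FreeBlock *prev;   /* PreviousFreeBlock, at [reg]   */
    struct FreeBlock *next;   /* NextFreeBlock,     at [reg+4] */
} FreeBlock;

/* LIFO removal of the first free block, line by line against the asm:
     mov esi, [edi+4]   ; block = bin->next
     mov eax, [esi+4]   ; after = block->next
     mov [edi+4], eax   ; bin->next = after
     mov [eax], edi     ; after->prev = bin   <-- faults when after == NULL */
static FreeBlock *pop_first(FreeBlock *bin)
{
    FreeBlock *block = bin->next;
    FreeBlock *after = block->next;
    bin->next  = after;
    after->prev = bin;    /* the write at getmem.inc line 2015 */
    return block;
}
```

In the dump, eax (`after` here) is 0, so `after->prev = bin` stores to address 0 — exactly the c0000005 at 004b31e6. Because the bin list is circular through its sentinel, NextFreeBlock should never legitimately be NULL; finding one means the free-list links were overwritten before this call (for example by a heap buffer overrun, a double-free, or unsynchronized access from another thread), so the corruption predates this allocation.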
Is there a bottleneck when accessing "original" registers on an i7?
Short version: On the Intel i7, is there some sort of bottleneck in accessing the "original" registers (eax, ebx, ecx, edx) that is not present in the "new" registers (r8d, r9d, etc.)? I'm timing some code, and, if I try to run three add instructions in parallel, I can get a CPI of .33, as long as only two of the three adds reference an "original" register (e.g., I use eax, ebx, and r9d). If I try to use three "original" registers, the CPI goes up to about .4. I observed this on both an i7-3770 and an i7-4790.

The details: I'm trying to develop a new (hopefully interesting) lab for my Computer Architecture class. The goal is for the students to time some assembly code on an Intel i7 processor and observe things like (a) the throughput of the processor, and (b) the consequences of data dependencies. When I try to write some assembly code that exhibits an average CPI of .33 (i.e., demonstrates that the CPU can sustain a throughput of 3 instructions per cycle), I find that this is only possible if at most two of the three instructions access the "original" general-purpose registers.

Experiment setup

Here is the basic outline of the experiment: use rdtsc to time segments of a few thousand instructions, then plot the "cycle count" against the number of instructions timed to estimate the throughput. For example, running this code inside of a loop

    mov $0, %eax
    cpuid
    rdtsc
    movl %eax, %r12d
    addl $1, %eax
    addl $1, %eax
    # the line above is copied "n" times
    # (I use a ruby script to generate this part of the assembly)
    addl $1, %eax
    rdtsc
    subl %r12d, %eax

allows us to report how long (in reference cycles) it takes to run a sequence of n addl instructions. (The segment of code above is part of a longer program that repeats the measurement many times, throws out the first few thousand trials, and reports the lowest and/or most common result.)
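The measurement discipline described above (repeat many times, discard warm-up trials, keep the lowest reading) can be sketched in C. This is a hedged illustration only: `sample_region` is a hypothetical stand-in for the generated addl sequence, and the C loop adds counter overhead the hand-written harness doesn't have.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc */

/* Hypothetical stand-in for one timed region. */
static uint64_t sample_region(void)
{
    volatile long acc = 0;             /* volatile defeats dead-code elimination */
    uint64_t start = __rdtsc();
    for (int i = 0; i < 1000; i++)
        acc += 1;
    return __rdtsc() - start;
}

/* Repeat, discard warm-up runs, and keep the minimum: the run least
   disturbed by interrupts, frequency ramping, and cold caches. */
static uint64_t measure_min(uint64_t (*region)(void), int warmup, int trials)
{
    uint64_t best = ~(uint64_t)0;
    for (int i = 0; i < warmup; i++)
        region();                      /* warm caches, branch predictors, clocks */
    for (int i = 0; i < trials; i++) {
        uint64_t t = region();
        if (t < best)
            best = t;
    }
    return best;
}
```

Unlike the assembly harness, this sketch does not use cpuid to serialize, so it reproduces the shape of the method rather than its precision.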
Results that make sense

When I time a sequence of adds to a single register, I get the expected result:

                  elapsed reference      ref cycles        estimated actual   actual cycles
    instructions  cycles between rdtsc   per instruction   cycles             per instruction
         200             145                 0.72                169               0.84
         300             220                 0.73                256               0.85
         400             314                 0.79                365               0.91
         500             408                 0.82                474               0.95
         600             483                 0.81                562               0.94
         700             577                 0.82                671               0.96
         800             652                 0.81                758               0.95
         900             746                 0.83                867               0.96
        1000             840                 0.84                977               0.98
        1100             915                 0.83               1064               0.97
        1200            1009                 0.84               1173               0.98
    ........................................................................
        3500            3019                 0.86               3510               1.00
        3600            3094                 0.86               3598               1.00
        3700            3188                 0.86               3707               1.00
        3800            3282                 0.86               3816               1.00
        3900            3357                 0.86               3903               1.00
        4000            3451                 0.86               4013               1.00

After converting reference cycles to (estimated) actual cycles, the processor averages about one instruction per cycle. This makes sense because every timed instruction depends on the previous one, thereby preventing parallel execution. Notice that we do not issue a serializing instruction before the final rdtsc. As a result, the last few dozen timed instructions have not yet completed when we "stop" the timer. Consequently, the CPI for the first few rows of this table is artificially low. The effect of this "undercount" tends toward zero as the number of instructions timed increases.
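One hedged way to close that timing window on CPUs that support lfence and rdtscp is to drain earlier instructions before reading the stop timestamp. The function names below are my own; the intrinsics are the standard GCC/Clang ones from `<x86intrin.h>`:

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence */

/* Start timestamp: the lfence keeps rdtsc from executing before
   earlier instructions have completed locally. */
static inline uint64_t tsc_begin(void)
{
    _mm_lfence();
    return __rdtsc();
}

/* Stop timestamp: rdtscp waits until all preceding instructions have
   executed before reading the counter, and the trailing lfence stops
   later instructions from slipping into the timed region. */
static inline uint64_t tsc_end(void)
{
    unsigned int aux;                /* receives IA32_TSC_AUX (core id) */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}
```

With a fenced stop reading, the tail of the timed add chain is complete before the timer stops, so the early rows of a table like the one above would no longer show an artificially low CPI.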
If we modify the timed code to alternate between additions to eax and ebx, we also get the expected result: a CPI that tends toward 0.5:

    mov $0, %eax
    cpuid
    rdtsc
    movl %eax, %r12d
    addl $1, %eax
    addl $1, %ebx
    addl $1, %eax
    addl $1, %ebx
    # the pair of lines above is copied until there are `n` lines total being timed
    addl $1, %eax
    addl $1, %ebx
    rdtsc
    subl %r12d, %eax

                  elapsed reference      ref cycles        estimated actual   actual cycles
    instructions  cycles between rdtsc   per instruction   cycles             per instruction
        1000             432                 0.43                502               0.50
        1200             510                 0.42                593               0.49
        1400             601                 0.43                699               0.50
        1600             695                 0.43                808               0.51
        1800             773                 0.43                899               0.50
        2000             864                 0.43               1005               0.50
        2200             955                 0.43               1110               0.50

Question: Why does it matter which register I use when trying to run 3 instructions in parallel?

When I try to run adds to eax, ebx, and ecx in parallel, the CPI is higher than the expected .33:

    mov $0, %eax
    cpuid
    rdtsc
    movl %eax, %r12d
    addl $1, %eax
    addl $1, %ebx
    addl $1, %ecx
    addl $1, %eax
    addl $1, %ebx
    addl $1, %ecx
    # the group of lines above is copied until there are `n` lines total being timed
    addl $1, %eax
    addl $1, %ebx
    addl $1, %ecx
    rdtsc
    subl %r12d, %eax

                  elapsed reference      ref cycles        estimated actual   actual cycles
    instructions  cycles between rdtsc   per instruction   cycles             per instruction
        1200             408                 0.34                474               0.40
        1500             492                 0.33                572               0.38
        1800             595                 0.33                692               0.38
        2100             698                 0.33                812               0.39
        2400             782                 0.33                909               0.38
        2700             885                 0.33               1029               0.38
        3000             988                 0.33               1149               0.38
        3300            1091                 0.33               1269               0.38
        3600            1178                 0.33               1370               0.38

However, I get the expected result if I use r9d, r10d, and r11d:

                  elapsed reference      ref cycles        estimated actual   actual cycles
    instructions  cycles between rdtsc   per instruction   cycles             per instruction
        1200             350                 0.29                407               0.34
        1500             444                 0.30                516               0.34
        1800             519                 0.29                603               0.34
        2100             613                 0.29                713               0.34
        2400             707                 0.29                822               0.34
        2700             782                 0.29                909               0.34
        3000             876                 0.29               1019               0.34

In fact, I get the expected result as long as at most two of the three registers come from the set eax, ebx, ecx, and edx. Why is that?
Any idea whether the bottleneck is in issue, decode, register renaming, or retirement? I observed this behavior on both an i7-3770 and an i7-4790. For what it's worth: both a Ryzen 7 and an i5-6500 always have CPIs of .38 to .40, regardless of the registers used.

The code

For those who are curious, here is the template for the code I use:

    .file "timestamp_shell.c"
    .text
    .section .rodata
    .align 8
    .LC0:
    .string "%8d; Start %10u; Stop %10u; Difference %5d\n"
    .text
    .globl main
    .type main, @function
    main:
    .LFB0:
    .cfi_startproc
    pushq %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq %rsp, %rbp
    .cfi_def_cfa_register 6
    pushq %r13
    pushq %r12
    pushq %rbx
    subq $8, %rsp
    .cfi_offset 13, -24
    .cfi_offset 12, -32
    .cfi_offset 3, -40
    movl $100, %r12d
    movl $200, %r13d
    movl $-1, %r8d
    movl $0, %r8d
    jmp .L2
    .L3:
    mov $0, %eax
    cpuid
    rdtsc
    movl %eax, %r12d
    movl $0, %eax
    # I use a perl script to copy the lines marked with ## until there
    # is the desired number of instructions between the calls to rdtsc
    ## addl $1, %eax
    ## addl $1, %r10d
    ## addl $1, %ecx
    rdtsc
    subl %r12d, %eax
    movl %eax, %r8d
    movl %r13d, %ecx
    movl %r12d, %edx
    movl %r8d, %esi
    leaq .LC0(%rip), %rdi
    movl $0, %eax
    call printf@PLT
    addl $1, %r8d
    .L2:
    cmpl $999999, %r8d
    jle .L3
    movl $199, %eax
    addq $8, %rsp
    popq %rbx
    popq %r12
    popq %r13
    popq %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
    .LFE0:
    .size main, .-main
    .ident "GCC: (GNU) 8.2.1 20181127"
    .section .note.GNU-stack,"",@progbits
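The chain-count effect the experiment measures — independent chains overlapping where a single dependent chain cannot — can be sketched in plain C. This is only an analogue: `volatile` forces every add through memory, so absolute cycle counts are inflated, and since the compiler chooses the registers it cannot reproduce the "original" vs. "new" register anomaly itself.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc */

/* n additions forming one dependent chain. */
static uint64_t one_chain(long n)
{
    volatile long a = 0;
    uint64_t t0 = __rdtsc();
    for (long i = 0; i < n; i++)
        a += 1;
    return __rdtsc() - t0;
}

/* The same n additions split across three independent accumulators;
   the three store/reload dependency chains can overlap in the
   out-of-order core, so the total time drops roughly threefold. */
static uint64_t three_chains(long n)
{
    volatile long a = 0, b = 0, c = 0;
    uint64_t t0 = __rdtsc();
    for (long i = 0; i < n; i += 3) {
        a += 1;
        b += 1;
        c += 1;
    }
    return __rdtsc() - t0;
}
```

Comparing `one_chain(N)` against `three_chains(N)` for the same N shows the throughput-vs-latency gap that the assembly tables above quantify precisely.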
Performance difference between two seemingly equivalent assembly codes
tl;dr: I have two functionally equivalent C codes that I compile with Clang (the fact that it's C code doesn't matter much; only the assembly is interesting, I think), and IACA tells me that one should be faster, but I don't understand why, and my benchmarks show the same performance for the two codes.

I have the following C code (ignore #include "iacaMarks.h", IACA_START, IACA_END for now):

ref.c:

    #include "iacaMarks.h"
    #include <x86intrin.h>

    #define AND(a,b)  _mm_and_si128(a,b)
    #define OR(a,b)   _mm_or_si128(a,b)
    #define XOR(a,b)  _mm_xor_si128(a,b)
    #define NOT(a)    _mm_andnot_si128(a,_mm_set1_epi32(-1))

    void sbox_ref (__m128i r0,__m128i r1,__m128i r2,__m128i r3,
                   __m128i* r5,__m128i* r6,__m128i* r7,__m128i* r8) {
      __m128i r4;
      IACA_START
      r3 = XOR(r3,r0);
      r4 = r1;
      r1 = AND(r1,r3);
      r4 = XOR(r4,r2);
      r1 = XOR(r1,r0);
      r0 = OR(r0,r3);
      r0 = XOR(r0,r4);
      r4 = XOR(r4,r3);
      r3 = XOR(r3,r2);
      r2 = OR(r2,r1);
      r2 = XOR(r2,r4);
      r4 = NOT(r4);
      r4 = OR(r4,r1);
      r1 = XOR(r1,r3);
      r1 = XOR(r1,r4);
      r3 = OR(r3,r0);
      r1 = XOR(r1,r3);
      r4 = XOR(r4,r3);
      *r5 = r1;
      *r6 = r4;
      *r7 = r2;
      *r8 = r0;
      IACA_END
    }

I was wondering if I could optimize it by manually rescheduling a few instructions (I am well aware that the C compiler should produce an efficient scheduling, but my experiments have shown that it's not always the case).
At some point, I tried the following code (it's the same as above, except that no temporary variables are used to store the results of the XORs that are later assigned to *r5 and *r6):

resched.c:

    #include "iacaMarks.h"
    #include <x86intrin.h>

    #define AND(a,b)  _mm_and_si128(a,b)
    #define OR(a,b)   _mm_or_si128(a,b)
    #define XOR(a,b)  _mm_xor_si128(a,b)
    #define NOT(a)    _mm_andnot_si128(a,_mm_set1_epi32(-1))

    void sbox_resched (__m128i r0,__m128i r1,__m128i r2,__m128i r3,
                       __m128i* r5,__m128i* r6,__m128i* r7,__m128i* r8) {
      __m128i r4;
      IACA_START
      r3 = XOR(r3,r0);
      r4 = r1;
      r1 = AND(r1,r3);
      r4 = XOR(r4,r2);
      r1 = XOR(r1,r0);
      r0 = OR(r0,r3);
      r0 = XOR(r0,r4);
      r4 = XOR(r4,r3);
      r3 = XOR(r3,r2);
      r2 = OR(r2,r1);
      r2 = XOR(r2,r4);
      r4 = NOT(r4);
      r4 = OR(r4,r1);
      r1 = XOR(r1,r3);
      r1 = XOR(r1,r4);
      r3 = OR(r3,r0);
      *r7 = r2;
      *r8 = r0;
      *r5 = XOR(r1,r3); // These two lines are different
      *r6 = XOR(r4,r3); // (no more temporary variables)
      IACA_END
    }

I'm compiling these codes using Clang 5.0.0 targeting my i5-6500 (Skylake), with the flags -O3 -march=native (I'm omitting the assembly code produced, as it can be found in the IACA outputs below, but if you'd prefer to have it directly here, ask me and I'll add it). I benchmarked those two codes and didn't find any performance difference between them. Out of curiosity, I ran IACA on them, and I was surprised to see that it said that the first version should take 6 cycles to run, and the second version 7 cycles.
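For benchmarking these as compiled functions (rather than relying on IACA's static estimate), one hedged approach is to feed each call's outputs back into its inputs, so the loop carries a real dependency and the compiler cannot delete the work. `bench_sbox` is my own harness name; `sbox_ref` below is the function from ref.c with the IACA markers removed:

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>

#define AND(a,b)  _mm_and_si128(a,b)
#define OR(a,b)   _mm_or_si128(a,b)
#define XOR(a,b)  _mm_xor_si128(a,b)
#define NOT(a)    _mm_andnot_si128(a,_mm_set1_epi32(-1))

/* sbox_ref from ref.c, minus the IACA markers. */
static void sbox_ref(__m128i r0, __m128i r1, __m128i r2, __m128i r3,
                     __m128i *r5, __m128i *r6, __m128i *r7, __m128i *r8)
{
    __m128i r4;
    r3 = XOR(r3,r0); r4 = r1;        r1 = AND(r1,r3); r4 = XOR(r4,r2);
    r1 = XOR(r1,r0); r0 = OR(r0,r3); r0 = XOR(r0,r4); r4 = XOR(r4,r3);
    r3 = XOR(r3,r2); r2 = OR(r2,r1); r2 = XOR(r2,r4); r4 = NOT(r4);
    r4 = OR(r4,r1);  r1 = XOR(r1,r3); r1 = XOR(r1,r4); r3 = OR(r3,r0);
    r1 = XOR(r1,r3); r4 = XOR(r4,r3);
    *r5 = r1; *r6 = r4; *r7 = r2; *r8 = r0;
}

/* Chain outputs back into inputs and return cycles per call. */
static double bench_sbox(long iters)
{
    __m128i a = _mm_set1_epi32(0x12345678), b = _mm_set1_epi32(0x0abcdef0),
            c = _mm_set1_epi32(0x0f0f0f0f), d = _mm_set1_epi32(0x33333333);
    uint64_t t0 = __rdtsc();
    for (long i = 0; i < iters; i++)
        sbox_ref(a, b, c, d, &a, &b, &c, &d);
    uint64_t t1 = __rdtsc();
    volatile int sink = _mm_cvtsi128_si32(a);  /* keep the result live */
    (void)sink;
    return (double)(t1 - t0) / (double)iters;
}
```

Note that this measures the latency of back-to-back dependent S-boxes, which is a different quantity from the block throughput IACA reports; the result also depends on whether the compiler inlines the call.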
Here is the output produced by IACA.

For the first version:

    dada@dada-ubuntu ~/perf % clang -O3 -march=native -c ref.c && ./iaca -arch SKL ref.o
    Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-23;16:42:45
    Analyzed File - ref_iaca.o
    Binary Format - 64Bit
    Architecture  - SKL
    Analysis Type - Throughput

    Throughput Analysis Report
    --------------------------
    Block Throughput: 6.00 Cycles       Throughput Bottleneck: FrontEnd
    Loop Count:  23
    Port Binding In Cycles Per Iteration:
    --------------------------------------------------------------------------------------------------
    |  Port  |   0 - DV    |   1   |   2 - D     |   3 - D     |   4   |   5   |   6   |   7   |
    --------------------------------------------------------------------------------------------------
    | Cycles |  6.0    0.0 |  6.0  |  1.3    0.0 |  1.4    0.0 |  4.0  |  6.0  |  0.0  |  1.4  |
    --------------------------------------------------------------------------------------------------

    DV - Divider pipe (on port 0)
    D  - Data fetch pipe (on ports 2 and 3)
    F  - Macro Fusion with the previous instruction occurred
    *  - instruction micro-ops not bound to a port
    ^  - Micro Fusion occurred
    #  - ESP Tracking sync uop was issued
    @  - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
    X  - instruction not supported, was not accounted in Analysis

    | Num Of |              Ports pressure in cycles                    |
    |  Uops  | 0 - DV | 1    | 2 - D  | 3 - D  |  4   |  5   |  6  |  7   |
    -----------------------------------------------------------------------------------------
    |   1    | 1.0    |      |        |        |      |      |     |      | vpxor xmm4, xmm3, xmm0
    |   1    |        | 1.0  |        |        |      |      |     |      | vpand xmm5, xmm4, xmm1
    |   1    |        |      |        |        |      | 1.0  |     |      | vpxor xmm1, xmm2, xmm1
    |   1    | 1.0    |      |        |        |      |      |     |      | vpxor xmm5, xmm5, xmm0
    |   1    |        | 1.0  |        |        |      |      |     |      | vpor xmm0, xmm3, xmm0
    |   1    |        |      |        |        |      | 1.0  |     |      | vpxor xmm0, xmm0, xmm1
    |   1    | 1.0    |      |        |        |      |      |     |      | vpxor xmm1, xmm4, xmm1
    |   1    |        | 1.0  |        |        |      |      |     |      | vpxor xmm3, xmm4, xmm2
    |   1    |        |      |        |        |      | 1.0  |     |      | vpor xmm2, xmm5, xmm2
    |   1    | 1.0    |      |        |        |      |      |     |      | vpxor xmm2, xmm2, xmm1
    |   1    |        | 1.0  |        |        |      |      |     |      | vpcmpeqd xmm4, xmm4, xmm4
    |   1    |        |      |        |        |      | 1.0  |     |      | vpxor xmm1, xmm1, xmm4
    |   1    | 1.0    |      |        |        |      |      |     |      | vpor xmm1, xmm5, xmm1
    |   1    |        | 1.0  |        |        |      |      |     |      | vpxor xmm4, xmm5, xmm3
    |   1    |        |      |        |        |      | 1.0  |     |      | vpor xmm3, xmm0, xmm3
    |   1    | 1.0    |      |        |        |      |      |     |      | vpxor xmm4, xmm4, xmm3
    |   1    |        | 1.0  |        |        |      |      |     |      | vpxor xmm4, xmm4, xmm1
    |   1    |        |      |        |        |      | 1.0  |     |      | vpxor xmm1, xmm1, xmm3
    |   2^   |        |      | 0.3    | 0.3    | 1.0  |      |     | 0.3  | vmovdqa xmmword ptr [rdi], xmm4
    |   2^   |        |      | 0.3    | 0.3    | 1.0  |      |     | 0.3  | vmovdqa xmmword ptr [rsi], xmm1
    |   2^   |        |      | 0.3    | 0.3    | 1.0  |      |     | 0.3  | vmovdqa xmmword ptr [rdx], xmm2
    |   2^   |        |      | 0.3    | 0.3    | 1.0  |      |     | 0.3  | vmovdqa xmmword ptr [rcx], xmm0
    Total Num Of Uops: 26

For the second version:

    dada@dada-ubuntu ~/perf % clang -O3 -march=native -c resched.c && ./iaca -arch SKL resched.o
    Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-23;16:42:45
    Analyzed File - resched_iaca.o
    Binary Format - 64Bit
    Architecture  - SKL
    Analysis Type - Throughput

    Throughput Analysis Report
    --------------------------
    Block Throughput: 7.00 Cycles       Throughput Bottleneck: Backend
    Loop Count:  22
    Port Binding In Cycles Per Iteration:
    --------------------------------------------------------------------------------------------------
    |  Port  |   0 - DV    |   1   |   2 - D     |   3 - D     |   4   |   5   |   6   |   7   |
    --------------------------------------------------------------------------------------------------
    | Cycles |  6.0    0.0 |  6.0  |  1.3    0.0 |  1.4    0.0 |  4.0  |  6.0  |  0.0  |  1.3  |
    --------------------------------------------------------------------------------------------------

    DV - Divider pipe (on port 0)
    D  - Data fetch pipe (on ports 2 and 3)
    F  - Macro Fusion with the previous instruction occurred
    *  - instruction micro-ops not bound to a port
    ^  - Micro Fusion occurred
    #  - ESP Tracking sync uop was issued
    @  - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
    X  - instruction not supported, was not accounted in Analysis

    | Num Of |              Ports pressure in cycles                    |
    |  Uops  | 0 - DV | 1    | 2 - D  | 3 - D  |  4   |  5   |  6  |  7   |
    -----------------------------------------------------------------------------------------
    |   1    | 1.0    |      |        |        |      |      |     |      | vpxor xmm4, xmm3, xmm0
    |   1    |        | 1.0  |        |        |      |      |     |      | vpand xmm5, xmm4, xmm1
    |   1    |        |      |        |        |      | 1.0  |     |      | vpxor xmm1, xmm2, xmm1
    |   1    | 1.0    |      |        |        |      |      |     |      | vpxor xmm5, xmm5, xmm0
    |   1    |        | 1.0  |        |        |      |      |     |      | vpor xmm0, xmm3, xmm0
    |   1    |        |      |        |        |      | 1.0  |     |      | vpxor xmm0, xmm0, xmm1
    |   1    | 1.0    |      |        |        |      |      |     |      | vpxor xmm1, xmm4, xmm1
    |   1    |        | 1.0  |        |        |      |      |     |      | vpxor xmm3, xmm4, xmm2
    |   1    |        |      |        |        |      | 1.0  |     |      | vpor xmm2, xmm5, xmm2
    |   1    | 1.0    |      |        |        |      |      |     |      | vpxor xmm2, xmm2, xmm1
    |   1    |        | 1.0  |        |        |      |      |     |      | vpcmpeqd xmm4, xmm4, xmm4
    |   1    |        |      |        |        |      | 1.0  |     |      | vpxor xmm1, xmm1, xmm4
    |   1    | 1.0    |      |        |        |      |      |     |      | vpor xmm1, xmm5, xmm1
    |   1    |        | 1.0  |        |        |      |      |     |      | vpxor xmm4, xmm5, xmm3
    |   1    |        |      |        |        |      | 1.0  |     |      | vpor xmm3, xmm0, xmm3
    |   2^   |        |      | 0.3    | 0.4    | 1.0  |      |     | 0.3  | vmovdqa xmmword ptr [rdx], xmm2
    |   2^   |        |      | 0.3    | 0.3    | 1.0  |      |     | 0.4  | vmovdqa xmmword ptr [rcx], xmm0
    |   1    | 1.0    |      |        |        |      |      |     |      | vpxor xmm0, xmm4, xmm3
    |   1    |        | 1.0  |        |        |      |      |     |      | vpxor xmm0, xmm0, xmm1
    |   2^   |        |      | 0.4    | 0.3    | 1.0  |      |     | 0.3  | vmovdqa xmmword ptr [rdi], xmm0
    |   1    |        |      |        |        |      | 1.0  |     |      | vpxor xmm0, xmm1, xmm3
    |   2^   |        |      | 0.3    | 0.4    | 1.0  |      |     | 0.3  | vmovdqa xmmword ptr [rsi], xmm0
    Total Num Of Uops: 26
    Analysis Notes:
    Backend allocation was stalled due to unavailable allocation resources.

As you can see, on the second version, IACA says that the bottleneck is the backend and that "Backend allocation was stalled due to unavailable allocation resources". Both assembly codes contain the same instructions, and the only differences are the scheduling of the last 7 instructions, as well as the registers they use. The only thing I can think of that would explain why the second code is slower is the fact that it writes xmm0 twice in the last 4 instructions, thus introducing a dependency.
But since those writes are independent, I would expect the CPU to use different physical registers for them. However, I can't really prove that theory. Also, if using xmm0 twice like that were an issue, I would expect Clang to use a different register for one of the instructions (in particular since the register pressure here is low).

My question: is the second code supposed to be slower (based on the assembly code), and why?

Edit: IACA traces:
First version: https://pastebin.com/qGXHVW6a
Second version: https://pastebin.com/dbBNWsc2

Note: the C codes are implementations of the Serpent cipher's first S-box, computed by Osvik here.
Figuring out why the second code is backend-bound requires some manual analysis, because the output emitted by IACA, although extremely rich in information, is too raw. Note that the traces emitted by IACA are particularly useful for analyzing loops. They can also be useful for understanding how straight-line sequences of instructions get executed (which is less interesting), but the emitted traces need to be interpreted differently. Throughout the rest of this answer, I will present my analysis for the loop scenario, which is more difficult to do.

The fact that you emitted the traces without putting the code in a loop affects the following things:

- The compiler couldn't inline and optimize away the stores to the output operands. They wouldn't appear at all in a real loop, or when chaining this to a different S-box.
- The data dependencies from outputs to inputs happen by coincidence, because the compiler used xmm0..3 to prepare data to be stored, not as a consequence of choosing which output to feed back into which input of the same S-box.
- The vpcmpeqd that creates an all-ones vector (for NOT) would be hoisted out of the loop after inlining.
- There would be a dec/jnz or equivalent loop overhead (which can be macro-fused into a single uop for port 6).

But you've asked IACA to analyze this exact block of asm as if it were run in a loop. So, to explain the results, that's how we'll think of it (even though it's not what you'd get from a C compiler if you used this function in a loop). A jmp or dec/jnz at the bottom to make this a loop is not a problem in this case: it will always be dispatched to port 6, which is not used by any vector instruction, so the jump will not contend with the vector uops for execution ports or consume dispatch bandwidth they need.
However, it can impact the allocator bandwidth in the issue/rename stage (which is at most 4 fused-domain uops per cycle), but this is not important in this particular case, as I will discuss. Let's first examine the port pressure ASCII figure:

    | Num Of |              Ports pressure in cycles                    |
    |  Uops  | 0 - DV | 1    | 2 - D  | 3 - D  |  4   |  5   |  6  |  7   |
    -----------------------------------------------------------------------------------------
    |   1    | 1.0    |      |        |        |      |      |     |      | vpxor xmm4, xmm3, xmm0
    |   1    |        | 1.0  |        |        |      |      |     |      | vpand xmm5, xmm4, xmm1
    |   1    |        |      |        |        |      | 1.0  |     |      | vpxor xmm1, xmm2, xmm1
    |   1    | 1.0    |      |        |        |      |      |     |      | vpxor xmm5, xmm5, xmm0
    |   1    |        | 1.0  |        |        |      |      |     |      | vpor xmm0, xmm3, xmm0
    |   1    |        |      |        |        |      | 1.0  |     |      | vpxor xmm0, xmm0, xmm1
    |   1    | 1.0    |      |        |        |      |      |     |      | vpxor xmm1, xmm4, xmm1
    |   1    |        | 1.0  |        |        |      |      |     |      | vpxor xmm3, xmm4, xmm2
    |   1    |        |      |        |        |      | 1.0  |     |      | vpor xmm2, xmm5, xmm2
    |   1    | 1.0    |      |        |        |      |      |     |      | vpxor xmm2, xmm2, xmm1
    |   1    |        | 1.0  |        |        |      |      |     |      | vpcmpeqd xmm4, xmm4, xmm4
    |   1    |        |      |        |        |      | 1.0  |     |      | vpxor xmm1, xmm1, xmm4
    |   1    | 1.0    |      |        |        |      |      |     |      | vpor xmm1, xmm5, xmm1
    |   1    |        | 1.0  |        |        |      |      |     |      | vpxor xmm4, xmm5, xmm3
    |   1    |        |      |        |        |      | 1.0  |     |      | vpor xmm3, xmm0, xmm3
    |   2^   |        |      | 0.3    | 0.4    | 1.0  |      |     | 0.3  | vmovdqa xmmword ptr [rdx], xmm2
    |   2^   |        |      | 0.3    | 0.3    | 1.0  |      |     | 0.4  | vmovdqa xmmword ptr [rcx], xmm0
    |   1    | 1.0    |      |        |        |      |      |     |      | vpxor xmm0, xmm4, xmm3
    |   1    |        | 1.0  |        |        |      |      |     |      | vpxor xmm0, xmm0, xmm1
    |   2^   |        |      | 0.4    | 0.3    | 1.0  |      |     | 0.3  | vmovdqa xmmword ptr [rdi], xmm0
    |   1    |        |      |        |        |      | 1.0  |     |      | vpxor xmm0, xmm1, xmm3
    |   2^   |        |      | 0.3    | 0.4    | 1.0  |      |     | 0.3  | vmovdqa xmmword ptr [rsi], xmm0

The total number of fused-domain uops is 22. Six different uops have been assigned to each of ports 0, 1, and 5. The other 4 uops each consist of an STA and an STD uop; the STD requires port 4. This assignment is reasonable. If we ignore all data dependencies, it appears the scheduler should be able to dispatch at least 3 fused-domain uops every cycle.
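The per-resource lower bounds implied by these counts can be written out as a small worked check. The numbers are taken from the table and the allocator width quoted above; the function names are mine:

```c
#include <assert.h>

/* Cycles-per-iteration lower bounds implied by the uop counts:
   - 18 single-uop vector ALU instructions share ports 0, 1, and 5
   - 4 store-data (STD) uops all need port 4
   - 22 fused-domain uops pass through the 4-wide allocator */
static double alu_port_bound(void)   { return 18.0 / 3.0; }  /* 6.0 cycles */
static double store_port_bound(void) { return  4.0 / 1.0; }  /* 4.0 cycles */
static double frontend_bound(void)   { return 22.0 / 4.0; }  /* 5.5 cycles */
```

So with perfect scheduling the block could issue in 5.5 cycles and execute in 6; IACA's 7-cycle result for this version therefore has to come from dependence-induced scheduling losses rather than from the raw port counts alone.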
However, there can be serious contention at port 4, which may lead to filling up the reservation station. According to IACA, that is not the bottleneck in this code. Note that if the scheduler could somehow achieve a throughput equal to the maximum throughput of the allocator, then the code could only be frontend-bound. Obviously, this is not the case here.

The next step is to carefully examine the IACA trace. I made the following data-flow graph based on the trace, which is easier to analyze. The horizontal yellow lines divide the graph according to which uops get allocated in the same cycle. Note that IACA always assumes perfect branch prediction. Also note that this division is about 99% accurate, but not 100%; this is not important, and you can just consider it 100% accurate. The nodes represent fused uops and the arrows represent data dependences (an arrow points to the destination uop). Nodes are colored depending on which loop iteration they belong to. The sources of the arrows at the top of the graph are omitted for clarity. The green boxes on the right contain the cycle number at which allocation is performed for the corresponding uops. So the previous cycle is X, and the current cycle is X + 1, whatever X is. The stop signs indicate that the associated uop suffers contention at one of the ports. All the red stop signs represent contention on port 1. There is only one stop sign of a different color, which represents contention on port 5. There are other cases of contention, but I'll omit them for clarity. Arrows come in two colors: blue and red. The red ones are the critical ones. Note that it takes 11 cycles to allocate 2 iterations' worth of instructions, and then the allocation pattern repeats. Keep in mind that Skylake has 97 RS entries. The location of a node within each division (the "local" location) has a meaning.
If two nodes are on the same row, and if all of their operands are available, then they can be dispatched in the same cycle. Otherwise, if the nodes are not on the same row, they may not be dispatched in the same cycle. This only applies to dynamic uops that have been allocated together as a group, and not to dynamic uops allocated as part of different groups, even if they happen to be in the same division in the graph. I'll use the notation (it, in) to identify a specific fused uop, where it is a zero-based loop iteration number and in is a zero-based uop number. The most important part of the IACA trace is the one that shows the pipeline stages for (11, 5):

    11| 5|vpxor xmm0, xmm0, xmm1 :  |  |  |  |  |  |  |  |  |  |  |  |  |  |
    11| 5|    TYPE_OP (1 uops)   :  |  |  |  |  |  |_A--------------------dw----R-------p  |  |  |  |  |

This tells us that the allocation bandwidth is underutilized at this point due to unavailable resources (in this case, an entry in the reservation station). This means that the scheduler was not able to sustain a high enough throughput of unfused uops to keep up with the front-end's 4 fused uops per cycle. Since IACA has already told us that the code is backend-bound, the reason for this underutilization is obviously not some long dependency chain or contention at specific execution units, but rather something more complicated. So we need to do more work to figure out what's going on. We have to analyze past (11, 5).

The uops 1, 4, 7, 10, 13, 18 of every iteration are all assigned to port 1. What happens during a period of 11 cycles? There are a total of 12 uops that require port 1, so it's impossible to dispatch all of them in 11 cycles; it will take at least 12 cycles. Unfortunately, data dependencies within the uops that require the same port and across uops that require other ports exacerbate the problem significantly.
Consider the following pipeline flow during an 11-cycle period:

At cycle 0: (0, 0) and (0, 1) get allocated (along with other uops that we don't care about right now). (0, 1) is data-dependent on (0, 0).
1: (0, 4) and (0, 7) get allocated. Assuming that no older, ready uop is assigned to port 0 and that the operands of (0, 0) are ready, dispatch (0, 0) to port 0. Port 1 potentially remains idle because (0, 1) is not ready yet.
2: The result of (0, 0) is available through the bypass network. At this point, (0, 1) can and will be dispatched. However, even if (0, 4) or (0, 7) are ready, neither is the oldest uop assigned to port 1, so both get blocked. (0, 10) gets allocated.
3: (0, 4) is dispatched to port 1. (0, 7) and (0, 10) both get blocked even if their operands are ready. (0, 13) gets allocated.
4: (0, 7) is dispatched to port 1. (0, 10) gets blocked. (0, 13) has to wait for (0, 7). (0, 18) gets allocated.
5: (0, 10) is dispatched to port 1. (0, 13) gets blocked. (0, 18) has to wait for (0, 17), which depends on (0, 13). (1, 0) and (1, 1) get allocated.
6: (0, 13) is dispatched to port 1. (0, 18) has to wait for (0, 17), which depends on (0, 13). (1, 1) has to wait for (1, 0). (1, 0) cannot be dispatched because the distance between (1, 0) and (0, 7) is 3 uops, one of which may suffer a port conflict. (1, 4) gets allocated.
7: Nothing gets dispatched to port 1 because (0, 18), (1, 1), and (1, 4) are not ready. (1, 7) gets allocated.
8: Nothing gets dispatched to port 1 because (0, 18), (1, 1), (1, 4), and (1, 7) are not ready. (1, 10) and (1, 13) get allocated.
9: (0, 18) is dispatched to port 1. (1, 10) and (1, 4) are ready but get blocked due to port contention. (1, 1), (1, 7), and (1, 13) are not ready.
10: (1, 1) is dispatched to port 1. (1, 4), (1, 7), and (1, 10) are ready but get blocked due to port contention. (1, 13) is not ready. (1, 18) gets allocated.

Well, ideally, we'd like 11 of the 12 uops to be dispatched to port 1 in 11 cycles.
But this analysis shows that the situation is far from ideal. Port 1 is idle for 4 out of the 11 cycles! If we assume that some (X, 18) from a previous iteration gets dispatched at cycle 0, then port 1 would be idle for 3 cycles, which is still a lot of waste, considering that we have 12 uops that require it every 11 cycles. Out of the 12 uops, only up to 8 got dispatched. How bad can the situation get? We can continue analyzing the trace and record the number of p1-bound uops that are either ready to be dispatched but blocked due to a port conflict, or not ready due to data dependencies. I was able to determine that the number of p1-bound uops stalled due to port conflicts is never larger than 3. However, the number of p1-bound uops stalled due to data dependencies increases gradually over time. I did not see any pattern in the way it increases, so I decided to use linear regression on the first 24 cycles of the trace to predict at what point there would be 97 such uops. The following figure shows that. The x-axis represents the zero-based cycles, increasing from left to right. Note that the number of uops is zero for the first 4 cycles. The y-axis represents the number of such uops at the corresponding cycle. The linear regression equation is:

    y = 0.3624x - 0.6925

By setting y to 97 we get:

    x = (97 + 0.6925) / 0.3624 = 269.57

That is, at about cycle 269, we expect that there are 97 uops in the RS, all p1-bound and waiting for their operands to become ready. It is at this point that the RS is full. However, there can be other uops waiting in the RS for other reasons. So we expect the allocator to underutilize its bandwidth at or before cycle 269. By looking at the IACA trace for instruction (11, 5), we can see that this situation occurs at cycle 61, which is much earlier than 269. This means that either my predictor is very optimistic, or that the counts of uops bound to other ports exhibit similar behavior. My gut tells me it's the latter.
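The extrapolation step above is mechanical enough to reproduce. The sketch below (my own helper name) fits the least-squares line to (cycle, stalled-uop-count) samples and solves it for the 97-entry RS capacity:

```c
#include <assert.h>

/* Ordinary least-squares fit y = a*x + b over n samples, then solve
   a*x + b = capacity for x: the cycle at which the fitted count of
   stalled p1-bound uops reaches the RS capacity. */
static double cycles_until_full(const double *x, const double *y, int n,
                                double capacity)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (int i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    double a = (n * sxy - sx * sy) / (n * sxx - sx * sx);   /* slope     */
    double b = (sy - a * sx) / n;                           /* intercept */
    return (capacity - b) / a;
}
```

Feeding it points lying on y = 0.3624x - 0.6925 returns (97 + 0.6925) / 0.3624 ≈ 269.6, matching the estimate above; with real trace samples the fit, rather than the exact line, does the work.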
But that is good enough to understand why IACA said that the code is backend-bound. You can perform a similar analysis on the first code to understand why it's frontend-bound; I'll leave that as an exercise for the reader. This manual analysis can be followed when IACA does not support a particular piece of code, or when a tool like IACA does not exist for a particular microarchitecture. The linear regression model enables us to estimate after how many iterations the allocator underutilizes its bandwidth. In this case, since it takes 11 cycles to allocate 2 iterations, cycle 269 corresponds to about iteration 269/(11/2) ≈ 49. So as long as the maximum number of iterations is not much larger than that, the backend performance of the loop would be less of an issue.

There is a related post by @Bee: How are x86 uops scheduled, exactly?. I may post the details of what happens during the first 24 cycles later.

Side note: There are two errors in Wikichip's article on Skylake. First, Broadwell's scheduler has 60 entries, not 64. Second, the allocator's throughput is up to 4 fused uops only.
"I benchmarked those two codes and didn't find any performance difference between them."

I did the same thing on my Skylake i7-6700k, actually benchmarking what you told IACA to analyze, by taking that asm and slapping a dec ebp / jnz .loop around it. I found sbox_ref runs at ~7.50 cycles per iteration, while sbox_resched runs at ~8.04 c/iter, tested in a static executable on Linux with performance counters. (See "Can x86's MOV really be 'free'? Why can't I reproduce this at all?" for details of my test methodology.) IACA's numbers are wrong, but it is correct that sbox_resched is slower.

Hadi's analysis appears correct: the dependency chains in the asm are long enough that any resource conflicts in uop scheduling will cause the back-end to lose throughput that it can never catch up from.

Presumably you benchmarked by letting a C compiler inline that function into a loop, with local variables for the output operands. That will change the asm significantly (these are the reverse of the bullet points I edited into @Hadi's answer before writing my own):

- Instead of happening by accident as the compiler uses xmm0..3 as scratch registers late in the function, the data dependencies from outputs to inputs are visible to the compiler, so it can schedule appropriately. Your source code will choose which output to feed back into which input of the same S-box. Or the deps don't exist at all (if you use constant inputs and keep the loop from optimizing away using volatile or an empty inline asm statement).
- The stores to the output operands optimize away, as would happen for real if chaining this to a different S-box.
- The vpcmpeqd that creates an all-ones vector (for NOT) would be hoisted out of the loop after inlining.

As Hadi says, the 1-uop macro-fused dec/jnz loop overhead doesn't compete for vector ALUs, so it itself isn't important. What is critically important is that slapping an asm loop around something the compiler didn't optimize as a loop body unsurprisingly gives silly results.
How to make GCC generate vector instructions as ICC does?
I've been using ICC on my project, and ICC utilizes vector instructions very well. Recently I tried to compile the same code with GCC (version 5.5); however, on some modules GCC's version is 10 times slower than ICC's. This happens when I do complex multiplies etc. A sample looks like this.

Definitions:

float *ptr1 = _mm_malloc(1280, 64);
float *ptr2 = _mm_malloc(1280, 64);
float complex *realptr1 = (float complex *)&ptr1[storageOffset];
float complex *realptr2 = (float complex *)&ptr2[storageOffset];

Pragmas and the loop:

__assume_aligned(realptr1, 64);
__assume_aligned(realptr2, 64);
#pragma ivdep
#pragma vector aligned
for (j = 0; j < 512; j++) {
    float complex derSlot0 = realptr1[j] * realptr2[j];
    float complex derSlot1 = realptr1[j] + realptr2[j];
    realptr1[j] = derSlot0;
    realptr2[j] = derSlot1;
}

ICC's compiled result for the major loop:

..B1.6:                          # Preds ..B1.6 ..B1.5
                                 # Execution count [5.12e+02]
        vmovups   32(%r15,%rdx,8), %ymm9          #35.29
        lea       (%r15,%rdx,8), %rax             #37.5
        vmovups   (%rax), %ymm3                   #35.29
        vaddps    32(%rbx,%rdx,8), %ymm9, %ymm11  #36.43
        vaddps    (%rbx,%rdx,8), %ymm3, %ymm5     #36.43
        vmovshdup 32(%rbx,%rdx,8), %ymm6          #35.43
        vshufps   $177, %ymm9, %ymm9, %ymm7       #35.43
        vmulps    %ymm7, %ymm6, %ymm8             #35.43
        vmovshdup (%rbx,%rdx,8), %ymm0            #35.43
        vshufps   $177, %ymm3, %ymm3, %ymm1       #35.43
        vmulps    %ymm1, %ymm0, %ymm2             #35.43
        vmovsldup 32(%rbx,%rdx,8), %ymm10         #35.43
        vfmaddsub213ps %ymm8, %ymm9, %ymm10       #35.43
        vmovups   %ymm11, 32(%rbx,%rdx,8)         #38.5
        vmovups   %ymm10, 32(%rax)                #37.5
        vmovsldup (%rbx,%rdx,8), %ymm4            #35.43
        vfmaddsub213ps %ymm2, %ymm3, %ymm4        #35.43
        vmovups   %ymm5, (%rbx,%rdx,8)            #38.5
        vmovups   %ymm4, (%rax)                   #37.5
        addq      $8, %rdx                        #32.3
        cmpq      $512, %rdx                      #32.3
        jb        ..B1.6          # Prob 99%      #32.3

The command line used for icc is:

icc -march=core-avx2 -S -fsource-asm -c test.c

For GCC, what I've already done includes: replacing "#pragma ivdep" with "#pragma GCC ivdep", and replacing "__assume_aligned(realptr1, 64);" with "realptr1 = __builtin_assume_aligned(realptr1, 64);". The command
for GCC is:

gcc -c -O2 -ftree-vectorize -mavx2 -g -Wa,-a,-ad gcctest.c

and the result for the same loop is something like this:

.L7:
        vmovss  (%rbx), %xmm7
        addq    $8, %rbx
        vmovss  -4(%rbx), %xmm6
        addq    $8, %r12
        vmovss  -8(%r12), %xmm5
        vmovss  -4(%r12), %xmm4
        vmovaps %xmm7, %xmm0
        vmovaps %xmm6, %xmm1
        vmovss  %xmm4, -80(%rbp)
        vmovaps %xmm4, %xmm3
        vmovss  %xmm5, -76(%rbp)
        vmovaps %xmm5, %xmm2
        vmovss  %xmm6, -72(%rbp)
        vmovss  %xmm7, -68(%rbp)
        call    __mulsc3
        vmovss  -68(%rbp), %xmm7
        vmovss  -76(%rbp), %xmm5
        vmovss  -72(%rbp), %xmm6
        vaddss  %xmm7, %xmm5, %xmm5
        vmovss  -80(%rbp), %xmm4
        vmovq   %xmm0, -56(%rbp)
        vaddss  %xmm6, %xmm4, %xmm4
        vmovss  -52(%rbp), %xmm0
        vmovss  -56(%rbp), %xmm1
        vmovss  %xmm1, -8(%rbx)
        vmovss  %xmm0, -4(%rbx)
        vmovss  %xmm5, -8(%r12)
        vmovss  %xmm4, -4(%r12)
        cmpq    %r13, %rbx
        jne     .L7

So I can see that GCC uses some vector instructions, but it is still not as good as ICC. My question is: are there any more options I can use to make GCC perform better? Thanks a lot.
You didn't post complete code to test, but you may start by adding -ffast-math and optionally -mfma; you will end up with more or less:

vmovaps ymm0, YMMWORD PTR [rbx+rax]
vmovaps ymm3, YMMWORD PTR [r12+rax]
vpermilps ymm2, ymm0, 177
vpermilps ymm4, ymm3, 245
vpermilps ymm1, ymm3, 160
vmulps  ymm2, ymm2, ymm4
vmovaps ymm4, ymm0
vfmsub132ps ymm4, ymm2, ymm1
vfmadd132ps ymm1, ymm2, ymm0
vaddps  ymm0, ymm0, ymm3
vmovaps YMMWORD PTR [rbx+rax], ymm0
vblendps ymm1, ymm4, ymm1, 170
vmovaps YMMWORD PTR [r12+rax], ymm1
add     rax, 32
cmp     rax, 4096
jne     .L6