Julia: Struct with constant field to optimize performance

Julia: Struct with constant field to optimize performance - performance

foo is a callable example struct that will have additional fields later. Since some of these fields hold arrays with changing values it's not an option to make the instance a const G. However, the boolean value b stays constant. Is there a way to tell the compiler that this field will never change in order to enable optimizations? In this example I would like F to be as fast as G and H.
using BenchmarkTools
struct foo
b::Bool
end
function (f::foo)(x)
if f.b
x+1
end
end
F = foo(true)
const G = foo(true)
H(x) = x+1
#btime F(1) # 12.078 ns (0 allocations: 0 bytes)
#btime G(1) # 0.023 ns (0 allocations: 0 bytes)
#btime H(1) # 0.024 ns (0 allocations: 0 bytes)

You are not benchmarking the F correctly. Here is a way to do it:
julia> #btime $F(1)
0.026 ns (0 allocations: 0 bytes)
2
The problem is not that F is a global non-constant variable while G is a constant, so when accessing F its value is not stable within #btime.
Adding a $ in front of the call makes F a local variable during a benchmark and you can then see that it is equally fast.
Also it is probably better to benchmark some larger functions in such cases. Here is a quick example:
julia> function test(x)
s = 0
for i in 1:10^6
s += x(1)
end
s
end
test (generic function with 1 method)
julia> #btime test($F)
35.830 μs (0 allocations: 0 bytes)
2000000
julia> #btime test($G)
35.839 μs (0 allocations: 0 bytes)
2000000
Also you can check with #code_native that F and G end up with identical native code:
julia> #code_native F(1)
.text
; ┌ # REPL[10]:2 within `foo'
cmpb $0, (%rsi)
je L17
; │ # REPL[10]:3 within `foo'
; │┌ # int.jl:53 within `+'
addq $1, %rdx
; │└
movq %rdx, (%rdi)
movb $2, %dl
xorl %eax, %eax
retq
L17:
movb $1, %dl
xorl %eax, %eax
; │ # REPL[10]:3 within `foo'
retq
nopw %cs:(%rax,%rax)
; └
julia> #code_native G(1)
.text
; ┌ # REPL[10]:2 within `foo'
cmpb $0, (%rsi)
je L17
; │ # REPL[10]:3 within `foo'
; │┌ # int.jl:53 within `+'
addq $1, %rdx
; │└
movq %rdx, (%rdi)
movb $2, %dl
xorl %eax, %eax
retq
L17:
movb $1, %dl
xorl %eax, %eax
; │ # REPL[10]:3 within `foo'
retq
nopw %cs:(%rax,%rax)
; └

Related

Faster bitwise AND operation on byte slices

I'd like to perform bitwise AND on every column of a byte matrix, which is stored in a [][]byte in golang. I created a repo with runnable test code.
It can be simplified as a bitwise AND operation on two byte-slice of equal length. The simplest way is using for loop to handle every pair of bytes.
func and(x, y []byte) []byte {
z := make([]byte, lenght(x))
for i:= 0; i < len(x); i++ {
z[i] = x[i] & y[i]
}
return z
}
However, it's very slow for long slices. A faster way is to unroll the for loop (check the benchmark result)
BenchmarkLoop-16 14467 84265 ns/op
BenchmarkUnrollLoop-16 17668 67550 ns/op
Any faster way? Go assembly?
Thank you in advance.

I write a go assembly implementation using AVX2 instructions after two days of (go) assembly learning.
The performance is good, 10X of simple loop version. While optimizations for compatibility and performance are still needed. Suggestions and PRs are welcome.
Note: code and benchmark results are updated.
I appreciate #PeterCordes for many valuable suggestions.
#include "textflag.h"
// func AND(x []byte, y []byte)
// Requires: AVX
TEXT ·AND(SB), NOSPLIT|NOPTR, $0-48
// pointer of x
MOVQ x_base+0(FP), AX
// length of x
MOVQ x_len+8(FP), CX
// pointer of y
MOVQ y_base+24(FP), DX
// --------------------------------------------
// end address of x, will not change: p + n
MOVQ AX, BX
ADDQ CX, BX
// end address for loop
// n <= 8, jump to tail
CMPQ CX, $0x00000008
JLE tail
// n < 16, jump to loop8
CMPQ CX, $0x00000010
JL loop8_start
// n < 32, jump to loop16
CMPQ CX, $0x00000020
JL loop16_start
// --------------------------------------------
// end address for loop32
MOVQ BX, CX
SUBQ $0x0000001f, CX
loop32:
// compute x & y, and save value to x
VMOVDQU (AX), Y0
VANDPS (DX), Y0, Y0
VMOVDQU Y0, (AX)
// move pointer
ADDQ $0x00000020, AX
ADDQ $0x00000020, DX
CMPQ AX, CX
JL loop32
// n <= 8, jump to tail
MOVQ BX, CX
SUBQ AX, CX
CMPQ CX, $0x00000008
JLE tail
// n < 16, jump to loop8
CMPQ CX, $0x00000010
JL loop8_start
// --------------------------------------------
loop16_start:
// end address for loop16
MOVQ BX, CX
SUBQ $0x0000000f, CX
loop16:
// compute x & y, and save value to x
VMOVDQU (AX), X0
VANDPS (DX), X0, X0
VMOVDQU X0, (AX)
// move pointer
ADDQ $0x00000010, AX
ADDQ $0x00000010, DX
CMPQ AX, CX
JL loop16
// n <= 8, jump to tail
MOVQ BX, CX
SUBQ AX, CX
CMPQ CX, $0x00000008
JLE tail
// --------------------------------------------
loop8_start:
// end address for loop8
MOVQ BX, CX
SUBQ $0x00000007, CX
loop8:
// compute x & y, and save value to x
MOVQ (AX), BX
ANDQ (DX), BX
MOVQ BX, (AX)
// move pointer
ADDQ $0x00000008, AX
ADDQ $0x00000008, DX
CMPQ AX, CX
JL loop8
// --------------------------------------------
tail:
// left elements (<=8)
MOVQ (AX), BX
ANDQ (DX), BX
MOVQ BX, (AX)
RET
Benchmark result:
test data-size time
------------------- --------- -----------
BenchmarkGrailbio 8.00_B 4.654 ns/op
BenchmarkGoAsm 8.00_B 4.824 ns/op
BenchmarkUnrollLoop 8.00_B 6.851 ns/op
BenchmarkLoop 8.00_B 8.683 ns/op
BenchmarkGrailbio 16.00_B 5.363 ns/op
BenchmarkGoAsm 16.00_B 6.369 ns/op
BenchmarkUnrollLoop 16.00_B 10.47 ns/op
BenchmarkLoop 16.00_B 13.48 ns/op
BenchmarkGoAsm 32.00_B 6.079 ns/op
BenchmarkGrailbio 32.00_B 6.497 ns/op
BenchmarkUnrollLoop 32.00_B 17.46 ns/op
BenchmarkLoop 32.00_B 21.09 ns/op
BenchmarkGoAsm 128.00_B 10.52 ns/op
BenchmarkGrailbio 128.00_B 14.40 ns/op
BenchmarkUnrollLoop 128.00_B 56.97 ns/op
BenchmarkLoop 128.00_B 80.12 ns/op
BenchmarkGoAsm 256.00_B 15.48 ns/op
BenchmarkGrailbio 256.00_B 23.76 ns/op
BenchmarkUnrollLoop 256.00_B 110.8 ns/op
BenchmarkLoop 256.00_B 147.5 ns/op
BenchmarkGoAsm 1.00_KB 47.16 ns/op
BenchmarkGrailbio 1.00_KB 87.75 ns/op
BenchmarkUnrollLoop 1.00_KB 443.1 ns/op
BenchmarkLoop 1.00_KB 540.5 ns/op
BenchmarkGoAsm 16.00_KB 751.6 ns/op
BenchmarkGrailbio 16.00_KB 1342 ns/op
BenchmarkUnrollLoop 16.00_KB 7007 ns/op
BenchmarkLoop 16.00_KB 8623 ns/op

NASM x64 on macOS using mach_absolute_time - Trouble getting nanoseconds (working code review)

I am just learning NASM so sorry if I am making some obvious mistake but I cannot understand what I am doing wrong.
Please look into the code below and let me know what is incorrect. It compiles & runs ok, but prints garbage as a result.
I know that the info coming from mach_absolute_time is hardware dependent and so it needs to be adjusted with the info from the struct from mach_timebase_info.
I created the below test program that artificially takes 1 sec to execute.
It prints the start, end and elapsed absolute mach time info, (that curiously in my machine displays the correct amount of nanoseconds).
But the calculated nanoseconds are garbage - probably related to some error I am making with the math / use of xmm registers and data sizes but for the love of me cannot figure it out.
Thanks for the help!
Example run:
; ----------------------------------------------------------------------------------------
; Testing mach_absolute_time
; nasm -fmacho64 mach.asm && gcc -o mach mach.o
; ----------------------------------------------------------------------------------------
global _main
extern _printf
extern _mach_absolute_time
extern _mach_timebase_info
extern _nanosleep
default rel
section .text
_main:
push rbx ; aligns the stack x C calls
; start measurement
call _mach_absolute_time ; get the absolute time hardware dependant
mov [start], rax ; save start in start
; print start
lea rdi, [time_absolute]
mov rsi, rax
call _printf
; do some time intensive stuff - This simulates 1 sec work
lea rdi, [timeval]
call _nanosleep
; end measurement
call _mach_absolute_time
mov [end], rax
; print end
lea rdi, [time_absolute]
mov rsi, rax
call _printf
; calc elapsed
mov r10d, [end]
mov r11d, [start]
sub r10d, r11d ; r10d = end - start
mov [diff], r10d ; copy to diff
mov rax, [diff] ; diff to rax to print as int
cvtsi2ss xmm2, r10d ; diff to xmm2 to calc nanoseconds
; print elapsed
lea rdi, [diff_absolute]
mov rsi, rax
call _printf
; get conversion factor to get nanoseconds and store numerator and denominator
; in xmm0 and xmm1
lea rdi, [timebase_info]
call _mach_timebase_info ; get conversion factor to nanoseconds
movss xmm0, [numer]
movss xmm1, [denom]
; print numerator & denominator as float to ensure I am getting the info into xmm regs
lea rdi, [time_base]
mov rax, 2
call _printf
; calc nanoseconds - xmm0 ends with nanoseconds
mulss xmm0, xmm2 ; multiply elapsed * numerator
divss xmm0, xmm1 ; divide by the denominator
; print nanoseconds as float
lea rdi, [nanosecs_calc]
mov rax, 1 ; 1 non-int argument
call _printf
pop rbx ; undoes the stack alignment push
ret
section .data
; _mach_timebase_info call struct
timebase_info:
numer db 8
denom db 8
; lazy way to set up 1 sec wait
timeval:
tv_sec dq 1
tv_usec dq 0
time_absolute: db "mach_absoute_time: %ld", 10, 0
diff_absolute: db "absoute_time diff: %ld", 10, 0
time_base: db "numerator: %g, denominator: %g", 10, 0
nanosecs_calc: db "calc nanoseconds: %ld", 10, 0
; using %g format also prints garbage
; nanosecs_calc: db "calc nanoseconds: %g", 10, 0
; should use registers but for clarity
start: dq 0
end: dq 0
diff: dq 0

EDIT: I know what was wrong. xmm reges got cleared after c calls and that is why multiplication and result failed. Anyway the C workaround below to get the timebase ratio works ok, with the full code of the test below.
The workaround is to get the ratio from mach_timebase_info from a short C function, and multiply that to the result from mach_absolute_time to get nanoseconds.
As suspected in my actual hardware (late 2013 MBP 2.3 i7), mach_absolute_time already returns nanoseconds so the factor as printed by C is 1.000.
(timebase numerator = 1, timabase denominator = 1)
#include <stdio.h>
#include <mach/mach_time.h>
double timebase() {
double ratio;
mach_timebase_info_data_t tb;
mach_timebase_info(&tb);
ratio = tb.numer / tb.denom;
printf("num: %u, den: %u\n", tb.numer, tb.denom);
printf("ratio from C: %.3f\n", ratio);
return ratio;
}
NASM:
global _main
extern _printf
extern _mach_absolute_time
extern _timebase
extern _nanosleep
default rel
section .text
_main:
push rbx ; aligns the stack x C calls
; start measurement
call _mach_absolute_time ; get the absolute time hardware dependant
mov [start], rax ; save start in start
; print start
lea rdi, [time_absolute]
mov rsi, rax
call _printf
; do some time intensive stuff - This simulates 1 sec work
lea rdi, [timeval]
call _nanosleep
; end measurement
call _mach_absolute_time
mov [end], rax
; print end
lea rdi, [time_absolute]
mov rsi, rax
call _printf
; calc elapsed
mov r10d, [end]
mov r11d, [start]
sub r10d, r11d ; r10d = end - start
mov [diff], r10d ; copy to diff
mov rax, [diff] ; diff to rax to print as int
; print elapsed
lea rdi, [diff_absolute]
mov rsi, [diff]
call _printf
; get conversion ratio from C function
call _timebase ; get conversion ratio to nanoseconds into xmm0
cvtsi2sd xmm1, [diff] ; load diff from mach_absolute time in [diff]
; if you do it before register gets cleared
; calc nanoseconds - xmm0 ends with nanoseconds
; in my hardware ratio is 1.0 so mach_absolute_time = nanoseconds
mulsd xmm0, xmm1
cvtsd2si rax, xmm0
mov [result], rax ; save to result
; print nanoseconds as int
lea rdi, [nanosecs_calc]
mov rsi, [result]
call _printf
pop rbx ; undoes the stack alignment push
ret
section .data
; lazy way to set up 1 sec wait
timeval:
tv_sec dq 1
tv_usec dq 0
time_absolute: db "mach_absoute_time: %ld", 10, 0
diff_absolute: db "absoute_time diff: %ld", 10, 0
nanosecs_calc: db "nanoseconds: %ld", 10, 0
; should use registers but for clarity
start: dq 0
end: dq 0
diff: dq 0
result: dq 0

Performance of <: Any in Julia

I am new to Julia, and I am trying to understand performance implications of some of the constructs before getting used to bad habits. Currently, I am trying to understand the type system of Julia, especially the <: Any type annotation. As far as I understand, <: Any should stand for I don't care about the type.
Consider the following code
struct Container{T}
parametric::T
nonparametric::Int64
end
struct TypeAny
payload::Container{<: Any}
end
struct TypeKnown
payload::Container{Array{Int64,1}}
end
getparametric(x) = x.payload.parametric[1]
getnonparametric(x) = x.payload.nonparametric
xany = TypeAny(Container([1], 2))
xknown = TypeKnown(Container([1], 2))
#time for i in 1:10000000 getparametric(xany) end # 0.212002s
#time for i in 1:10000000 getparametric(xknown) end # 0.110531s
#time for i in 1:10000000 getnonparametric(xany) end # 0.173390s
#time for i in 1:10000000 getnonparametric(xknown) end # 0.086739s
First of all, I was surprised that getparametric(xany) works in the first place when it operates on a field Container{<: Any}.parametric of unknown type. How is that possible and what are the performance implications of such construct? Is Julia doing some kind of runtime reflection behind the scenes to make this possible, or something more sophisticated is going on?
Second, I was surprised by the difference in runtime between the calls getnonparametric(xany) and getnonparametric(xknown) which contradicts my intuition of using type annotation <: Any as an I don't care annotation. Why the call to getnonparametric(xany) is significantly slower, even though I use only a field of known type? And how to ignore type in case I do not want to use any variables of that type without taking a performance hit? (In my use case, it seems not to be possible to specify the concrete type as that would lead to infinitely recursive type definitions - but that could be caused by improper design of my code which is out of the scope of this question.)

<: Any should stand for I don't care about the type.
It is something like it can be any type (so compiler does not get any hint about the type). You could also write it as:
struct TypeAny
payload::Container
end
which is essentially the same as you can check using the following test:
julia> Container{<:Any} <: Container
true
julia> Container <: Container{<:Any}
true
How is that possible and what are the performance implications of such construct?
The performance implication is that the concrete type of the object you hold in your container is determined in run-time not in compile time (just as you have suspected).
Note however that if you pass such extracted object to a function then after a dynamic dispatch inside the called function the code will run fast (as it will be type stable). You can read more about it here.
or something more sophisticated is going on?
The more sophisticated thing happens for bits types. If a concrete bits type is a field in a container then it is stored as value. If its type is not known at compile time it will be stored as reference (which has yet additional memory and run time impact).
I was surprised by the difference in runtime between the calls
As commented above, the difference is due to the fact that at compile time the type of the field is not known. If you changed your definition to:
struct TypeAny{T}
payload::Container{T}
end
then you say I do not care about type, but store it in a parameter, so that compiler knows this type.
Then the type of payload would be known at compile time and all would be fast.
If something I have written above is not clear or you need some more explanations please comment and I will expand the answer.
As a side note - it is usually better to use BenchmarkTools.jl for performance analysis of your code (unless you want to measure compilation time also).
EDIT
Look at:
julia> loop(x) = for i in 1:10000000 getnonparametric(x) end
loop (generic function with 1 method)
julia> #code_native loop(xknown)
.text
; ┌ # REPL[14]:1 within `loop'
pushq %rbp
movq %rsp, %rbp
pushq %rax
movq %rdx, -8(%rbp)
movl $74776584, %eax # imm = 0x4750008
addq $8, %rsp
popq %rbp
retq
; └
julia> #code_native loop(xany)
.text
; ┌ # REPL[14]:1 within `loop'
pushq %rbp
movq %rsp, %rbp
pushq %rax
movq %rdx, -8(%rbp)
movl $74776584, %eax # imm = 0x4750008
addq $8, %rsp
popq %rbp
retq
; └
And you see that the compiler is smart enough to optimize-out the whole loop (as it is essentially a no-op). This is a power of Julia (but on the other hand - makes benchmarking hard sometimes).
Here is an example that shows you a more accurate view (note that I use a more complex expression, as even very simple expressions in loops can be optimized out by the compiler):
julia> xknowns = fill(xknown, 10^6);
julia> xanys = fill(xany, 10^6);
julia> #btime sum(getnonparametric, $xanys)
12.373 ms (0 allocations: 0 bytes)
2000000
julia> #btime sum(getnonparametric, $xknowns)
519.700 μs (0 allocations: 0 bytes)
2000000
Note that even in this case the compiler is "smart enough" to properly infer the return type of the expression in both cases as you access nonparametric field in both cases:
julia> #code_warntype sum(getnonparametric, xanys)
Variables
#self#::Core.Compiler.Const(sum, false)
f::Core.Compiler.Const(getnonparametric, false)
a::Array{TypeAny,1}
Body::Int64
1 ─ nothing
│ %2 = Base.:(#sum#559)(Base.:(:), #self#, f, a)::Int64
└── return %2
julia> #code_warntype sum(getnonparametric, xknowns)
Variables
#self#::Core.Compiler.Const(sum, false)
f::Core.Compiler.Const(getnonparametric, false)
a::Array{TypeKnown,1}
Body::Int64
1 ─ nothing
│ %2 = Base.:(#sum#559)(Base.:(:), #self#, f, a)::Int64
└── return %2
The core of the difference can be seen when you look at native code generated in both cases:
julia> #code_native getnonparametric(xany)
.text
; ┌ # REPL[6]:1 within `getnonparametric'
pushq %rbp
movq %rsp, %rbp
; │┌ # Base.jl:20 within `getproperty'
subq $48, %rsp
movq (%rcx), %rax
movq %rax, -16(%rbp)
movq $75966808, -8(%rbp) # imm = 0x4872958
movabsq $jl_f_getfield, %rax
leaq -16(%rbp), %rdx
xorl %ecx, %ecx
movl $2, %r8d
callq *%rax
; │└
movq (%rax), %rax
addq $48, %rsp
popq %rbp
retq
nopl (%rax,%rax)
; └
julia> #code_native getnonparametric(xknown)
.text
; ┌ # REPL[6]:1 within `getnonparametric'
pushq %rbp
movq %rsp, %rbp
; │┌ # Base.jl:20 within `getproperty'
movq (%rcx), %rax
; │└
movq 8(%rax), %rax
popq %rbp
retq
nopl (%rax)
; └
If you add parameter to the type all is working as expected:
julia> struct Container{T}
parametric::T
nonparametric::Int64
end
julia> struct TypeAny2{T}
payload::Container{T}
end
julia> xany2 = TypeAny2(Container([1], 2))
TypeAny2{Array{Int64,1}}(Container{Array{Int64,1}}([1], 2))
julia> #code_native getnonparametric(xany2)
.text
; ┌ # REPL[9]:1 within `getnonparametric'
pushq %rbp
movq %rsp, %rbp
; │┌ # Base.jl:20 within `getproperty'
movq (%rcx), %rax
; │└
movq 8(%rax), %rax
popq %rbp
retq
nopl (%rax)
; └
And you have:
julia> xany2s = fill(xany2, 10^6);
julia> #btime sum(getnonparametric, $xany2s)
528.699 μs (0 allocations: 0 bytes)
2000000
Summary
Always try to use containers that do not have fields of abstract type if you want performance.
Sometimes if condition in point 1. is not met the compiler can handle it efficiently and generate a fast machine code, but it is not guaranteed in general (so still the recommendation in point 1. applies).

What is the reason for different performance of the same implementation using icc, gcc and clang?

I have implemented a program for a[i]=a[i-1]+c and I represent it her. I use begin_rdtsc and end_rdtsc to read and store the rdtsc to measure the speedup.
The program is as follows, I use x86intrin.h
#define MAX1 512
#define LEN MAX1*MAX1 //array size for time measure ments
int __attribute__(( aligned(32))) a[LEN];
int main(){
singleCore // It's a macro to assign the program to a single core of the processor
int i, b, c;
begin_rdtsc
// b=1 and c=2 in this case
b = 1;
c = 2;
i = 0;
a[i++] = b;//0 --> a[0] = 1
//step 1:
//solving dependencies vectorization factor is 8
a[i++] = a[0] + 1*c; //1 --> a[1] = 1 + 2 = 3
a[i++] = a[0] + 2*c; //2 --> a[2] = 1 + 4 = 5
a[i++] = a[0] + 3*c; //3 --> a[3] = 1 + 6 = 7
a[i++] = a[0] + 4*c; //4 --> a[4] = 1 + 8 = 9
a[i++] = a[0] + 5*c; //5 --> a[5] = 1 + 10 = 11
a[i++] = a[0] + 6*c; //6 --> a[6] = 1 + 12 = 13
a[i++] = a[0] + 7*c; //7 --> a[7] = 1 + 14 = 15
// vectorization factor reached
// 8 *c will work for all
//loading the results to an vector
__m256i dep1;
//__m256i dep2; // dep = { 1, 3, 5, 7, 9, 11, 13, 15 }
__m256i coeff = _mm256_set1_epi32(8*c); //coeff = { 16, 16, 16, 16, 16, 16, 16, 16 }
//step2
for(; i<LEN-1; i+=8){
dep1 = _mm256_load_si256((__m256i *) &a[i-8]);
dep1 = _mm256_add_epi32(dep1, coeff);
_mm256_store_si256((__m256i *) &a[i], dep1);
}
end_rdtsc
return 0;
}
I compiled this program with different compilers. My compilers are :
icc 18, gcc 7.2, clang 4.
The OS is fedora 27.
The CPU is Corei7 6700HQ (Skylake)
The scalar implementation which is compiled with icc -D _GNU_SOURCE -O3 -no-vec -march=native is the baseline for speedup measurements.
The asm output for each compiler is as follows: Because the behavior of ICC is not normal I copied all the code for icc. I marked the section in C program ("mm...mm1/2").
ICC
# mark_description "Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.1.163 Build 20171018";
# mark_description "-D _GNU_SOURCE -O3 -no-vec -march=native -c -S -o AIC3iccnovec";
.file "AIC3.c"
.text
..TXTST0:
.L_2__routine_start_main_0:
# -- Begin main
.text
# mark_begin;
.align 16,0x90
.globl main
# --- main()
main:
..B1.1: # Preds ..B1.0
# Execution count [1.00e+00]
.cfi_startproc
..___tag_value_main.1:
..L2:
#7.11
pushq %rbp #7.11
.cfi_def_cfa_offset 16
movq %rsp, %rbp #7.11
.cfi_def_cfa 6, 16
.cfi_offset 6, -16
andq $-128, %rsp #7.11
subq $128, %rsp #7.11
xorl %esi, %esi #7.11
movl $3, %edi #7.11
call __intel_new_feature_proc_init #7.11
# LOE rbx r12 r13 r14 r15
..B1.21: # Preds ..B1.1
# Execution count [1.00e+00]
vstmxcsr (%rsp) #7.11
vpxor %ymm0, %ymm0, %ymm0 #9.2
orl $32832, (%rsp) #7.11
vldmxcsr (%rsp) #7.11
vmovups %ymm0, mask(%rip) #9.2
vmovups %ymm0, 32+mask(%rip) #9.2
vmovups %ymm0, 64+mask(%rip) #9.2
vmovups %ymm0, 96+mask(%rip) #9.2
# LOE rbx r12 r13 r14 r15
..B1.2: # Preds ..B1.21
# Execution count [5.00e-01]
xorl %edi, %edi #9.2
movl $128, %esi #9.2
movl $mask, %edx #9.2
orq $12, mask(%rip) #9.2
vzeroupper #9.2
..___tag_value_main.6:
# sched_setaffinity(__pid_t, size_t, const cpu_set_t *)
call sched_setaffinity #9.2
..___tag_value_main.7:
# LOE rbx r12 r13 r14 r15
..B1.3: # Preds ..B1.2
# Execution count [1.72e+00]
movq $0xdf84757ff, %rax #12.5
movq $.L_2__STRING.1, programName(%rip) #10.2
movq $100000000, elapsed_rdtsc(%rip) #12.5
movq %rax, overal_time(%rip) #12.5
movq $0, ttime(%rip) #12.5
vmovdqu .L_2il0floatpacket.2(%rip), %ymm0 #33.21
# LOE rbx r12 r13 r14 r15
..B1.4: # Preds ..B1.12 ..B1.3
# Execution count [2.91e+00]
# Begin ASM
# #mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm1
# End ASM
# LOE rbx r12 r13 r14 r15
..B1.23: # Preds ..B1.4
# Execution count [2.91e+00]
vzeroupper #12.5
rdtsc #12.5
shlq $32, %rdx #12.5
orq %rdx, %rax #12.5
# LOE rax rbx r12 r13 r14 r15
..B1.5: # Preds ..B1.23
# Execution count [2.62e+00]
movq %rax, t1_rdtsc(%rip) #12.5
xorl %edx, %edx #35.5
movl $1, a(%rip) #18.5
xorl %eax, %eax #35.5
movl $3, 4+a(%rip) #21.5
movl $5, 8+a(%rip) #21.5
movl $7, 12+a(%rip) #21.5
movl $9, 16+a(%rip) #21.5
movl $11, 20+a(%rip) #21.5
movl $13, 24+a(%rip) #21.5
movl $15, 28+a(%rip) #21.5
vmovdqu .L_2il0floatpacket.2(%rip), %ymm1 #35.5
# LOE rax rbx r12 r13 r14 r15 edx ymm1
..B1.6: # Preds ..B1.6 ..B1.5
# Execution count [4.29e+04]
vpaddd a(%rax), %ymm1, %ymm0 #38.16
incl %edx #35.5
vmovdqu %ymm0, 32+a(%rax) #39.41
addq $32, %rax #35.5
cmpl $2047, %edx #35.5
jb ..B1.6 # Prob 99% #35.5
# LOE rax rbx r12 r13 r14 r15 edx ymm1
..B1.7: # Preds ..B1.6
# Execution count [2.91e+00]
vzeroupper #46.5
rdtsc #46.5
shlq $32, %rdx #46.5
orq %rdx, %rax #46.5
# LOE rax rbx r12 r13 r14 r15
..B1.8: # Preds ..B1.7
# Execution count [2.91e+00]
movq %rax, t2_rdtsc(%rip) #46.5
# LOE rbx r12 r13 r14 r15
..B1.26: # Preds ..B1.8
# Execution count [2.91e+00]
# Begin ASM
# #mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm2
# End ASM
# LOE rbx r12 r13 r14 r15
..B1.25: # Preds ..B1.26
# Execution count [2.91e+00]
movq t2_rdtsc(%rip), %rdx #46.5
subq t1_rdtsc(%rip), %rdx #46.5
movq ttbest_rdtsc(%rip), %rsi #46.5
movq %rdx, ttotal_rdtsc(%rip) #46.5
cmpq %rsi, %rdx #46.5
jge ..B1.10 # Prob 50% #46.5
# LOE rdx rbx rsi r12 r13 r14 r15
..B1.9: # Preds ..B1.25
# Execution count [1.45e+00]
movq elapsed_rdtsc(%rip), %rcx #46.5
movq %rcx, %rax #46.5
negq %rax #46.5
movq %rdx, %rsi #46.5
addq $100000000, %rax #46.5
movq %rdx, ttbest_rdtsc(%rip) #46.5
movq %rax, elapsed(%rip) #46.5
jmp ..B1.11 # Prob 100% #46.5
# LOE rdx rcx rbx rsi r12 r13 r14 r15
..B1.10: # Preds ..B1.25
# Execution count [1.45e+00]
movq elapsed_rdtsc(%rip), %rcx #46.5
# LOE rdx rcx rbx rsi r12 r13 r14 r15
..B1.11: # Preds ..B1.9 ..B1.10
# Execution count [2.91e+00]
movq ttime(%rip), %rax #46.5
addq %rdx, %rax #46.5
movq %rax, ttime(%rip) #46.5
testq %rcx, %rcx #46.5
je ..B1.14 # Prob 50% #46.5
# LOE rax rcx rbx rsi r12 r13 r14 r15
..B1.12: # Preds ..B1.11
# Execution count [1.45e+00]
decq %rcx #46.5
movq %rcx, elapsed_rdtsc(%rip) #46.5
cmpq overal_time(%rip), %rax #46.5
jl ..B1.4 # Prob 82% #46.5
jmp ..B1.15 # Prob 100% #46.5
# LOE rcx rbx rsi r12 r13 r14 r15
..B1.14: # Preds ..B1.11
# Execution count [1.45e+00]
movq $-1, elapsed_rdtsc(%rip) #46.5
movq $-1, %rcx #46.5
# LOE rcx rbx rsi r12 r13 r14 r15
..B1.15: # Preds ..B1.12 ..B1.14
# Execution count [1.00e+00]
negq %rcx #46.5
movl $.L_2__STRING.2, %edi #46.5
addq $100000000, %rcx #46.5
xorl %eax, %eax #46.5
movq elapsed(%rip), %rdx #46.5
..___tag_value_main.8:
# printf(const char *__restrict__, ...)
call printf #46.5
..___tag_value_main.9:
# LOE rbx r12 r13 r14 r15
..B1.16: # Preds ..B1.15
# Execution count [1.00e+00]
movl $.L_2__STRING.3, %edi #46.5
movl $.L_2__STRING.4, %esi #46.5
# fopen(const char *__restrict__, const char *__restrict__)
call fopen #46.5
# LOE rax rbx r12 r13 r14 r15
..B1.17: # Preds ..B1.16
# Execution count [1.00e+00]
movl $128, %ecx #46.5
movq %rax, %rdi #46.5
movq %rax, fileForSpeedups(%rip) #46.5
movl $.L_2__STRING.5, %esi #46.5
movl %ecx, %r8d #46.5
xorl %eax, %eax #46.5
movq programName(%rip), %rdx #46.5
movq ttbest_rdtsc(%rip), %r9 #46.5
# fprintf(FILE *__restrict__, const char *__restrict__, ...)
call fprintf #46.5
# LOE rbx r12 r13 r14 r15
..B1.18: # Preds ..B1.17
# Execution count [1.00e+00]
xorl %eax, %eax #47.9
movq %rbp, %rsp #47.9
popq %rbp #47.9
.cfi_def_cfa 7, 8
.cfi_restore 6
ret #47.9
.align 16,0x90
# LOE
.cfi_endproc
# mark_end;
.type main,#function
.size main,.-main
..LNmain.0:
.data
# -- End main
.bss
.align 8
.align 8
.globl fileForSpeedups
fileForSpeedups:
.type fileForSpeedups,#object
.size fileForSpeedups,8
.space 8 # pad
.align 8
.globl ttime
ttime:
.type ttime,#object
.size ttime,8
.space 8 # pad
.data
.align 8
.align 8
.globl programName
programName:
.quad .L_2__STRING.0
.type programName,#object
.size programName,8
.align 8
.globl ttbest_rdtsc
ttbest_rdtsc:
.long 0x5d89ffff,0x01634578
.type ttbest_rdtsc,#object
.size ttbest_rdtsc,8
.align 8
.globl elapsed_rdtsc
elapsed_rdtsc:
.long 0x05f5e100,0x00000000
.type elapsed_rdtsc,#object
.size elapsed_rdtsc,8
.align 8
.globl overal_time
overal_time:
.long 0xf84757ff,0x0000000d
.type overal_time,#object
.size overal_time,8
.section .rodata, "a"
.align 32
.align 32
.L_2il0floatpacket.2:
.long 0x00000010,0x00000010,0x00000010,0x00000010,0x00000010,0x00000010,0x00000010,0x00000010
.type .L_2il0floatpacket.2,#object
.size .L_2il0floatpacket.2,32
.section .rodata.str1.4, "aMS",#progbits,1
.align 4
.align 4
.L_2__STRING.1:
.long 860047681
.byte 0
.type .L_2__STRING.1,#object
.size .L_2__STRING.1,5
.space 3, 0x00 # pad
.align 4
.L_2__STRING.2:
.long 1701344266
.long 1936024096
.long 1936269428
.long 1819026720
.long 1852383332
.long 1819026720
.long 543716452
.long 1919251561
.long 1869182049
.long 1851859054
.long 1814372452
.long 1914725484
.long 1952804965
.long 1869182057
.long 684910
.type .L_2__STRING.2,#object
.size .L_2__STRING.2,60
.align 4
.L_2__STRING.3:
.long 1701603686
.long 1400008518
.long 1684366704
.long 7565429
.type .L_2__STRING.3,#object
.size .L_2__STRING.3,16
.align 4
.L_2__STRING.4:
.word 97
.type .L_2__STRING.4,#object
.size .L_2__STRING.4,2
.space 2, 0x00 # pad
.align 4
.L_2__STRING.5:
.long 539783973
.long 628646949
.long 622865508
.long 174353516
.byte 0
.type .L_2__STRING.5,#object
.size .L_2__STRING.5,17
.space 3, 0x00 # pad
.align 4
.L_2__STRING.0:
.word 32
.type .L_2__STRING.0,#object
.size .L_2__STRING.0,2
.data
.comm mask1,128,32
.comm t1_rdtsc,8,8
.comm t2_rdtsc,8,8
.comm ttotal_rdtsc,8,8
.comm elapsed,8,8
.comm mask,128,32
.comm a,65536,32
.section .note.GNU-stack, ""
// -- Begin DWARF2 SEGMENT .eh_frame
.section .eh_frame,"a",#progbits
.eh_frame_seg:
.align 8
# End
GCC
//gcc -D _GNU_SOURCE -O3 -fno-tree-vectorize -fno-tree-slp-vectorize -march=native -c -S -o "AIC3" "AIC3.c"
rdtsc
salq $32, %rdx
movq %r10, a(%rip)
orq %rdx, %rax
movq %r9, a+8(%rip)
movq %r8, a+16(%rip)
movq %rdi, a+24(%rip)
vmovdqa a(%rip), %ymm1
movq %rax, t1_rdtsc(%rip)
movl $a+32, %eax
.p2align 4,,10
.p2align 3
.L2:
vpaddd %ymm1, %ymm2, %ymm0
addq $32, %rax
vmovdqa %ymm0, -32(%rax)
vmovdqa %ymm0, %ymm1
cmpq %rax, %rcx
jne .L2
rdtsc
Clang
//clang -D _GNU_SOURCE -O3 -fno-vectorize -fno-slp-vectorize -march=native -c -S -o "AIC3"clang "
rdtsc
shlq $32, %rdx
orq %rax, %rdx
movq %rdx, t1_rdtsc(%rip)
movq %r8, a(%rip)
movq %r9, a+8(%rip)
movq %r10, a+16(%rip)
movq %rcx, a+24(%rip)
vmovdqa a(%rip), %ymm8
movl $64, %eax
jmp .LBB0_2
.p2align 4, 0x90
.LBB0_9: # in Loop: Header=BB0_2 Depth=2
vpaddd %ymm7, %ymm8, %ymm8
vmovdqa %ymm8, a(,%rax,4)
addq $64, %rax
.LBB0_2: # Parent Loop BB0_1 Depth=1
# => This Inner Loop Header: Depth=2
vpaddd %ymm0, %ymm8, %ymm9
vmovdqa %ymm9, a-224(,%rax,4)
vpaddd %ymm1, %ymm8, %ymm9
vmovdqa %ymm9, a-192(,%rax,4)
vpaddd %ymm2, %ymm8, %ymm9
vmovdqa %ymm9, a-160(,%rax,4)
vpaddd %ymm3, %ymm8, %ymm9
vmovdqa %ymm9, a-128(,%rax,4)
vpaddd %ymm4, %ymm8, %ymm9
vmovdqa %ymm9, a-96(,%rax,4)
vpaddd %ymm5, %ymm8, %ymm9
vmovdqa %ymm9, a-64(,%rax,4)
vpaddd %ymm6, %ymm8, %ymm9
vmovdqa %ymm9, a-32(,%rax,4)
cmpq $16383, %rax # imm = 0x3FFF
jl .LBB0_9
# BB#3: # in Loop: Header=BB0_1 Depth=1
rdtsc
The speedups are ~1.30, ~4.10 and 4.00 using icc, gcc and clang, respectively.
As I mentioned, I've compiled the same code with different compilers and recorder the rdtsc. speedup for ICC is not as I expected.
I used IACA to watch the inner loop, the summarized output is:
-----------------------------------------------------
| compilers | icc | gcc | clang |
------------------------------------------------------
| Throughput |1.49 cycle |1.00 cycle |1.49 cycle |
------------------------------------------------------
| bottleneck | Front End | dependency | Front End |
------------------------------------------------------
UPDATE-0 : I've compared with and without IACA generated codes. The reason that IACA does not help, in this case, is the outputs are not the same. It seems injecting the IACA marks forces the compilers to stop their optimization, GCC has the same generated code as ICC and Clang has. But, calculating the addresses in GCC is more efficient in throughput point of view. In summary, IACA cannot help for this code.
UPDATE-1 : The outputs for perf is as follows:
512*512
ICC:
86.06 │loop: vpaddd 0x604580(%rax),%ymm1,%ymm0
0.17 │ inc %edx
4.73 │ vmovdq %ymm0,0x6045a0(%rax)
│ add $0x20,%rax
│ cmp $0x7fff,%edx
8.98 │ jb loop
GCC:
30.62 │loop: vpaddd %ymm1,%ymm2,%ymm0
15.12 │ add $0x20,%rax
46.03 │ vmovdq %ymm0,-0x20(%rax)
2.40 │ vmovdq %ymm0,%ymm1
0.01 │ cmp %rax,%rcx
5.62 │ jne loop
LLVM:
3.00 │loop: vpaddd %ymm0,%ymm7,%ymm8
6.61 │ vmovdq %ymm8,0x6020e0(,%rax,4)
15.96 │ vpaddd %ymm1,%ymm7,%ymm8
5.19 │ vmovdq %ymm8,0x602100(,%rax,4)
1.89 │ vpaddd %ymm2,%ymm7,%ymm8
6.16 │ vmovdq %ymm8,0x602120(,%rax,4)
13.25 │ vpaddd %ymm3,%ymm7,%ymm8
8.01 │ vmovdq %ymm8,0x602140(,%rax,4)
2.10 │ vpaddd %ymm4,%ymm7,%ymm8
5.37 │ vmovdq %ymm8,0x602160(,%rax,4)
13.92 │ vpaddd %ymm5,%ymm7,%ymm8
7.95 │ vmovdq %ymm8,0x602180(,%rax,4)
0.89 │ vpaddd %ymm6,%ymm7,%ymm7
4.34 │ vmovdq %ymm7,0x6021a0(,%rax,4)
2.82 │ add $0x38,%rax
│ cmp $0x3ffff,%rax
2.24 │ jl loop
The ICC assembly output show that there is some SIMD instructions inside the rdtsc. If I miss something, or something is wrong I really have no idea. I spent a lot of time to realize the problem but zero achievement. Please, if somebody knows the reason help me.
Thanks in advance.

The different compilers actually use fairly different implementation strategies here.
GCC notices that it never has to re-load a[i-8] which was calculated in the previous iteration and therefore can be sourced from a register. This relies on mov-elimination somewhat, otherwise the reg-reg move would still add some latency, though even without mov-elimination it would be a lot faster than reloading every time.
ICC's codegen is very naive, it just does it exactly the way you wrote it. The store/reload adds quite a lot of latency.
Clang does approximately the same thing as GCC, but unrolls by 8 (minus the first iteration). Clang often likes to unroll more. I'm not sure why it's slightly worse than what GCC does.
You can avoid the reloading by explicitly not doing it in the first place: (not tested)
dep1 = _mm256_load_si256((__m256i *) &a[0]);
for(; i<LEN-1; i+=8){
dep1 = _mm256_add_epi32(dep1, coeff);
_mm256_store_si256((__m256i *) &a[i], dep1);
}

x86: Long loop-carried dependency chain. Why 13 cycles?

I modified the code from a previous experiment (Agner Fog's Optimizing Assembly, example 12.10a) to make it more dependent:
movsd xmm2, [x]
movsd xmm1, [one]
xorps xmm0, xmm0
mov eax, coeff
L1:
movsd xmm3, [eax]
mulsd xmm3, xmm1
mulsd xmm1, xmm2
addsd xmm1, xmm3
add eax, 8
cmp eax, coeff_end
jb L1
And now it takes ~13 cycles per iteration, but I have no idea why so much.
Please help me understand.
(update)
I'm sorry. Yes, definetely #Peter Cordes is right- it takes 9 cycles per iteration in fact. The misunderstanding is caused by myself. I missed two similar pieces of codes ( instructions swapped), the 13-cycles code is here:
movsd xmm2, [x]
movsd xmm1, [one]
xorps xmm0, xmm0
mov eax, coeff
L1:
movsd xmm3, [eax]
mulsd xmm1, xmm2
mulsd xmm3, xmm1
addsd xmm1, xmm3
add eax, 8
cmp eax, coeff_end
jb L1

It runs at exactly one iteration per 9c for me, on a Core2 E6600, which is expected:
movsd xmm3, [eax] ; independent, depends only on eax
A: mulsd xmm3, xmm1 ; 5c: depends on xmm1:C from last iteration
B: mulsd xmm1, xmm2 ; 5c: depends on xmm1:C from last iteration
C: addsd xmm1, xmm3 ; 3c: depends on xmm1:B from THIS iteration (and xmm3:A from this iteration)
When xmm1:C is ready from iteration i, the next iteration can start calculating:
A: producing xmm3:A in 5c
B: producing xmm1:B in 5c (but there's a resource conflict; these multiplies can't both start in the same cycle in Core2 or IvyBridge, only Haswell and later)
Regardless of which one runs first, both have to finish before C can run. So the loop-carried dependency chain is 5 + 3 cycles, +1c for the resource conflict that stops both multiplies from starting in the same cycle.
Test code that runs at the expected speed:
This slows down to one iteration per ~11c when the array is 8B * 128 * 1024. If you're testing with an even bigger array instead of using a repeat-loop around what you posted, then that's why you're seeing a higher latency.
If a load arrives late, there's no way for the CPU to "catch up", since it delays the loop-carried dependency chain. If the load was only needed in a dependency chain that forked off from the loop-carried chain, then the pipeline could absorb an occasional slow load more easily. So, some loops can be more sensitive to memory delays than others.
default REL
%macro IACA_start 0
mov ebx, 111
db 0x64, 0x67, 0x90
%endmacro
%macro IACA_end 0
mov ebx, 222
db 0x64, 0x67, 0x90
%endmacro
global _start
_start:
movsd xmm2, [x]
movsd xmm1, [one]
xorps xmm0, xmm0
mov ecx, 10000
outer_loop:
mov eax, coeff
IACA_start ; outside the loop
ALIGN 32 ; this matters on Core2, .78 insn per cycle vs. 0.63 without
L1:
movsd xmm3, [eax]
mulsd xmm3, xmm1
mulsd xmm1, xmm2
addsd xmm1, xmm3
add eax, 8
cmp eax, coeff_end
jb L1
IACA_end
dec ecx
jnz outer_loop
;mov eax, 1
;int 0x80 ; exit() for 32bit code
xor edi, edi
mov eax, 231 ; exit_group(0). __NR_exit = 60.
syscall
section .data
x:
one: dq 1.0
section .bss
coeff: resq 24*1024 ; 6 * L1 size. Doesn't run any faster when it fits in L1 (resb)
coeff_end:
Experimental test
$ asm-link interiteration-test.asm
+ yasm -felf64 -Worphan-labels -gdwarf2 interiteration-test.asm
+ ld -o interiteration-test interiteration-test.o
$ perf stat ./interiteration-test
Performance counter stats for './interiteration-test':
928.543744 task-clock (msec) # 0.995 CPUs utilized
152 context-switches # 0.164 K/sec
1 cpu-migrations # 0.001 K/sec
52 page-faults # 0.056 K/sec
2,222,536,634 cycles # 2.394 GHz (50.14%)
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1,723,575,954 instructions # 0.78 insns per cycle (75.06%)
246,414,304 branches # 265.377 M/sec (75.16%)
51,483 branch-misses # 0.02% of all branches (74.74%)
0.933372495 seconds time elapsed
Each branch / every 7 instructions is one iteration of the inner loop.
$ bc -l
bc 1.06.95
1723575954 / 7
246225136.28571428571428571428
# ~= number of branches: good
2222536634 / .
9.026
# cycles per iteration
IACA agrees: 9c per iteration on IvB
(not counting the nops from ALIGN):
$ iaca.sh -arch IVB interiteration-test
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - interiteration-test
Binary Format - 64Bit
Architecture - IVB
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 9.00 Cycles Throughput Bottleneck: InterIteration
Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |
-------------------------------------------------------------------------
| Cycles | 2.0 0.0 | 1.0 | 0.5 0.5 | 0.5 0.5 | 0.0 | 2.0 |
-------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
# - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
---------------------------------------------------------------------
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | movsd xmm3, qword ptr [eax]
| 1 | 1.0 | | | | | | | mulsd xmm3, xmm1
| 1 | 1.0 | | | | | | CP | mulsd xmm1, xmm2
| 1 | | 1.0 | | | | | CP | addsd xmm1, xmm3
| 1 | | | | | | 1.0 | | add eax, 0x8
| 1 | | | | | | 1.0 | | cmp eax, 0x63011c
| 0F | | | | | | | | jb 0xffffffffffffffe7
Total Num Of Uops: 6

With the addsd change suggested in my comments above, --> addsd xmm0,xmm3, this can be coded to use the full width registers and the performance is twice as fast.
Loosely:
For the initial value of ones, it needs to be:
double ones[2] = { 1.0, x }
And we need to replace x with x2:
double x2[2] = { x * x, x * x }
If there is an odd number of coefficients, pad it with a zero to produce an even number of them.
And, changing the pointer increment to 16.
Here are the test results I got. I did a number of trials and took the ones that had the best time and elongated the time by doing 100 iterations. std is the C version, dbl is your version, and qed is the "wide" version:
R=1463870188
C=100
T=100
I=100
x: 3.467957099973322e+00 3.467957099973322e+00
one: 1.000000000000000e+00 3.467957099973322e+00
x2: 1.202672644725538e+01 1.202672644725538e+01
std: 2.803772098439484e+56 (ELAP: 0.000019312)
dbl: 2.803772098439484e+56 (ELAP: 0.000019312)
qed: 2.803772098439492e+56 (ELAP: 0.000009060)
rtn_loop: 2.179378907910304e+55 2.585834207648461e+56
rtn_shuf: 2.585834207648461e+56 2.179378907910304e+55
rtn_add: 2.803772098439492e+56 2.585834207648461e+56
This was done on an i7 920 # 2.67 GHz.
I think if you take the elapsed numbers and convert them, you'll see that your version is faster than you think.
I apologize, in advance, for switching to AT&T syntax as I had difficulty getting the assembler to work the other way. Again, sorry. Also, I'm using linux, so I used the rdi rsi registers to pass the coefficient pointers. If you're on windows, the ABI is different and you'll have to adjust for that.
I did a C version and diassembled it. It was virtually identical to your code except that it rearranged the non-xmm instructions a bit, which I've added below.
I believe I posted all the files, so you could conceivably run this on your system if you wished.
Here's the original code:
# xmmloop/dbl.s -- implement using single double
.globl dbl
# dbl -- compute result using single double
#
# arguments:
# rdi -- pointer to coeff vector
# rsi -- pointer to coeff vector end
dbl:
movsd x(%rip),%xmm2 # get x value
movsd one(%rip),%xmm1 # get ones
xorps %xmm0,%xmm0 # sum = 0
dbl_loop:
movsd (%rdi),%xmm3 # c[i]
add $8,%rdi # increment to next vector element
cmp %rsi,%rdi # done yet?
mulsd %xmm1,%xmm3 # c[i]*x^i
mulsd %xmm2,%xmm1 # x^(i+1)
addsd %xmm3,%xmm0 # sum += c[i]*x^i
jb dbl_loop # no, loop
retq
Here's the code changed to use the movapd et. al:
# xmmloop/qed.s -- implement using single double
.globl qed
# qed -- compute result using single double
#
# arguments:
# rdi -- pointer to coeff vector
# rsi -- pointer to coeff vector end
qed:
movapd x2(%rip),%xmm2 # get x^2 value
movapd one(%rip),%xmm1 # get [1,x]
xorpd %xmm4,%xmm4 # sum = 0
qed_loop:
movapd (%rdi),%xmm3 # c[i]
add $16,%rdi # increment to next coefficient
cmp %rsi,%rdi # done yet?
mulpd %xmm1,%xmm3 # c[i]*x^i
mulpd %xmm2,%xmm1 # x^(i+2)
addpd %xmm3,%xmm4 # sum += c[i]*x^i
jb qed_loop # no, loop
movapd %xmm4,rtn_loop(%rip) # save intermediate DEBUG
movapd %xmm4,%xmm0 # get lower sum
shufpd $1,%xmm4,%xmm4 # get upper value into lower half
movapd %xmm4,rtn_shuf(%rip) # save intermediate DEBUG
addsd %xmm4,%xmm0 # add upper sum to lower
movapd %xmm0,rtn_add(%rip) # save intermediate DEBUG
retq
Here's a C version of the code:
// xmmloop/std -- compute result using C code
#include <xmmloop.h>
// std -- compute result using C
double
std(const double *cur,const double *ep)
{
double xt;
double xn;
double ci;
double sum;
xt = x[0];
xn = one[0];
sum = 0;
for (; cur < ep; ++cur) {
ci = *cur; // get c[i]
ci *= xn; // c[i]*x^i
xn *= xt; // x^(i+1)
sum += ci; // sum += c[i]*x^i
}
return sum;
}
Here's the test program I used:
// xmmloop/xmmloop -- test program
#define _XMMLOOP_GLO_
#include <xmmloop.h>
// tvget -- get high precision time
double
tvget(void)
{
struct timespec ts;
double sec;
clock_gettime(CLOCK_REALTIME,&ts);
sec = ts.tv_nsec;
sec /= 1e9;
sec += ts.tv_sec;
return sec;
}
// timeit -- get best time
void
timeit(fnc_p proc,double *cofptr,double *cofend,const char *tag)
{
double tvbest;
double tvbeg;
double tvdif;
double sum;
sum = 0;
tvbest = 1e9;
for (int trycnt = 1; trycnt <= opt_T; ++trycnt) {
tvbeg = tvget();
for (int iter = 1; iter <= opt_I; ++iter)
sum = proc(cofptr,cofend);
tvdif = tvget();
tvdif -= tvbeg;
if (tvdif < tvbest)
tvbest = tvdif;
}
printf("%s: %.15e (ELAP: %.9f)\n",tag,sum,tvbest);
}
// main -- main program
int
main(int argc,char **argv)
{
char *cp;
double *cofptr;
double *cofend;
double *cur;
double val;
long rseed;
int cnt;
--argc;
++argv;
rseed = 0;
cnt = 0;
for (; argc > 0; --argc, ++argv) {
cp = *argv;
if (*cp != '-')
break;
switch (cp[1]) {
case 'C':
cp += 2;
cnt = strtol(cp,&cp,10);
break;
case 'R':
cp += 2;
rseed = strtol(cp,&cp,10);
break;
case 'T':
cp += 2;
opt_T = (*cp != 0) ? strtol(cp,&cp,10) : 1;
break;
case 'I':
cp += 2;
opt_I = (*cp != 0) ? strtol(cp,&cp,10) : 1;
break;
}
}
if (rseed == 0)
rseed = time(NULL);
srand48(rseed);
printf("R=%ld\n",rseed);
if (cnt == 0)
cnt = 100;
if (cnt & 1)
++cnt;
printf("C=%d\n",cnt);
if (opt_T == 0)
opt_T = 100;
printf("T=%d\n",opt_T);
if (opt_I == 0)
opt_I = 100;
printf("I=%d\n",opt_I);
cofptr = malloc(sizeof(double) * cnt);
cofend = &cofptr[cnt];
val = drand48();
for (; val < 3; val += 1.0);
x[0] = val;
x[1] = val;
DMP(x);
one[0] = 1.0;
one[1] = val;
DMP(one);
val *= val;
x2[0] = val;
x2[1] = val;
DMP(x2);
for (cur = cofptr; cur < cofend; ++cur) {
val = drand48();
val *= 1e3;
*cur = val;
}
timeit(std,cofptr,cofend,"std");
timeit(dbl,cofptr,cofend,"dbl");
timeit(qed,cofptr,cofend,"qed");
DMP(rtn_loop);
DMP(rtn_shuf);
DMP(rtn_add);
return 0;
}
And the header file:
// xmmloop/xmmloop.h -- common control
#ifndef _xmmloop_xmmloop_h_
#define _xmmloop_xmmloop_h_
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#ifdef _XMMLOOP_GLO_
#define EXTRN_XMMLOOP /**/
#else
#define EXTRN_XMMLOOP extern
#endif
#define XMMALIGN __attribute__((aligned(16)))
EXTRN_XMMLOOP int opt_T;
EXTRN_XMMLOOP int opt_I;
EXTRN_XMMLOOP double x[2] XMMALIGN;
EXTRN_XMMLOOP double x2[2] XMMALIGN;
EXTRN_XMMLOOP double one[2] XMMALIGN;
EXTRN_XMMLOOP double rtn_loop[2] XMMALIGN;
EXTRN_XMMLOOP double rtn_shuf[2] XMMALIGN;
EXTRN_XMMLOOP double rtn_add[2] XMMALIGN;
#define DMP(_sym) \
printf(#_sym ": %.15e %.15e\n",_sym[0],_sym[1]);
typedef double (*fnc_p)(const double *cofptr,const double *cofend);
double std(const double *cofptr,const double *cofend);
double dbl(const double *cofptr,const double *cofend);
double qed(const double *cofptr,const double *cofend);
#endif

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Julia: Struct with constant field to optimize performance - performance

Related

Faster bitwise AND operation on byte slices

NASM x64 on macOS using mach_absolute_time - Trouble getting nanoseconds (working code review)

Performance of <: Any in Julia

What is the reason for different performance of the same implementation using icc, gcc and clang?

x86: Long loop-carried dependency chain. Why 13 cycles?

Categories

Resources