Performance of <: Any in Julia

I am new to Julia, and I am trying to understand performance implications of some of the constructs before getting used to bad habits. Currently, I am trying to understand the type system of Julia, especially the <: Any type annotation. As far as I understand, <: Any should stand for I don't care about the type.
Consider the following code
struct Container{T}
parametric::T
nonparametric::Int64
end
struct TypeAny
payload::Container{<: Any}
end
struct TypeKnown
payload::Container{Array{Int64,1}}
end
getparametric(x) = x.payload.parametric[1]
getnonparametric(x) = x.payload.nonparametric
xany = TypeAny(Container([1], 2))
xknown = TypeKnown(Container([1], 2))
@time for i in 1:10000000 getparametric(xany) end # 0.212002s
@time for i in 1:10000000 getparametric(xknown) end # 0.110531s
@time for i in 1:10000000 getnonparametric(xany) end # 0.173390s
@time for i in 1:10000000 getnonparametric(xknown) end # 0.086739s
First of all, I was surprised that getparametric(xany) works in the first place when it operates on a field Container{<: Any}.parametric of unknown type. How is that possible and what are the performance implications of such construct? Is Julia doing some kind of runtime reflection behind the scenes to make this possible, or something more sophisticated is going on?
Second, I was surprised by the difference in runtime between the calls getnonparametric(xany) and getnonparametric(xknown), which contradicts my intuition of using the type annotation <: Any as an "I don't care" annotation. Why is the call to getnonparametric(xany) significantly slower, even though I only use a field of known type? And how can I ignore the type, when I do not want to use any variables of that type, without taking a performance hit? (In my use case, it seems impossible to specify the concrete type, as that would lead to infinitely recursive type definitions, but that could be caused by improper design of my code, which is out of the scope of this question.)

<: Any should stand for I don't care about the type.
It means roughly "this can be any type", so the compiler gets no hint about the concrete type. You could also write it as:
struct TypeAny
payload::Container
end
which is essentially the same, as you can check with the following test:
julia> Container{<:Any} <: Container
true
julia> Container <: Container{<:Any}
true
How is that possible and what are the performance implications of such construct?
The performance implication is that the concrete type of the object you hold in your container is determined at run time, not at compile time (just as you suspected).
Note, however, that if you pass such an extracted object to a function, then after a dynamic dispatch the code inside the called function will run fast (as it will be type stable). You can read more about it here.
or something more sophisticated is going on?
The more sophisticated thing happens for bits types. If a field has a concrete bits type, it is stored inline as a value. If the field's type is not known at compile time, it is stored as a reference (which costs additional memory and run time).
I was surprised by the difference in runtime between the calls
As commented above, the difference is due to the fact that at compile time the type of the field is not known. If you changed your definition to:
struct TypeAny{T}
payload::Container{T}
end
then you say "I do not care about the type, but store it in a parameter so that the compiler knows it".
The type of payload would then be known at compile time and everything would be fast.
If something I have written above is not clear or you need some more explanations please comment and I will expand the answer.
As a side note - it is usually better to use BenchmarkTools.jl for performance analysis of your code (unless you want to measure compilation time also).
EDIT
Look at:
julia> loop(x) = for i in 1:10000000 getnonparametric(x) end
loop (generic function with 1 method)
julia> @code_native loop(xknown)
.text
; ┌ @ REPL[14]:1 within `loop'
pushq %rbp
movq %rsp, %rbp
pushq %rax
movq %rdx, -8(%rbp)
movl $74776584, %eax # imm = 0x4750008
addq $8, %rsp
popq %rbp
retq
; └
julia> @code_native loop(xany)
.text
; ┌ @ REPL[14]:1 within `loop'
pushq %rbp
movq %rsp, %rbp
pushq %rax
movq %rdx, -8(%rbp)
movl $74776584, %eax # imm = 0x4750008
addq $8, %rsp
popq %rbp
retq
; └
And you see that the compiler is smart enough to optimize-out the whole loop (as it is essentially a no-op). This is a power of Julia (but on the other hand - makes benchmarking hard sometimes).
Here is an example that shows you a more accurate view (note that I use a more complex expression, as even very simple expressions in loops can be optimized out by the compiler):
julia> xknowns = fill(xknown, 10^6);
julia> xanys = fill(xany, 10^6);
julia> @btime sum(getnonparametric, $xanys)
12.373 ms (0 allocations: 0 bytes)
2000000
julia> @btime sum(getnonparametric, $xknowns)
519.700 μs (0 allocations: 0 bytes)
2000000
Note that even in this case the compiler is "smart enough" to properly infer the return type of the expression in both cases as you access nonparametric field in both cases:
julia> @code_warntype sum(getnonparametric, xanys)
Variables
#self#::Core.Compiler.Const(sum, false)
f::Core.Compiler.Const(getnonparametric, false)
a::Array{TypeAny,1}
Body::Int64
1 ─ nothing
│ %2 = Base.:(#sum#559)(Base.:(:), #self#, f, a)::Int64
└── return %2
julia> @code_warntype sum(getnonparametric, xknowns)
Variables
#self#::Core.Compiler.Const(sum, false)
f::Core.Compiler.Const(getnonparametric, false)
a::Array{TypeKnown,1}
Body::Int64
1 ─ nothing
│ %2 = Base.:(#sum#559)(Base.:(:), #self#, f, a)::Int64
└── return %2
The core of the difference can be seen when you look at native code generated in both cases:
julia> @code_native getnonparametric(xany)
.text
; ┌ @ REPL[6]:1 within `getnonparametric'
pushq %rbp
movq %rsp, %rbp
; │┌ @ Base.jl:20 within `getproperty'
subq $48, %rsp
movq (%rcx), %rax
movq %rax, -16(%rbp)
movq $75966808, -8(%rbp) # imm = 0x4872958
movabsq $jl_f_getfield, %rax
leaq -16(%rbp), %rdx
xorl %ecx, %ecx
movl $2, %r8d
callq *%rax
; │└
movq (%rax), %rax
addq $48, %rsp
popq %rbp
retq
nopl (%rax,%rax)
; └
julia> @code_native getnonparametric(xknown)
.text
; ┌ @ REPL[6]:1 within `getnonparametric'
pushq %rbp
movq %rsp, %rbp
; │┌ @ Base.jl:20 within `getproperty'
movq (%rcx), %rax
; │└
movq 8(%rax), %rax
popq %rbp
retq
nopl (%rax)
; └
If you add a parameter to the type, everything works as expected:
julia> struct Container{T}
parametric::T
nonparametric::Int64
end
julia> struct TypeAny2{T}
payload::Container{T}
end
julia> xany2 = TypeAny2(Container([1], 2))
TypeAny2{Array{Int64,1}}(Container{Array{Int64,1}}([1], 2))
julia> @code_native getnonparametric(xany2)
.text
; ┌ @ REPL[9]:1 within `getnonparametric'
pushq %rbp
movq %rsp, %rbp
; │┌ @ Base.jl:20 within `getproperty'
movq (%rcx), %rax
; │└
movq 8(%rax), %rax
popq %rbp
retq
nopl (%rax)
; └
And you have:
julia> xany2s = fill(xany2, 10^6);
julia> @btime sum(getnonparametric, $xany2s)
528.699 μs (0 allocations: 0 bytes)
2000000
Summary
1. Always try to use containers that do not have fields of abstract type if you want performance.
2. Sometimes, even if the condition in point 1 is not met, the compiler can handle it efficiently and generate fast machine code, but this is not guaranteed in general (so the recommendation in point 1 still applies).

Related

Y86 Architecture Immediate VS Register Arithmetic Efficiency Question

I am working with a team in a Computer Architecture class on a Y86 program to implement multiplication function imul. We have a block of code that works, but we are trying to make it as execution-time efficient as we can. Currently our block looks like this for imul:
imul:
# push all used registers to stack for preservation
pushq %rdi
pushq %rsi
pushq %r8
pushq %r9
pushq %r10
irmovq 0, %r9 # set 0 into r9
rrmovq %rdi, %r10 # preserve rdi in r10
subq %rsi, %rdi # compare rdi and rsi
rrmovq %r10, %rdi # restore rdi
jl continue # if rdi (looping value/count) less than rsi, don't swap
swap:
# swap rsi and rdi to make rdi smaller value of the two
rrmovq %rsi, %rdi
rrmovq %r10, %rsi
continue:
subq %r9, %rdi # check if rdi is zero
cmove %r9, %rax # if rdi = 0, rax = 0
je imulDone # if rdi = 0, jump to end
irmovq 1, %r8 # set 1 into r8
rrmovq %rsi, %rax # set rax equal to initial value from rsi
imulLoop:
subq %r8, %rdi # count - 1
je imulDone # if count = 0, jump to end
addq %rsi, %rax # add another instance of rsi into rax, looped addition
jmp imulLoop # restart loop
imulDone:
# pop all used registers from stack to original values and return
popq %r10
popq %r9
popq %r8
popq %rsi
popq %rdi
ret
Right now our best idea is to use immediate arithmetic instructions (isubq, etc.) instead of the normal OPq instructions, which require loading constants into registers first and then operating on those registers. Would this method be meaningfully more efficient in this particular instance? Thanks so much!
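When comparing variants of the routine, it can help to have a reference model to check results against. The following is a minimal C sketch of what the Y86 block above appears to compute (swap so the smaller operand is the loop count, then add repeatedly). The name imul_model is hypothetical, and the sketch assumes both operands are non-negative and small enough not to overflow; it says nothing about Y86 cycle counts.

```c
/* Reference model of the repeated-addition multiply.
   Assumption: a >= 0 and b >= 0, since the loop counts down to zero. */
long imul_model(long a, long b) {
    /* Use the smaller operand as the loop count, like the swap step. */
    long count  = (a < b) ? a : b;
    long addend = (a < b) ? b : a;

    if (count == 0) {
        return 0;               /* mirrors the early exit to imulDone */
    }

    long result = addend;       /* first instance of the addend */
    while (--count != 0) {      /* count - 1, stop when it reaches zero */
        result += addend;       /* looped addition */
    }
    return result;
}
```

Both imul_model(3, 7) and imul_model(7, 3) run the loop with a count of 3, which is the point of the swap.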

Is movzbl followed by testl faster than testb?

Consider this C code:
int f(void) {
int ret;
char carry;
__asm__(
"nop # do something that sets eax and CF"
: "=a"(ret), "=@ccc"(carry)
);
return carry ? -ret : ret;
}
When I compile it with gcc -O3, I get this:
f:
nop # do something that sets eax and CF
setc %cl
movl %eax, %edx
negl %edx
testb %cl, %cl
cmovne %edx, %eax
ret
If I change char carry to int carry, I instead get this:
f:
nop # do something that sets eax and CF
setc %cl
movl %eax, %edx
movzbl %cl, %ecx
negl %edx
testl %ecx, %ecx
cmovne %edx, %eax
ret
That change replaced testb %cl, %cl with movzbl %cl, %ecx and testl %ecx, %ecx. The program is actually equivalent, though, and GCC knows it. As evidence of this, if I compile with -Os instead of -O3, then both char carry and int carry result in the exact same assembly:
f:
nop # do something that sets eax and CF
jnc .L1
negl %eax
.L1:
ret
It seems like one of two things must be true, but I'm not sure which:
A testb is faster than a movzbl followed by a testl, so GCC's use of the latter with int is a missed optimization.
A testb is slower than a movzbl followed by a testl, so GCC's use of the former with char is a missed optimization.
My gut tells me that an extra instruction will be slower, but I also have a nagging doubt that it's preventing a partial register stall that I just don't see.
By the way, the usual recommended approach of xoring the register to zero before the setc doesn't work in my real example. You can't do it after the inline assembly runs, since xor will overwrite the carry flag, and you can't do it before the inline assembly runs, since in the real context of this code, every general-purpose call-clobbered register is already in use somehow.
There's no downside I'm aware of to reading a byte register with test vs. movzb.
If you are going to zero-extend, it's also a missed optimization not to xor-zero a reg ahead of the asm statement, and setc into that so the cost of zero-extension is off the critical path. (On CPUs other than Intel IvyBridge+ where movzx r32, r8 is not zero latency). Assuming there's a free register, of course. Recent GCC does sometimes find this zero/set-flags/setcc optimization for generating a 32-bit boolean from a flag-setting instruction, but often misses it when things get complex.
Fortunately for you, your real use-case couldn't do that optimization anyway (except with mov $0, %eax zeroing, which would be off the critical path for latency but cause a partial-register stall on Intel P6 family, and cost more code size.) But it's still a missed optimization for your test case.

How is it being specified which segment register should be used (x86)

Here's a function:
void func(char *ptr)
{
*ptr = 42;
}
Here's an output (cut) of gcc -S function.c:
func:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movb $42, (%rax)
nop
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
I can use that function as:
func(malloc(1));
or as:
char local_var;
func(&local_var);
The question is: how does the processor determine which segment register should be used to translate the effective address into a virtual one in this instruction (it could be DS as well as SS)?
movb $42, (%rax)
I have a x86_64 proc.
The default segment is DS; that’s what the processor uses in your example. (Memory operands whose base register is rbp or rsp default to SS instead.)
In 64-bit mode, it doesn’t matter what segment is used, because the segment base is always 0 and permissions are ignored. (There are one or two minor differences, which I won’t go into here.)
In 32-bit mode, most OSes set the base of all segments to 0 and set their permissions the same, so again it doesn’t matter.
In code where it does matter (especially 16 bit code that needs to use more than 64 KB of memory), the code must use far pointers, which include the segment selector as part of the pointer value. The software must load the selector into a segment register in order to perform the memory access.
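As an illustration of the 16-bit case, the sketch below models a real-mode far pointer in C. In real mode the linear address is the segment value times 16 plus the offset. The struct and function names are hypothetical, and protected-mode selectors (which index a descriptor table rather than being shifted) are not modelled here.

```c
#include <stdint.h>

/* A far pointer carries the segment alongside the 16-bit offset. */
struct far_ptr {
    uint16_t offset;
    uint16_t segment;
};

/* Real-mode address translation: linear = segment * 16 + offset. */
uint32_t real_mode_linear(struct far_ptr p) {
    return ((uint32_t)p.segment << 4) + p.offset;
}
```

For example, the far pointer B800:0000 maps to linear address 0xB8000, and two different segment:offset pairs can name the same byte.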

Segmentation fault: 11 With Array Assignment in Loop Using x86 GNU GAS Assembly

This question is similar to another question I posted here. I am attempting to write the Assembly version of the following in c/c++:
int x[10];
for (int i = 0; i < 10; i++){
x[i] = i;
}
Essentially, creating an array storing the values 0 through 9.
My current logic is to create a label that loops up to 10 (calling itself until reaching the end value). In the label, I have placed the instructions to update the array at the current index of iteration. However, after compiling with gcc filename.s and running with ./a.out, the error Segmentation fault: 11 is printed to the console. My code is below:
.data
x:.fill 10, 4
index:.int 0
end:.int 10
.text
.globl _main
_main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
jmp outer_loop
leave
ret
outer_loop:
movl index(%rip), %eax;
cmpl end(%rip), %eax
jge end_loop
lea x(%rip), %rdi;
mov index(%rip), %rsi;
movl index(%rip), %eax;
movl %eax, (%rdi, %rsi, 4)
incl index(%rip)
jmp outer_loop
leave
ret
end_loop:
leave
ret
Oddly the code below
lea x(%rip), %rdi;
mov index(%rip), %rsi;
movl index(%rip), %eax;
movl %eax, (%rdi, %rsi, 4)
works only if it is not in a label that is called repetitively. Does anyone know how I can implement the code above in a loop, without Segmentation fault: 11 being raised? I am using x86 Assembly on MacOS with GNU GAS syntax compiled with gcc.
Please note that this question is not a duplicate of this question as different Assembly syntax is being used and the scope of the problem is different.
You're using a 64-bit instruction to access a 32-bit area of memory:
mov index(%rip), %rsi;
This results in %rsi being assigned the contents of memory starting from index and ending at end (I'm assuming no alignment, though I don't remember GAS's rules regarding it). Thus, %rsi effectively is assigned the value 0xa00000000 (assuming first iteration of the loop), and executing the following movl %eax, (%rdi, %rsi, 4) results in the CPU trying to access the address that's not mapped by your process.
The solution is to remove the assignment, and replace the line after it with movl index(%rip), %esi. 32-bit operations are guaranteed to always clear out the upper bits of 64-bit registers, so you can then safely use %rsi in the address calculation, as it's going to contain the current index and nothing more.
Your debugger would've told you this, so please do use it next time.
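The effect of the over-wide load can be reproduced in C. This is only a model, assuming a little-endian machine (as x86 is) and that the two 32-bit variables sit next to each other in memory like index and end in the .data section; the struct and function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Two adjacent 32-bit variables, laid out like `index` and `end`. */
struct adjacent_ints {
    int32_t index;
    int32_t end;
};

/* Model of `mov index(%rip), %rsi`: an 8-byte load starting at index,
   which also picks up the 4 bytes of end. */
uint64_t load64_from_index(const struct adjacent_ints *vars) {
    uint64_t wide;
    memcpy(&wide, vars, sizeof wide);
    return wide;
}

/* Model of `movl index(%rip), %esi`: a 4-byte load whose result is
   zero-extended to 64 bits. */
uint64_t load32_from_index(const struct adjacent_ints *vars) {
    return (uint32_t)vars->index;
}
```

With index = 0 and end = 10, the 8-byte load yields 0xa00000000 on a little-endian machine, while the 4-byte load yields 0.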

xorl %eax, %eax in x86_64 assembly code produced by gcc

I'm a total noob at assembly, just poking around a bit to see what's going on. Anyway, I wrote a very simple function:
void multA(double *x,long size)
{
long i;
for(i=0; i<size; ++i){
x[i] = 2.4*x[i];
}
}
I compiled it with:
gcc -S -m64 -O2 fun.c
And I get this:
.file "fun.c"
.text
.p2align 4,,15
.globl multA
.type multA, @function
multA:
.LFB34:
.cfi_startproc
testq %rsi, %rsi
jle .L1
movsd .LC0(%rip), %xmm1
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
movsd (%rdi,%rax,8), %xmm0
mulsd %xmm1, %xmm0
movsd %xmm0, (%rdi,%rax,8)
addq $1, %rax
cmpq %rsi, %rax
jne .L3
.L1:
rep
ret
.cfi_endproc
.LFE34:
.size multA, .-multA
.section .rodata.cst8,"aM",@progbits,8
.align 8
.LC0:
.long 858993459
.long 1073951539
.ident "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
.section .note.GNU-stack,"",@progbits
The assembly output makes sense to me (mostly) except for the line xorl %eax, %eax. From googling, I gather that the purpose of this is simply to set %eax to zero, which in this case corresponds to my iterator long i;.
However, unless I am mistaken, %eax is a 32-bit register. So it seems to me that this should actually be xorq %rax, %rax, particularly since this is holding a 64-bit long int. Moreover, further down in the code, it actually uses the 64-bit register %rax to do the iterating, which never gets initialized outside of xorl %eax, %eax, which would seem to only zero out the lower 32 bits of the register.
Am I missing something?
Also, out of curiosity, why are there two .long constants there at the bottom? The first one, 858993459 is equal to the double floating-point representation of 2.4 but I can't figure out what the second number is or why it is there.
I gather that the purpose of this is simply to set %eax to zero
Yes.
which in this case corresponds to my iterator long i;.
No. Your i is uninitialized in the declaration. Strictly speaking, that operation corresponds to the i = 0 expression in the for loop.
However, unless I am mistaken, %eax is a 32-bit register. So it seems to me that this should actually be xorq %rax, %rax, particularly since this is holding a 64-bit long int.
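But writing to the lower double word of a register clears the upper half as well: on x86-64, any instruction that writes a 32-bit register zero-extends the result into the full 64-bit register. So xorl %eax, %eax zeroes all of %rax. This is not intuitive, but it is how the architecture is defined.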
Just to answer the second part: .long means 32 bit, and the two integral constants side-by-side form the IEEE-754 representation of the double 2.4:
Dec: 1073951539  858993459
Hex: 0x40033333  0x33333333
     400 3333333333333
     S+E Mantissa
The exponent is offset by 1023, so the actual exponent is 0x400 − 1023 = 1. The leading "one" in the mantissa is implied, so it's 2^1 × 0b1.001100110011... (You recognize this periodic expansion as 3/15, i.e. 0.2. Sure enough, 2 × 1.2 = 2.4.)
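The decomposition can be checked with a short C sketch that glues the two .long words back together. The function names are hypothetical, and the code assumes an IEEE-754 binary64 double with the same byte order as 64-bit integers, which holds on x86-64.

```c
#include <stdint.h>
#include <string.h>

/* Rebuild a double from the two 32-bit words emitted with .long.
   The first .long is the low word, the second the high word. */
double double_from_words(uint32_t low_word, uint32_t high_word) {
    uint64_t bits = ((uint64_t)high_word << 32) | low_word;
    double value;
    memcpy(&value, &bits, sizeof value);  /* reinterpret the bit pattern */
    return value;
}

/* Unbiased exponent: the 11 exponent bits sit in bits 30..20 of the
   high word; subtract the bias of 1023. */
int unbiased_exponent(uint32_t high_word) {
    return (int)((high_word >> 20) & 0x7FF) - 1023;
}

/* Top 20 bits of the 52-bit mantissa field, stored in the high word. */
uint32_t mantissa_high_bits(uint32_t high_word) {
    return high_word & 0xFFFFF;
}
```

Calling double_from_words(858993459u, 1073951539u) gives 2.4, and unbiased_exponent(1073951539u) gives 1, matching the breakdown above.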
