Are tail-recursive functions ALWAYS to be avoided? - performance

If I recall correctly, tail recursive functions always have an easy non-recursive equivalent.
Since recursion involves unnecessary function call overhead, it's better to do it the non-recursive way.
Is this assumption always true? Are there any other arguments for/against tail-recursion?

If you are using a language with a good compiler then these types of recursion can be optimised away, so in those cases if it improves readability to use recursion, I'd say to stick with it.

No, it's not always true. Many languages and/or compilers can easily optimize a tail-recursive call and rewrite it as an iterative version, or otherwise reuse the stack frame for subsequent calls.
The Scheme language mandates that implementations perform tail call optimization.
gcc can optimize tail calls as well. Consider a function for freeing all the nodes in a linked list:
void free_all(struct node *n)
{
    if (n != NULL) {
        struct node *next = n->next;
        free(n);
        free_all(next);
    }
}
compiles to, with optimization:
free_all:
pushl %ebp
movl %esp, %ebp
pushl %ebx
subl $20, %esp
movl 8(%ebp), %eax
testl %eax, %eax
je .L4
.p2align 4,,7
.p2align 3
.L5:
movl 4(%eax), %ebx
movl %eax, (%esp)
call free
testl %ebx, %ebx
movl %ebx, %eax
jne .L5
.L4:
addl $20, %esp
popl %ebx
popl %ebp
ret
That is, a simple jump instead of recursively calling free_all.
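To make the jump concrete, here is a minimal sketch of the loop one might write by hand next to the recursive version. The struct node layout (a value field plus next), the free_all_iter name and its returned count, and the make_list helper are assumptions for illustration; none of them appear in the original answer.

```c
#include <stdlib.h>

/* Assumed node layout; the answer only shows that a `next` field exists. */
struct node {
    int value;
    struct node *next;
};

/* Tail-recursive version, as in the answer above. */
void free_all(struct node *n)
{
    if (n != NULL) {
        struct node *next = n->next;
        free(n);
        free_all(next);   /* tail call: nothing is left to do afterwards */
    }
}

/* Hand-written loop with the same effect. It returns the number of nodes
   freed, only so that the sketch can be checked. */
size_t free_all_iter(struct node *n)
{
    size_t freed = 0;
    while (n != NULL) {
        struct node *next = n->next;  /* read next before freeing n */
        free(n);
        n = next;
        freed++;
    }
    return freed;
}

/* Hypothetical helper: builds a list of `count` nodes, or returns NULL
   if an allocation fails (or if count is 0). */
struct node *make_list(size_t count)
{
    struct node *head = NULL;
    for (size_t i = 0; i < count; i++) {
        struct node *n = malloc(sizeof *n);
        if (n == NULL) {
            free_all(head);
            return NULL;
        }
        n->value = (int)i;
        n->next = head;
        head = n;
    }
    return head;
}
```

Calling free_all_iter(make_list(5)) should free five nodes. Whether a given compiler turns free_all into the same loop depends on the optimization level, so it is worth checking the generated assembly rather than assuming it.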

No.
Go for readability. Many computations are better expressed as recursive (tail or otherwise) functions. The only other reason to avoid them would be if your compiler does not do tail call optimizations and you expect you might blow the call stack.
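As a rough illustration of "tail or otherwise", the sketch below writes the same computation three ways. The function names and the choice of summing 1..n are assumptions made for the example, not something from the answers.

```c
/* Not a tail call: the addition happens after the recursive call returns,
   so, unless the compiler rewrites it, every level keeps a stack frame. */
unsigned long sum_to(unsigned long n)
{
    if (n == 0)
        return 0;
    return n + sum_to(n - 1);
}

/* Tail-recursive: the running total travels in an accumulator argument,
   and the recursive call is the last thing the function does. */
unsigned long sum_to_acc(unsigned long n, unsigned long acc)
{
    if (n == 0)
        return acc;
    return sum_to_acc(n - 1, acc + n);
}

/* The loop that the tail-recursive form maps onto directly:
   the parameters become loop variables. */
unsigned long sum_to_loop(unsigned long n)
{
    unsigned long acc = 0;
    while (n != 0) {
        acc += n;
        n--;
    }
    return acc;
}
```

All three return the same value for the same n; only the stack behaviour without tail call optimization differs.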

It depends on the language, but often the overhead isn't that big. It may be subjective, but recursive functions tend to be much easier to comprehend. Most of the time you wouldn't notice the performance difference.
I would go for tail recursion unless my platform was very bad at dealing with it (i.e. not optimizing it at all, but always pushing onto the stack).


Offset before square bracket in x86 intel asm on GCC

From all the docs I've found, there is no mention of syntax like offset[var+offset2] in Intel x86 syntax but GCC with the following flags
gcc -S hello.c -o - -masm=intel
for this program
#include <stdio.h>
int main(){
    char c = 'h';
    putchar(c);
    return 0;
}
produces
.file "hello.c"
.intel_syntax noprefix
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
push rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
mov rbp, rsp
.cfi_def_cfa_register 6
sub rsp, 16
mov BYTE PTR -1[rbp], 104
movsx eax, BYTE PTR -1[rbp]
mov edi, eax
call putchar@PLT
mov eax, 0
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Arch Linux 9.3.0-1) 9.3.0"
.section .note.GNU-stack,"",@progbits
I'd like to highlight the line mov BYTE PTR -1[rbp], 104 where the offset -1 appears outside the square brackets. TBH, I'm just guessing that it is an offset. Can anyone direct me to proper documentation covering this?
Here is a similar question: Squared Brackets in x86 asm from IDA where a comment does mention that it is an offset but I'd really like a proper documentation reference.
Yes, it's just another way of writing [rbp - 1], and the -1 is a displacement in technical x86 addressing mode terminology (see footnote 1).
The GAS manual's section on x86 addressing modes only mentions the [ebp - 4] possibility, not -4[ebp], but GAS does assemble it.
And disassembly in AT&T or Intel syntax confirms what it meant. x86 addressing modes are constrained by what the machine can encode (see "Referencing the contents of a memory location. (x86 addressing modes)"), so there isn't a lot of wiggle room on what some syntax might mean. (This syntax was emitted by GCC, so we can safely assume that it's valid, and that it means the same thing as the -1(%rbp) it emits in AT&T syntax mode.)
Footnote 1: The whole rbp-1 effective address is the offset part of a seg:off address. The segment base is fixed at 0 in 64-bit mode, except for FS and GS, and even in 32-bit mode mainstream OSes use a flat memory model, so you can ignore the segment base. I point this out only because "offset" in x86 terminology does have a specific technical meaning separate from "displacement", in case you care about using terminology that matches Intel's manuals.
For some reason GCC's choice of syntax depends on -fno-pie or not. https://godbolt.org/z/iK9jh6 (On modern GNU/Linux distros like your Arch system, -fpie is enabled by default. On Godbolt it isn't).
This choice continues with optimization enabled, if you use volatile to force the stack variable to be written, or do other stuff with pointers: e.g. https://godbolt.org/z/4P92Fk. It applies to arbitrary dereferences like ptr[1 + x] from function args.
GCC -fno-pie chooses [rbp - 1] and [rdi+4+rsi*4]
GCC -fpie chooses -1[rbp] and 4[rdi+rsi*4]
IDK why GCC's internals choose differently based on PIE mode. No obvious reason; perhaps for some reason they just use different code paths in GCC's internals, or different format strings and they just happen to make different choices.
Both with and without PIE, a global (static storage) is referenced as glob[rip], not [RIP + glob] which is also supported. In both cases that means glob with respect to RIP, not actually RIP + absolute address of the symbol. But that's an exception to the rule that applies for any other register, or for no register.
GAS .intel_syntax is MASM-like, and MASM certainly does support symbol[register] and I think even 1234[register]. It's more normal for the displacement.
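Both spellings name the same base + index*scale + displacement computation, which a small C sketch can spell out. The function names effective_address and load_elem are assumptions for illustration; the ptr[1 + x] access is the kind of dereference mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* base + index*scale + displacement: the general form of an x86 effective
   address. [rdi+4+rsi*4] and 4[rdi+rsi*4] both describe this computation
   with base = rdi, index = rsi, scale = 4, disp = 4. */
uintptr_t effective_address(uintptr_t base, uintptr_t index,
                            unsigned scale, intptr_t disp)
{
    /* Unsigned arithmetic wraps, so a negative displacement such as the
       -1 in -1[rbp] simply subtracts from the base. */
    return base + index * scale + (uintptr_t)disp;
}

/* An access like ptr[1 + x]: with 4-byte int, the address is
   ptr + 4 + x*4, so the constant part can become the displacement. */
int load_elem(const int *ptr, size_t x)
{
    return ptr[1 + x];
}
```

With base 1000, index 3, scale 4 and displacement 4 the result is 1016, and a displacement of -1 with no index yields base - 1, matching the [rbp - 1] reading.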

Compilation of IORef and STRef

To measure the performance of those Refs, I dumped the assembly produced by GHC on the following code :
import Data.IORef
main = do
  r <- newIORef 18
  v <- readIORef r
  print v
I expected the IORef to be completely optimized away, leaving only a syscall to write stdout with string "18". Instead I get 250 lines of assembly. Do you know how many will actually be executed ? Here is what I think is the heart of the program :
.globl Main.main1_info
Main.main1_info:
_c1Zi:
leaq -8(%rbp),%rax
cmpq %r15,%rax
jb _c1Zj
_c1Zk:
movq $block_c1Z9_info,-8(%rbp)
movl $Main.main2_closure+1,%ebx
addq $-8,%rbp
jmp stg_newMutVar#
_c1Zn:
movq $24,904(%r13)
jmp stg_gc_unpt_r1
.align 8
.long S1Zo_srt-(block_c1Z9_info)+0
.long 0
.quad 0
.quad 30064771104
block_c1Z9_info:
_c1Z9:
addq $24,%r12
cmpq 856(%r13),%r12
ja _c1Zn
_c1Zm:
movq 8(%rbx),%rax
movq $sat_s1Z2_info,-16(%r12)
movq %rax,(%r12)
movl $GHC.Types.True_closure+2,%edi
leaq -16(%r12),%rsi
movl $GHC.IO.Handle.FD.stdout_closure,%r14d
addq $8,%rbp
jmp GHC.IO.Handle.Text.hPutStr2_info
_c1Zj:
movl $Main.main1_closure,%ebx
jmp *-8(%r13)
I am concerned about this jmp stg_newMutVar#. It is nowhere else in the assembly, so maybe GHC resolves it at a later linking stage. But why is it even here and what does it do ? Can I dump the final assembly without any unresolved haskell symbols ?
Starting with a couple of links:
The MutVar object definition.
The cmm code for newMutVar.
A non-comprehensive but helpful summary of GHC object layout.
The cmm and C sources aren't particularly readable if you're not already familiar with the macros and primops. Unfortunately, I don't know of a good way to view the assembly generated for cmm primops, short of looking into an executable with objdump or some other disassembler.
Still, I can summarize the runtime semantics of IORef.
IORef is a wrapper around MutVar# from GHC.Prim. As the doc says, MutVar# is like a single-element mutable array. It takes up two machine words, the first is the header, the second is the stored value (which is a pointer to a GHC object). A value of MutVar# is itself a pointer to this two-word object.
MutVar-s differ from normal immutable objects most notably by participating in a write barrier mechanism. GHC has generational garbage collection, so any MutVar that lives in an older generation must be also a GC root when collecting the younger generations, since mutating a MutVar may cause younger objects to become reachable. Therefore, whenever a MutVar is promoted from generation 0 (the youngest), it is added to a so-called "mutable list" that contains references to all such mutable objects. The mutable list gets rebuilt during GC of old generations. In short, MutVar-s in old generations are always present on the mutable list.
This is a rather simplistic way of dealing with mutable variables, and if we have large numbers of them in old generations, minor garbage collection slows down because of the bloated mutable list, and as a result the entire program slows down.
Since mutable variables aren't used prominently in production code, there hasn't been much demand or pressure for optimizing the RTS for their heavy usage.
If you need a large number of mutable variables, you should instead use a single mutable boxed array, because that's only a single reference on the mutable list and also has a bitmap-based optimization for GC traversal of elements that might have been mutated.
Also, as you see, newMutVar# is only statically linked but not inlined, although it's a rather small chunk of code. As a result, it's also not optimized away. This is again broadly because of the lack of effort and attention for optimizing mutating code. By contrast, allocating and copying small known-sized primitive arrays is currently inlined and greatly optimized, because Johan Tibell, who did a large amount of work implementing the unordered-containers library, made it so (in order to make unordered-containers faster).
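For readers who think in C, the two-word layout described above can be modelled roughly as follows. This is only a sketch: the names mut_var, header, value and the helper functions are hypothetical and are not taken from GHC's headers, and the write barrier and mutable list are deliberately left out.

```c
#include <stdlib.h>

/* Hypothetical model of the two-word object: a header word followed by a
   pointer to the stored value. */
typedef struct {
    const void *header;  /* word 1: object header */
    void       *value;   /* word 2: pointer to the stored object */
} mut_var;

/* Placeholder for the header word to point at in this sketch. */
static const char mut_var_header_tag = 0;

/* Rough analogue of newIORef: allocate the two words and fill them in. */
mut_var *mut_var_new(void *initial)
{
    mut_var *v = malloc(sizeof *v);
    if (v != NULL) {
        v->header = &mut_var_header_tag;
        v->value = initial;
    }
    return v;
}

/* Rough analogue of readIORef. */
void *mut_var_read(const mut_var *v)
{
    return v->value;
}

/* Rough analogue of writeIORef. A real runtime would also run its write
   barrier here; this sketch omits that entirely. */
void mut_var_write(mut_var *v, void *new_value)
{
    v->value = new_value;
}

/* Mirrors the question's program: create a reference holding 18 and read
   it back. Returns -1 if the allocation fails. */
int mut_var_demo(void)
{
    static int eighteen = 18;
    mut_var *r = mut_var_new(&eighteen);
    if (r == NULL)
        return -1;
    int v = *(int *)mut_var_read(r);
    free(r);
    return v;
}
```

The point of the model is only that a reference is a pointer to a small heap object, so creating one involves an allocation, which is what the jump to stg_newMutVar# performs.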

Assembly registers [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
My question WAS about getting as much info as I could about registers...No luck :/
Everyone got everything so wrong [Probably because English is not my native language].
So, the question will be more general... ;(
I need a tutorial with the BASICS!
Ah...Could I be more not-specific?
Also, thanks for the help in advance!
In general you can use any of eax, ebx, ecx, edx, esi and edi pretty much as you want. They can each hold any 32-bit value.
Keep in mind that any Win32 API function you call is free to modify eax, ecx and edx. So if you need to preserve the values of those registers across a function call, you'll have to save them somewhere temporarily (e.g. on the stack).
Similarly, if you write a function that is to be called by another function (e.g. a Windows callback), you should preserve ebx, esi, edi and ebp within that function.
Some instructions are hardcoded to use certain registers. For example, the loop instruction uses (e)cx, the string instructions use esi/edi, the div instruction uses eax/edx, etc. You can find all such cases by going through the descriptions for all the instructions in Intel's manual.
The "fixed uses" of the registers derive from the ancient roots back in the 8086 days (and in some ways, even from before that).
The 8086 was an accumulator machine: you were supposed to do math mostly with ax (there was no eax yet), and a bit with dx. You can still see this in many instructions. For example, most ALU ops have a smaller form for op ax, imm (also op al, imm) than for op other, imm, and the ancient decimal math instructions (daa and friends) operate only on al. There are instructions that always reference (e)ax, and maybe (e)dx as the "high half"; see the "old multiplication" (with the single explicit operand). imul with an immediate was added in the 80186, and imul reg, r/m was added in the 80386, which added a whole lot of stuff including 32bit mode. With 32bit mode also came the modern ModRM/SIB structure; here are the old 16bit version and the modern 32/64bit version. In the old version, there are only 4 registers that could ever be used in a memory operand, so there's a bit of the "fixed roles for registers" again. 32bit mode mostly removed that, except that esp can never be the index register (that wouldn't normally make sense anyway).
More recently, Haswell introduced shlx, which removes the restriction that shifting by a variable amount could only be done using cl as the count, and mulx, which partially removed the fixed roles of registers for "wide multiplication" (80186 and 80386 only added the "general" forms for multiplication without the high half). mulx still gives edx a fixed role though.
More strangely, the relatively recently added pblendvb assigned a fixed role to xmm0. Previous to that, the vector registers weren't encumbered by such old-fashioned restrictions. That fixed role disappeared with AVX though, which allowed the extra operand to be encoded. pcmpistri and friends still assign a fixed role to ecx though.
With x64 came a change to 8 bit register operands: if a REX prefix is present it is now possible to use spl, bpl, sil and dil, previously unencodable, but at the cost of no longer being able to address ah, ch, dh or bh in that instruction. That's probably a symptom of moving away from special roles too, since previously it wouldn't have made much sense to be able to use bpl, but now that it's "more general purpose" it might have some uses (it's still often used as a base pointer though).
The general pattern is towards fewer restrictions/fixed roles. But much of the history of x86 is still visible today.
As a general comment, before you go much further, I recommend adopting a programming style, or you'll find it very hard to follow your own code. Below is a formatted example of your code. Maybe not everything is correctly formatted, but it gives you an idea. Once in the habit, it's easier than writing higgledy-piggledy code. One of its main advantages is that, with practice, you can cast your eye down the code and follow it far quicker than if you have to read every line.
.386
.model flat, stdcall
option casemap :none

include     \masm32\include\windows.inc
include     \masm32\include\kernel32.inc
include     \masm32\include\masm32.inc
includelib  \masm32\lib\kernel32.lib
includelib  \masm32\lib\masm32.lib

.data
ProgramText db      "Hello World!", 0
BadText     db      "Error: Sum is incorrect value", 0
GoodText    db      "Excellent! Sum is 6", 0
Sum         sdword  0

.code
start:
                                ; eax
        mov     ecx, 6          ; set the counter to 6 ?
        xor     eax, eax        ; set eax to 0
_label:
        add     eax, ecx        ; add the numbers ?
        dec     ecx             ; from 0 to 6 ?
        jnz     _label          ; 21
        mov     edx, 7          ; 21
        mul     edx             ; multiply by 7 147
        push    eax             ; pushes eax into the stack
        pop     Sum             ; pops eax and places it in Sum
        cmp     Sum, 147        ; compares Sum to 147
        jz      _good           ; if they are equal, go to _good
_bad:
        invoke  StdOut, addr BadText
        jmp     _quit
_good:
        invoke  StdOut, addr GoodText
_quit:
        invoke  ExitProcess, 0
end start
I'll single out one line:
push eax ; pushes eax into the stack
Don't use comments to explain what an instruction does: use them to say what you are trying to achieve, or what the register represents, to give added value to the code.
Good luck to you: plenty of practice and midnight oil!
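To double-check the arithmetic in the listing (21, then 147), the loop can be mirrored in C. This is only a sketch: the function name countdown_sum_times and its parameters are made up for the example, and the comments try to follow the advice above by stating intent and noting which register each variable stands for.

```c
#include <assert.h>

/* Sums start + (start - 1) + ... + 1, then scales the total by factor.
   With start = 6 and factor = 7 this follows the assembly loop above. */
unsigned countdown_sum_times(unsigned start, unsigned factor)
{
    /* As in the assembly, a start of 0 would wrap around on the first
       decrement, so it is rejected here. */
    assert(start != 0);

    unsigned counter = start;   /* plays the role of ecx */
    unsigned total = 0;         /* plays the role of eax */

    do {
        total += counter;       /* accumulate the current counter value */
        counter--;              /* count down towards zero */
    } while (counter != 0);     /* same exit condition as jnz */

    return total * factor;      /* low 32 bits of the product, as in eax */
}
```

For start = 6 the total is 6 + 5 + 4 + 3 + 2 + 1 = 21, and 21 * 7 = 147, which is the value the listing compares Sum against.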

Why does GCC emit "lea" instead of "sub" for subtraction?

I am looking at some assembly that was generated by disassembling some C programs and I am confused by a single optimization that I see repeated frequently.
When I have no optimizations on, the GCC compiler uses the subl instruction for subtraction, but when I do have optimizations turned on (-O3 to be precise), the compiler uses a leal instruction instead of subtraction. Example below:
without optimizations:
83 e8 01 subl $0x1, %eax
with optimizations:
8d 6f ff leal -0x1(%edi), %ebp
Both of these instructions are 3 bytes long, so I am not seeing an optimization here. Could someone help me out and try to explain the compiler's choice ?
Any help would be appreciated.
It's hard to tell without seeing the original C code that produces this.
But if I had to guess, it's because the leal allows the subtraction to be done out-of-place without destroying the source register.
This can save an extra register move.
The first example:
83 e8 01 subl $0x1, %eax
overwrites %eax thereby destroying the original value.
The second example :
8d 6f ff leal -0x1(%edi), %ebp
stores %edi - 1 into %ebp. %edi is preserved for future use.
Keep in mind also that lea does not affect the flags whereas sub does. So if the ensuing instructions do not depend on the flags being updated by the subtraction then not updating the flags will be more efficient as well.
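Since the original C source was never shown, the following is only a guess at the shape of code involved: a function where both x and x - 1 are still needed after the subtraction. The name product_with_predecessor is made up for the example.

```c
/* Both x and x - 1 stay live until the multiplication, so an optimizing
   compiler has a reason to compute x - 1 into a fresh register (which lea
   can do) rather than overwrite the register holding x (as sub would). */
int product_with_predecessor(int x)
{
    int prev = x - 1;
    return prev * x;
}
```

Whether a particular GCC version and flag set actually emits lea for this function is something to confirm by compiling with -S; the sketch only shows why a non-destructive subtraction can save a register move.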

Integer Overflow IA 32

How does overflow work in ia-32?
For instance, what would happen to the following code? What flags would it throw?
movl $0x1, %eax
addl $0x7fffffff, %eax
Thanks!
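If memory serves, addition sets the overflow flag when both operands have the same sign bit and the result's sign bit differs from it (equivalently, when the carry into the sign bit differs from the carry out of it). 1 + 0x7FFFFFFF gives 0x80000000, so it would set overflow (and sign), clear carry, and clear zero.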
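The flag results for a 32-bit add can be modelled in portable C, which makes it easy to try other operand pairs. This is a sketch of the arithmetic only, not of how the CPU computes the flags; the struct and function names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Result of a 32-bit addition together with the four flags discussed. */
struct add_flags {
    uint32_t result;
    bool cf;  /* carry: unsigned result did not fit in 32 bits */
    bool of;  /* overflow: signed result did not fit in 32 bits */
    bool zf;  /* zero: result is 0 */
    bool sf;  /* sign: top bit of the result */
};

struct add_flags add32_flags(uint32_t a, uint32_t b)
{
    struct add_flags f;

    f.result = a + b;        /* unsigned addition wraps modulo 2^32 */
    f.cf = f.result < a;     /* wrapped around, so there was a carry out */

    /* Signed overflow: a and b have the same sign bit, and the result's
       sign bit differs from it. */
    f.of = (((uint32_t)(~(a ^ b) & (a ^ f.result))) >> 31) != 0;

    f.zf = (f.result == 0);
    f.sf = (f.result >> 31) != 0;
    return f;
}
```

For the question's operands, add32_flags(1, 0x7fffffff) yields result 0x80000000 with OF and SF set and CF and ZF clear. A pair such as 0x80000000 + 0x80000000 sets CF, OF and ZF together, which shows that carry and overflow are independent.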
