What does "a GP/function address pair" mean in IA-64? - abi

What does "a GP/function address pair" mean in Itanium C++ ABI? What does GP stand for?

Short explanation: gp is, for all practical means, a hidden parameter to all functions that comply with the Itanium ABI. It's a kind of this pointer to the global variables the function uses. As far as I know, no mainstream OS does it anymore.
GP stands for "globals pointer". It's a base address for data statically allocated by executables, and the Itanium architecture has a register just for it.
For instance, if you had these global variables and this function in your program:
int foo;
int bar;
int baz;
int func()
{
foo++;
bar += foo;
baz *= bar / foo;
return foo + bar + baz;
}
The gp/function pair would conceptually be &foo, &func. The code generated for func would refer to gp to find where the globals are located. The compiler knows foo can be found at gp, bar can be found at gp + 4 and baz can be found at gp + 8.
Assuming func is defined in an external library, if you call it from your program, the compiler will use a sequence of instructions like this one:
save current gp value to the stack;
load code address from the pair for func into some register;
load gp value from same pair into GP;
perform indirect call to the register where we stored the code address;
restore old gp value that we saved on the stack before, resume calling function.
This makes executables fully position-independent since they don't ever store absolute addresses to data symbols, and therefore makes it possible to maintain only one instance of any executable file in memory, no matter how many processes use it (you could even load the same executable multiple times within a single process and still only have one copy of the executable code systemwide), at the cost of making function pointers a little weird. With the Itanium ABI, a function pointer is not a code address (like it is with "regular" x86 ABIs): it's an address to a gp value and a code address, since that code address might not be worth much if it can't access its global variables, just like a method might not be able to do much if it doesn't have a this pointer.
The only other ABI I know that uses this concept was the Mac OS Classic PowerPC ABI. They called those pairs "transition vectors".
Since x86_64 supports RIP-relative addressing (x86 did not have an equivalent EIP-relative addressing), it's now pretty easy to create position-independent code without having to use an additional register or having to use "enhanced" function pointers. Code and data just have to be kept at constant offsets. Therefore, this part of the Itanium ABI is probably gone for good on Intel platforms.
From the Itanium Register Conventions:
8.2 The gp Register
Every procedure that references statically-allocated data or calls another procedure requires a pointer to its data segment in the gp register, so that it can access its static data and its linkage tables. Each load module has its own data segment, and the gp register must be set correctly prior to calling any entry point within that load module.
The linkage conventions require that each load module define exactly one gp value to refer to a location within its short data segment. It is expected that this location will be chosen to maximize the usefulness of short-displacement immediate instructions for addressing scalars and linkage table entries. The DLL loader will determine the absolute value of the gp register for each load module after loading its data segment into memory.
For calls within a load module, the gp register will remain unchanged, so calls known to be local can be optimized accordingly.
For calls between load modules, the gp register must be initialized with the correct gp value for the new load module, and the calling function must ensure that its own gp value is saved and restored.

Just a comment about this quote from the other answer:
It is expected that this location will be chosen to maximize the usefulness of short-displacement immediate instructions for addressing scalars and linkage table entries.
What this is talking about: Itanium has three different ways to put a value into a register (where 'immediate' here means 'offset from the base'). You can support a full 64 bit offset from anywhere, but it takes two instructions:
// r34 has base address
movl r33 = <my immediate>
;;
add r35 = r34, r35
;;
Not only does that take 2 separate clocks, it takes 3 instruction slots across 2 bundles to make that happen.
There are two shorter versions: add14 (also adds) and add22 (also addl). The difference was in the immediate size each could handle. Each took a single 'A' slot iirc, and completed in a single clock.
add14 could use any register as the source & target, but could only handle up to 14 bit immediates.
add22 could use any register as the target, but for source, only two bits were allocated. So you could only use r0, r1, r2, r3 as the source regs. r0 is not a real register - it's hardwired to 0. But using one of the other 3 as a local stack registers, means you can address 256 times the memory using simple offsets, compared to using the local stack registers. Therefore, if you put your global base address into r1 (the convention), you could access that much more local offsets before having to do a separate movl and/or modifying gp for the next section of code.

Related

Documentation for MIPS predefined macros

When I compile a C code using GCC to MIPS, it contains code like:
daddiu $28,$28,%lo(%neg(%gp_rel(f)))
And I have trouble understanding instructions starting with %.
I found that they are called macros and predefined macros are dependent on the assembler but I couldn't find description of the macros (as %lo, %neg etc.) in the documentation of gas.
So does there exist any official documentation that explains macros used by GCC when generating MIPS code?
EDIT: The snippet of the code comes from this code.
This is a very odd instruction to find in compiled C code, since this instruction is not just using $28/$gp as a source but also updating that register, which the compiler shouldn't be doing, I would think.  That register is the global data pointer, which is setup on program start, and used by all code accessing near global variables, so it shouldn't ever change once established.  (Share a godbolt.org example, if you would.)
The functions you're referring to are for composing the address of labels that are located in global data.  Unlike x86, MIPS cannot load (or otherwise have) a 32-bit immediate in one instruction, and so it uses multiple instructions to do work with 32-bit immediates including address immediates.  A 32-bit immediate is subdivided into 2 parts — the top 16-bits are loaded using an LUI and the bottom 16-bits using an ADDI (or LW/SW instruction), forming a 2 instruction sequence.
MARS does not support these built-in functions.  Instead, it uses the pseudo instruction, la $reg, label, which is expanded by the assembler into such a sequence.  MARS also allows lw $reg, label to directly access the value of a global variable, however, that also expands to multiple instruction sequence (sometimes 3 instructions of which only 2 are really necessary..).
%lo computes the low 16-bits of a 32-bit address for the label of the argument to the "function".  %hi computes the upper 16-bits of same, and would be used with LUI.  Fundamentally, I would look at these "functions" as being a syntax for the assembly author to communicate to the assembler to share certain relocation information/requirements to the linker.  (In reverse, a disassembler may read relocation information and determine usage of %lo or %hi, and reflect that in the disassembly.)
I don't know %neg() or %gp_rel(), though could guess that %neg negates and %gp_rel produces the $28/$gp relative value of the label.
%lo and %hi are a bit odd in that the value of the high immediate sometimes is offset by +1 — this is done when the low 16-bits will appear negative.  ADDI and LW/SW will sign extend, which will add -1 to the upper 16-bits loaded via LUI, so %hi offsets its value by +1 to compensate when that happens.  This is part of the linker's operation since it knows the full 32-bit address of the label.
That generated code is super weird, and completely different from that generated by the same compiler, but 32-bit version.  I added the option -msym32 and then the generated code looks like I would expect.
So, this has something to do with the large(?) memory model on MIPS 64, using a multiple instruction sequence to locate and invoke g, and swapping the $28/$gp register as part of the call.  Register $25/$t9 is somehow also involved as the generated code sources it without defining it; later, prior to where we would expect the call it sets $25.
One thing I particularly don't understand, though, is where is the actual function invocation in that sequence!  I would have expected a jalr instruction, if it's using an indirect branch because it doesn't know where g is (except as data), but there's virtually nothing but loads and stores.
There are two additional oddities in the output: one is the blank line near where the actual invocation should be (maybe those are normal, but usually don't see those inside a function) and the other is a nop that is unnecessary but might have been intended for use in the delay slot following an invocation instruction.

Registers usage during compilation

I found information that general purpose registers r1-r23 and r26-r28 are used by the compiler to store local variables, but do they have any other purpose? Also which memory are this registers part of(cache/RAM)?
Finally what does global pointer gp in register r26 points to?
Also which memory are this registers part of(cache/RAM)?
Register are on-processors storage allowing a fast data transfer (2 reads/1 write per cycle). They store variables that can represent memory addresses, but, besides that, are completely unrelated to memory or cache.
I found information that general purpose registers r1-r23 and r26-r28 are used by the compiler to store local variables, but do they have any other purpose?
Registers are use with respect to hardware or software conventions. Hardware conventions are related to the instruction set architecture. For instance, the call instruction transfers control to a subroutine and stores return address in register r31 (ra). Very nasty things are likely to happen if you overwrite r31 register by any mean without precautions. Software conventions are supposed to insure a proper behavior if used consistently within software. They indicate which register have special use, which need to be saved when context switching, etc. These conventions can be changed without hardware modifications, but doing so will probably require changes in several software tools (compiler, linker, loader, OS, ...).
general purpose registers r1-r23 and r26-r28 are used by the compiler to store local variables
Actually, some registers are reserved.
r1 is used by asm for macro expansion. (sw)
r2-r7 are used by the compiler to pass arguments to functions or get return values. (sw)
r24-r25 can only be used by exception handlers. (sw)
r26-r28 hold different pointers (global, stack, frame) that are set either by the runtime or the compiler and cannot be modified by the programmer.(sw)
r29-r31 are hw coded returns addresses for subprograms or interrupts/exceptions. (hw)
So only r8-r23 can used by the compiler.
but do they have any other purpose?
No, and that's why they can be freely used by the compiler or programmer.
Finally what does global pointer in register r26 points to?
Accessing memory with load or stores have a based memory addressing. Effective address for ldx or stx (where 'x' is is b, bu, h, etc depending on data characteristics) is computed by adding a register and a 16 bits immediate. This only allows to go an an address within +/-32k of the content of register.
If the processor has the address of a var in a register (for instance the value returned by a malloc) the immediate allows to do a displacement to access fields in a struct, next array value, etc.
If the address is local or global, it must be computed by the program. Pointers registers are used to that purpose. Local vars addresses are computed by adding an immediate to the stack pointer (r27or sp).
Addresses of global or static vars are computed by adding an integer to the global pointer (r26 or gp). Content of gp corresponds to the start of the memory data segment and is initialized by the loader just before program execution and must not be modified. The immediate displacement with respect to the start of data segment is computed by the linker when it defines memory layout.
Note that this only allows to access 64k memory because of the 16 bits immediate width. If the size of global/static variables exceeds this value and a var is not within this range, a couple of instructions are required to enter the 32 bits of the address of the var before the data transfer. With gp this is not required and it is a way to provide a faster access to global variables.

How do I retrieve high and low-order parts of a value from two registers in inline assembly?

I'm currently working on a little game that can run from the boot sector of a hard drive, just for something fun to do. This means my program runs in 16-bit real mode, and I have my compiler flags set up to emit pure i386 code. I'm writing the game in C++, but I do need a lot of inline assembly to talk to the BIOS via interrupt calls. Some of these calls return a 32-bit integer, but stored in two 16-bit registers. Currently I'm doing the following to get my number out of the assembly:
auto getTicks = [](){
uint16_t ticksL{ 0 }, ticksH{ 0 };
asm volatile("int $0x1a" : "=c"(ticksH), "=d"(ticksL) : "a"(0x0));
return static_cast<uint32_t>( (ticksH << 16) | ticksL );
};
This is a lambda function I use to call this interrupt function which returns a tick count. I'm aware that there are better methods to get time data, and that I haven't implemented a check for AL to see if midnight has passed, but that's another topic.
As you can see, I have to use two 16-bit values, get the register values separately, then combine them into a 32-bit number the way you see at the return statement.
Is there any way I could retrieve that data into a single 32-bit number in my code right away avoid the shift and bitwise-or? I know that those 16-bit registers I'm accessing are really just the higher and lower 16-bits of a 32-bit register in reality, but I have no idea how to access the original 32-bit register as a whole.
I know that those 16-bit registers I'm accessing are really just the higher and lower 16-bits of a 32-bit register in reality, but I have no idea how to access the original 32-bit register as a whole.
As Jester has already pointed out, these are in fact 2 separate registers, so there is no way to retrieve "the original 32-bit register."
One other point: That interrupt modifies the ax register (returning the 'past midnight' flag), however your asm doesn't inform gcc that you are changing ax. Might I suggest something like this:
asm volatile("int $0x1a" : "=c"(ticksH), "=d"(ticksL), "=a"(midnight) : "a"(0x0));
Note that midnight is also a uint16_t.
As other answers suggest you can't load DX and CX directly into a 32-bit register. You'd have to combine them as you suggest.
In this case there is an alternative. Rather than using INT 1Ah/AH=0h you can read the BIOS Data Area (BDA) in low memory for the 32-bit DWORD value and load it into a 32-bit register. This is allowed in real mode on i386 processors. Two memory addresses of interest:
40:6C dword Daily timer counter, equal to zero at midnight;
incremented by INT 8; read/set by INT 1A
40:70 byte Clock rollover flag, set when 40:6C exceeds 24hrs
These two memory addresses are in segment:offset format, but would be equivalent to physical address 0x0046C and 0x00470.
All you'd have to do is temporarily set the DS register to 0 (saving the previous value), turn off interrupts with CLI retrieve the values from lower memory using C/C++ pointers, re-enable interrupts with STI and restore DS to the previously saved value. This of course is added overhead in the boot sector compared to using INT 1Ah/AH=0h but would allow you direct access to the memory addresses the BIOS is reading/writing on your behalf.
Note: If DS is set to zero already no need to save/set/restore it. Since we don't see the code that sets up the environment before calling into the C++ code I don't know what your default segment values are. If you don't need to retrieve both the roll over and timer values and only wish to get them individually you can eliminate the CLI/STI.
You're looking for the 'A' constraint, which refers to the dx:ax register pair as a double-wide value. You can see the full set of defined constraints for x86 in the gcc documentation. Unfortunately there are no constraints for any other register pairs, so you have to get them as two values and reassemble them with shift and or, like you describe.

STM32F4 running FreeRTOS in external RAM

We have a thesis project at work were the guys are trying to get external RAM to work for the STM32F417 MCU.
The project is trying out some stuff that is really resource hungry and the internal RAM just isn't enough.
The question is how to best do this.
The current approach has been to just replace the RAM address in the link script (gnu ld) with the address for external RAM.
The problem with that approach is that during initialisation, the chip has to run on internal RAM since the FSMC has not been initialized.
It seems to work but as soon as pvPortMalloc is run we get a hard fault and it is probably due to dereferencing bogus addresses, we can see that variables are not initialized correctly at system init (which makes sense I guess since the internal RAM is not used at all, when it probably should be).
I realize that this is a vague question, but what is the general approach when running code in external RAM on a Cortex M4 MCU, more specifically the STM32F4?
Thanks
FreeRTOS defines and uses a single big memory area for stack and heap management; this is simply an array of bytes, the size of which is specified by the configTOTAL_HEAP_SIZE symbol in FreeRTOSConfig.h. FreeRTOS allocates tasks stack in this memory area using its pvPortMalloc function, therefore the main goal here is to place the FreeRTOS heap area into external SRAM.
The FreeRTOS heap memory area is defined in heap_*.c (with the exception of heap_3.c that uses the standard library malloc and it doesn't define any custom heap area), the variable is called ucHeap. You can use your compiler extensions to set its section. For GCC, that would be something like:
static uint8_t ucHeap[ configTOTAL_HEAP_SIZE ] __attribute__ ((section (".sram_data")));
Now we need to configure the linker script to place this custom section into external SRAM. There are several ways to do this and it depends again on the toolchain you're using. With GCC one way to do this would be to define a memory region for the SRAM and a section for ".sram_data" to append to the SRAM region, something like:
MEMORY
{
...
/* Define SRAM region */
sram : ORIGIN = <SRAM_START_ADDR>, LENGTH = <SRAM_SIZE>
}
SECTIONS
{
...
/* Define .sram_data section and place it in sram region */
.sram_data :
{
*(.sram_data)
} >sram
...
}
This will place the ucHeap area in external SRAM, while all the other text and data sections will be placed in the default memory regions (internal flash and ram).
A few notes:
make sure you initialize the SRAM controller/FSMC prior to calling any FreeRTOS function (like xTaskCreate)
once you start the tasks, all stack allocated variables will be placed in ucHeap (i.e. ext RAM), but global variables are still allocated in internal RAM. If you still have internal RAM size issues, you can configure other global variables to be placed in the ".sram_section" using compiler extensions (as shown for ucHeap)
if your code uses dynamic memory allocation, make sure you use pvPortMalloc/vPortFree, instead of the stdlib malloc/free. This is because only pvPortMalloc/vPortFree will use the ucHeap area in ext RAM (and they are thread-safe, which is a plus)
if you're doing a lot of dynamic task creation/deletion and memory allocation with pvPortMalloc/vPortFree with different memory block sizes, consider using heap_4.c instead of heap_2.c. heap_2.c has memory fragmentation problems when using several different block sizes, whereas heap_4.c is able to combine adjacent free memory blocks into a single large block
Another (and possibly simpler) solution would be to define the ucHeap variable as a pointer instead of an array, like this:
static uint8_t * const ucHeap = <SRAM_START_ADDR>;
This wouldn't require any special linker script editing, everything can be placed in the default sections. Note that with this solution the linker won't explicitly reserve any memory for the heap and you will loose some potentially useful information/errors (like heap area not fitting in ext RAM). But as long as you only have ucHeap in external RAM and you have configTOTAL_HEAP_SIZE smaller than external RAM size, that might work just fine.
When the application starts up it will try to initialise data by either clearing it to zero, or initialising it to a non-zero value, depending on the section the variable is placed in. Using a normal run time model, that will happen before main() is called. So you have something like:
1) Reset vector calls init code
2) C run time init code initialises variables
3) C run time init code calls main()
If you use the linker to place variables in external RAM then you need to ensure the RAM is accessible before that initialisation takes place, otherwise you will get a hard fault. Therefore you need to either have a boot loader that sets up the system for you, then starts your application....or more simply just edit the start up code to do the following:
1) Reset vector calls init code
2) >>>C run time init code configures external RAM<<<
3) C run time init code initialised variables
4) C run time init code calls main().
That way the RAM is available before you try to access it.
However, if all you want to do is have the FreeRTOS heap in external RAM, then you can leave the init code untouched, and just use an appropriate heap implementation - basically one that does not just declare a large static array. For example, if you use heap_5 then all you need to do is ensure the heap init function is called before any allocation is performed, because the heap init just describes which RAM to use as the heap, rather than statically declaring the heap.

Why doesn't gcc handle volatile register?

I'm working on a timing loop for the AVR platform where I'm counting down a single byte inside an ISR. Since this task is a primary function of my program, I'd like to permanently reserve a processor register so that the ISR doesn't have to hit a memory barrier when its usual code path is decrement, compare to zero, and reti.
The avr-libc docs show how to bind a variable to a register, and I got that working without a problem. However, since this variable is shared between the main program (for starting the timer countdown) and the ISR (for actually counting and signaling completion), it should also be volatile to ensure that the compiler doesn't do anything too clever in optimizing it.
In this context (reserving a register across an entire monolithic build), the combination volatile register makes sense to me semantically, as "permanently store this variable in register rX, but don't optimize away checks because the register might be modified externally". GCC doesn't like this, however, and emits a warning that it might go ahead and optimize away the variable access anyway.
The bug history of this combination in GCC suggests that the compiler team is simply unwilling to consider the type of scenario I'm describing and thinks it's pointless to provide for it. Am I missing some fundamental reason why the volatile register approach is in itself a Bad Idea, or is this a case that makes semantic sense but that the compiler team just isn't interested in handling?
The semantics of volatile are not exactly as you describe "don't optimize away checks because the register might be modified externally" but are actually more narrow: Try to think of it as "don't cache the variable's value from RAM in a register".
Seen this way, it does not make any sense to declare a register as volatile because the register itself cannot be 'cached' and therefore cannot possibly be inconsistent with the variable's 'actual' value.
The fact that read accesses to volatile variables are usually not optimzed away is merely a side effect of the above semantics, but it's not guaranteed.
I think GCC should assume by default that a value in a register is 'like volatile' but I have not verified that it actually does so.
Edit:
I just did a small test and found:
avr-gcc 4.6.2 does not treat global register variables like volatiles with respect to read accesses, and
the Naggy extension for Atmel Studio detects an error in my code: "global register variables are not supported".
Assuming that global register variables are actually considered "unsupported" I am not surprised that gcc treats them just like local variables, with the known implications.
My test code looks like this:
uint8_t var;
volatile uint8_t volVar;
register uint8_t regVar asm("r13");
#define NOP asm volatile ("nop\r\n":::)
int main(void)
{
var = 1; // <-- kept
if ( var == 0 ) {
NOP; // <-- optimized away, var is not volatile
}
volVar = 1; // <-- kept
if ( volVar == 0 ) {
NOP; // <-- kept, volVar *is* volatile
}
regVar = 1; // <-- optimized away, regVar is treated like a local variable
if ( regVar == 0 ) {
NOP; // <-- optimized away consequently
}
for(;;){}
}
The reason you would use the volatile keyword on AVR variables is to, as you said, avoid the compiler optimizing access to the variable. The question now is, how does this happen though?
A variable has two places it can reside. Either in the general purpose register file or in some location in RAM. Consider the case where the variable resides in RAM. To access the latest value of the variable, the compiler loads the variable from RAM, using some form of the ld instruction, say lds r16, 0x000f. In this case, the variable was stored in RAM location 0x000f and the program made a copy of this variable in r16. Now, here is where things get interesting if interrupts are enabled. Say that after loading the variable, the following occurs inc r16, then an interrupt triggers and its corresponding ISR is run. Within the ISR, the variable is also used. There is a problem, however. The variable exists in two different versions, one in RAM and one in r16. Ideally, the compiler should use the version in r16, but this one is not guaranteed to exist, so it loads it from RAM instead, and now, the code does not operate as needed. Enter then the volatile keyword. The variable is still stored in RAM, however, the compiler must ensure that the variable is updated in RAM before anything else happens, thus the following assembly may be generated:
cli
lds r16, 0x000f
inc r16
sei
sts 0x000f, r16
First, interrupts are disabled. Then, the the variable is loaded into r16. The variable is increased, interrupts are enabled and then the variable is stored. It may appear confusing for the global interrupt flag to be enabled before the variable is stored back in RAM, but from the instruction set manual:
The instruction following SEI will be executed before any pending interrupts.
This means that the sts instruction will be executed before any interrupts trigger again, and that the interrupts are disabled for the minimum amount of time possible.
Consider now the case where the variable is bound to a register. Any operations done on the variable are done directly on the register. These operations, unlike operations done to a variable in RAM, can be considered atomic, as there is no read -> modify -> write cycle to speak of. If an interrupt triggers after the variable is updated, it will get the new value of the variable, since it will read the variable from the register it was bound to.
Also, since the variable is bound to a register, any test instructions will utilize the register itself and will not be optimized away on the grounds the compiler may have a "hunch" it is a static value, given that registers by their very nature are volatile.
Now, from experience, when using interrupts in AVR, I have sometimes noticed that the global volatile variables never hit RAM. The compiler kept them on the registers all the time, bypassing the read -> modify -> write cycle alltogether. This was due, however, to compiler optimizations, and it should not be relied on. Different compilers are free to generate different assembly for the same piece of code. You can generate a disassembly of your final file or any particular object files using the avr-objdump utility.
Cheers.
Reserving a register for one variable for a complete compilation unit is probably too restrictive for a compiler's code generator. That is, every C routine would have to NOT use that register.
How do you guarantee that other called routines do NOT use that register once your code goes out of scope? Even stuff like serial i/o routines would have to NOT use that reserved register. Compilers do NOT recompile their run-time libraries based on a data definition in a user program.
Is your application really so time sensitive that the extra delay for bringing memory up from L2 or L3 can be detected? If so, then your ISR might be running so frequently that the required memory location is always available (i.e. it doesn't get paged back down thru the cache) and thus does NOT hit a memory barrier (I assume by memory barrier you are referring to how memory in a cpu really operates, through caching, etc.). But for this to really be true the up would have to have a fairly large L1 cache and the ISR would have to run at a very high frequency.
Finally, sometimes an application's requirements make it necessary to code it in ASM in which case you can do exactly what you are requesting!

Resources