Related
I encountered a performance issue, where the shared memory's atomicAdd on float is much more expensive than it on int after profiling with nv-nsight-cu-cli.
After checking the generated SASS, I found the generated SASS of the shared memory's atomicAdd on float and int are not similar at all.
Here I show a example in minimal cuda code:
$ cat test.cu
__global__ void testAtomicInt() {
__shared__ int SM_INT;
SM_INT = 0;
__syncthreads();
atomicAdd(&(SM_INT), ((int)1));
}
__global__ void testAtomicFloat() {
__shared__ float SM_FLOAT;
SM_FLOAT = 0.0;
__syncthreads();
atomicAdd(&(SM_FLOAT), ((float)1.1));
}
$ nvcc -arch=sm_86 -c test.cu
$ cuobjdump -sass test.o
Fatbin elf code:
================
arch = sm_86
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_86
Function : _Z15testAtomicFloatv
.headerflags #"EF_CUDA_SM86 EF_CUDA_PTX_SM(EF_CUDA_SM86)"
/*0000*/ MOV R1, c[0x0][0x28] ; /* 0x00000a0000017a02 */
/* 0x000fc40000000f00 */
/*0010*/ STS [RZ], RZ ; /* 0x000000ffff007388 */
/* 0x000fe80000000800 */
/*0020*/ BAR.SYNC 0x0 ; /* 0x0000000000007b1d */
/* 0x000fec0000000000 */
/*0030*/ LDS R2, [RZ] ; /* 0x00000000ff027984 */
/* 0x000e240000000800 */
/*0040*/ FADD R3, R2, 1.1000000238418579102 ; /* 0x3f8ccccd02037421 */
/* 0x001fcc0000000000 */
/*0050*/ ATOMS.CAST.SPIN R3, [RZ], R2, R3 ; /* 0x00000002ff03738d */
/* 0x000e240001800003 */
/*0060*/ ISETP.EQ.U32.AND P0, PT, R3, 0x1, PT ; /* 0x000000010300780c */
/* 0x001fda0003f02070 */
/*0070*/ #!P0 BRA 0x30 ; /* 0xffffffb000008947 */
/* 0x000fea000383ffff */
/*0080*/ EXIT ; /* 0x000000000000794d */
/* 0x000fea0003800000 */
/*0090*/ BRA 0x90; /* 0xfffffff000007947 */
/* 0x000fc0000383ffff */
/*00a0*/ NOP; /* 0x0000000000007918 */
..........
Function : _Z13testAtomicIntv
.headerflags #"EF_CUDA_SM86 EF_CUDA_PTX_SM(EF_CUDA_SM86)"
/*0000*/ MOV R1, c[0x0][0x28] ; /* 0x00000a0000017a02 */
/* 0x000fc40000000f00 */
/*0010*/ STS [RZ], RZ ; /* 0x000000ffff007388 */
/* 0x000fe80000000800 */
/*0020*/ BAR.SYNC 0x0 ; /* 0x0000000000007b1d */
/* 0x000fec0000000000 */
/*0030*/ ATOMS.POPC.INC.32 RZ, [URZ] ; /* 0x00000000ffff7f8c */
/* 0x000fe2000d00003f */
/*0040*/ EXIT ; /* 0x000000000000794d */
/* 0x000fea0003800000 */
/*0050*/ BRA 0x50; /* 0xfffffff000007947 */
/* 0x000fc0000383ffff */
/*0060*/ NOP; /* 0x0000000000007918 */
..........
Fatbin ptx code:
================
arch = sm_86
code version = [7,5]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
From the generated SASS code above, we could clearly obtain, that the shared memory's atomicAdd on int generates single lightweight ATOMS.POPC.INC.32 RZ, [URZ], while it on float generating a bunch of SASS with a heavyweight ATOMS.CAST.SPIN R3, [RZ], R2, R3 .
The CUDA Binary Utilities doesn't show me the meaning of CAST or SPIN. However, I could guess it means an exclusive spin lock on a shared memory address. (Correct me, if my guess goes wrong.)
In my real code, none of the SASS of atomicAdd of int has a hotspot. However, this ATOMS.CAST.SPIN is significantly hotter than other SASS code generated by of the atomicAdd of float.
In addition, I tested with compiler flag -arch=sm_86, -arch=sm_80 and -arch=sm_75. Under those CCs, the generated SPSS code of atomicAdd of float is very similar. Another fact is, with no surprise, the atomicAdd of double generates SPSS alike it of float.
This observation caused me more confusion than questions. I would go with some simple questions from my profiling experience and hope we could have a nice discussion.
What does exactly ATOMS.CAST.SPIN do? The only SASS document I am aware of is the CUDA Binary Utilities.
Why should the atomicAdd of float generates more SASS code and does more work than it on int? I know it is a general question and hard to be answered. Maybe the ATOMS.POPC.INC simply doesn't apply to data type float or double?
If it is more vulnerable to have more shared memory load and store conflict and thus more stall time for the atomicAdd of float than the atomicAdd of int? The former has clearly more instruction to be executed and divergent branches. I have the following code snippet in my project where the number of function calls on two functions is the same. However, the atomicAdd of float builds a runtime bottleneck while it is on int doesn't.
atomicAdd(&(SM_INT), ((int)1)); // no hotspot
atomicAdd(&(SM_FLOAT), ((float)1.1)); // a hotspot
I probably won't be able to provide an answer addressing every possible question. CUDA SASS is really not documented to the level to explain these things.
What does exactly ATOMS.CAST.SPIN do? The only SASS document I am aware of is the CUDA Binary Utilities.
ATOMS.CAST.SPIN
^^^^^ ^^^
|| |
|| compare and swap
|shared
atomic
The programming guide gives an indication of how one can implement an "arbitrary" atomic operation, using atomic CAS (Compare And Swap). You should first familiarize yourself with how atomic CAS works.
Regarding the "arbitrary atomic" example, the thing to note is that it can evidently be used to provide atomic operations for e.g. datatypes that are not supported by a "native" atomic instruction, such as atomic add. Another thing to note is that it is essentially a loop around an atomic CAS instruction, with the loop checking to see if the operation was "successful" or not. If it was "unsuccessful", the loop continues. If it was "successful", the loop exits.
This is effectively what we see depicted in SASS code in your float example:
/*0030*/ LDS R2, [RZ] ; // get the current value in the location
FADD R3, R2, 1.1000000238418579102 ; // perform ordinary floating-point add
ATOMS.CAST.SPIN R3, [RZ], R2, R3 ; // attempt to atomically replace the result in the location
ISETP.EQ.U32.AND P0, PT, R3, 0x1, PT ; // check if replacement was successful
#!P0 BRA 0x30 // if not, loop and try again
These are essentially the steps that are outlined in the "arbitrary atomic" example in the programming guide. Based on this I would conclude the following:
the architecture you compiled for does not actually have a "native" atomic operation of the type you are requesting
the atomic operation you are requesting can be done using the looping method
the compiler tool chain (typically ptxas, but could also be the JIT system), as a convenience feature, is automatically implementing this looping method for you, rather than throwing a compile error
Why should the atomicAdd of float generates more SASS code and does more work than it on int?
Evidently, the architecture you are compiling for does not have a "native" implementation of atomic add for float, and so the compiler tool chain has chosen to implement this looping method for you. Since the loop effectively involves the possibility of success/failure which will determine whether this loop continues, and success/failure depends on other threads behavior (contention to perform the atomic), the looping method may do considerably more "work" than a native single instruction will.
If it is more vulnerable to have more shared memory load and store conflict and thus more stall time for the atomicAdd of float than the atomicAdd of int?
Yes, I personally would conclude that the native atomic method is more efficient, and the looping method may be less efficient, which could be expressed in a variety of ways in the profiler, such as warp stalls.
EDIT:
It's possible for things to be implemented/available in one GPU architecture but not another. This is certainly applicable to atomics, and you can see examples of this if you read the previously linked section on atomics in the programming guide. I don't know of any architectures today that are "newer" than cc8.0 or cc8.6 (Ampere) but it is certainly possible that the behavior of a future (or any other) GPU could be different here.
This loop-around-atomicCAS method is distinct from a previous methodology (lock/update/unlock, which also involves a loop for lock negotiation) the compiler toolchain used on Kepler and prior architectures to provide atomics on shared memory when no formal SASS instructions existed to do so.
I'm building a tiny microcontroller with only the bare essentials for self-educational purposes. This way, I can refresh my knowledge about topics like the linkerscript, the startup code, ...
EDIT:
I got quite a lot of comments pointing out that the "absolute minimal STM32-application" shown below is no good. You are absolutely right when noticing that the vector table is not complete, the .bss-section is not taken care of, the peripheral addresses are not complete, ... Please allow me to explain why.
It has never been the purpose of the author to write a complete and useful application in this particular chapter. His purpose was to explain step-by-step how a linkerscript works, how startup code works, what the boot procedure of an STM32 looks like, ... purely for educational purposes. I can appreciate this approach, and learned a lot.
The example I have put below is taken from the middle of the chapter in question. The chapter keeps adding more parts to the linkerscript and startup code (for example initialization of .bss-section) as it goes forward.
The reason I put files here from the middle of his chapter, is because I got stuck at a particular error message. I want to get that fixed before continuing.
The chapter in question is somewhere at the end of his book. It is intended for the more experienced or curious reader who wants to gain deeper knowledge about topics most people don't even consider (most people use the standard linkerscript and startup code given by the manufacturer without ever reading it).
Keeping this in mind, please let us focus on the technical issue at hand (as described below in the error messages). Please also accept my sincere apologies that I didn't clarify the intentions of the writer earlier. But I've done it now, so we can move on ;-)
1. Absolute minimal STM32-application
The tutorial I'm following is chapter 20 from this book: "Mastering STM32" (https://leanpub.com/mastering-stm32). The book explains how to make a tiny microcontroller application with two files: main.c and linkerscript.ld. As I'm not using an IDE (like Eclipse), I also added build.bat and clean.bat to generate the compilation commands. So my project folder looks like this:
Before I continue, I should perhaps give some more details about my system:
OS: Windows 10, 64-bit
Microcontroller: NUCLEO-F401RE board with STM32F401RE microcontroller.
Compiler: arm-none-eabi-gcc version 6.3.1 20170620 (release) [ARM/embedded-6-branch revision 249437].
The main file looks like this:
/* ------------------------------------------------------------ */
/* Minimal application */
/* for NUCLEO-F401RE */
/* ------------------------------------------------------------ */
typedef unsigned long uint32_t;
/* Memory and peripheral start addresses (common to all STM32 MCUs) */
#define FLASH_BASE 0x08000000
#define SRAM_BASE 0x20000000
#define PERIPH_BASE 0x40000000
/* Work out end of RAM address as initial stack pointer
* (specific of a given STM32 MCU) */
#define SRAM_SIZE 96*1024 //STM32F401RE has 96 KB of RAM
#define SRAM_END (SRAM_BASE + SRAM_SIZE)
/* RCC peripheral addresses applicable to GPIOA
* (specific of a given STM32 MCU) */
#define RCC_BASE (PERIPH_BASE + 0x23800)
#define RCC_APB1ENR ((uint32_t*)(RCC_BASE + 0x30))
/* GPIOA peripheral addresses
* (specific of a given STM32 MCU) */
#define GPIOA_BASE (PERIPH_BASE + 0x20000)
#define GPIOA_MODER ((uint32_t*)(GPIOA_BASE + 0x00))
#define GPIOA_ODR ((uint32_t*)(GPIOA_BASE + 0x14))
/* Function headers */
int main(void);
void delay(uint32_t count);
/* Minimal vector table */
uint32_t *vector_table[] __attribute__((section(".isr_vector"))) = {
(uint32_t*)SRAM_END, // initial stack pointer (MSP)
(uint32_t*)main // main as Reset_Handler
};
/* Main function */
int main() {
/* Enable clock on GPIOA peripheral */
*RCC_APB1ENR = 0x1;
/* Configure the PA5 as output pull-up */
*GPIOA_MODER |= 0x400; // Sets MODER[11:10] = 0x1
while(1) { // Always true
*GPIOA_ODR = 0x20;
delay(200000);
*GPIOA_ODR = 0x0;
delay(200000);
}
}
void delay(uint32_t count) {
while(count--);
}
The linkerscript looks like this:
/* ------------------------------------------------------------ */
/* Linkerscript */
/* for NUCLEO-F401RE */
/* ------------------------------------------------------------ */
/* Memory layout for STM32F401RE */
MEMORY
{
FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 512K
SRAM (xrw) : ORIGIN = 0x20000000, LENGTH = 96K
}
/* The ENTRY(..) directive overrides the default entry point symbol _start.
* Here we define the main-routine as the entry point.
* In fact, the ENTRY(..) directive is meaningless for embedded chips,
* but it is informative for debuggers. */
ENTRY(main)
SECTIONS
{
/* Program code into FLASH */
.text : ALIGN(4)
{
*(.isr_vector) /* Vector table */
*(.text) /* Program code */
*(.text*) /* Merge all .text.* sections inside the .text section */
KEEP(*(.isr_vector)) /* Don't allow other tools to strip this off */
} >FLASH
_sidata = LOADADDR(.data); /* Used by startup code to initialize data */
.data : ALIGN(4)
{
. = ALIGN(4);
_sdata = .; /* Create a global symbol at data start */
*(.data)
*(.data*)
. = ALIGN(4);
_edata = .; /* Define a global symbol at data end */
} >SRAM AT >FLASH
}
The build.bat file calls the compiler on main.c, and next the linker:
#echo off
setlocal EnableDelayedExpansion
echo.
echo ----------------------------------------------------------------
echo. )\ ***************************
echo. ( =_=_=_=^< ^| * build NUCLEO-F401RE *
echo. )( ***************************
echo. ""
echo.
echo.
echo. Call the compiler on main.c
echo.
#arm-none-eabi-gcc main.c -o main.o -c -MMD -mcpu=cortex-m4 -mthumb -mfloat-abi=hard -mfpu=fpv4-sp-d16 -O0 -g3 -Wall -fmessage-length=0 -Werror-implicit-function-declaration -Wno-comment -Wno-unused-function -ffunction-sections -fdata-sections
echo.
echo. Call the linker
echo.
#arm-none-eabi-gcc main.o -o myApp.elf -mcpu=cortex-m4 -mthumb -mfloat-abi=hard -mfpu=fpv4-sp-d16 -specs=nosys.specs -specs=nano.specs -T linkerscript.ld -Wl,-Map=output.map -Wl,--gc-sections
echo.
echo. Post build
echo.
#arm-none-eabi-objcopy -O binary myApp.elf myApp.bin
arm-none-eabi-size myApp.elf
echo.
echo ----------------------------------------------------------------
The clean.bat file removes all the compiler output:
#echo off
setlocal EnableDelayedExpansion
echo ----------------------------------------------------------------
echo. __ **************
echo. __\ \___ * clean *
echo. \ _ _ _ \ **************
echo. \_`_`_`_\
echo.
del /f /q main.o
del /f /q main.d
del /f /q myApp.bin
del /f /q myApp.elf
del /f /q output.map
echo ----------------------------------------------------------------
Building this works. I get the following output:
C:\Users\Kristof\myProject>build
----------------------------------------------------------------
)\ ***************************
( =_=_=_=< | * build NUCLEO-F401RE *
)( ***************************
""
Call the compiler on main.c
Call the linker
Post build
text data bss dec hex filename
112 0 0 112 70 myApp.elf
----------------------------------------------------------------
2. Proper startup code
Maybe you have noticed that the minimal application didn't have proper startup code to initialize the global variables in the .data-section. Chapter 20.2.2 .data and .bss Sections initialization from the "Mastering STM32" book explains how to do this.
As I follow along, my main.c file now looks like this:
/* ------------------------------------------------------------ */
/* Minimal application */
/* for NUCLEO-F401RE */
/* ------------------------------------------------------------ */
typedef unsigned long uint32_t;
/* Memory and peripheral start addresses (common to all STM32 MCUs) */
#define FLASH_BASE 0x08000000
#define SRAM_BASE 0x20000000
#define PERIPH_BASE 0x40000000
/* Work out end of RAM address as initial stack pointer
* (specific of a given STM32 MCU) */
#define SRAM_SIZE 96*1024 //STM32F401RE has 96 KB of RAM
#define SRAM_END (SRAM_BASE + SRAM_SIZE)
/* RCC peripheral addresses applicable to GPIOA
* (specific of a given STM32 MCU) */
#define RCC_BASE (PERIPH_BASE + 0x23800)
#define RCC_APB1ENR ((uint32_t*)(RCC_BASE + 0x30))
/* GPIOA peripheral addresses
* (specific of a given STM32 MCU) */
#define GPIOA_BASE (PERIPH_BASE + 0x20000)
#define GPIOA_MODER ((uint32_t*)(GPIOA_BASE + 0x00))
#define GPIOA_ODR ((uint32_t*)(GPIOA_BASE + 0x14))
/* Function headers */
void __initialize_data(uint32_t*, uint32_t*, uint32_t*);
void _start (void);
int main(void);
void delay(uint32_t count);
/* Minimal vector table */
uint32_t *vector_table[] __attribute__((section(".isr_vector"))) = {
(uint32_t*)SRAM_END, // initial stack pointer (MSP)
(uint32_t*)_start // _start as Reset_Handler
};
/* Variables defined in linkerscript */
extern uint32_t _sidata;
extern uint32_t _sdata;
extern uint32_t _edata;
volatile uint32_t dataVar = 0x3f;
/* Data initialization */
inline void __initialize_data(uint32_t* flash_begin, uint32_t* data_begin, uint32_t* data_end) {
uint32_t *p = data_begin;
while(p < data_end)
*p++ = *flash_begin++;
}
/* Entry point */
void __attribute__((noreturn,weak)) _start (void) {
__initialize_data(&_sidata, &_sdata, &_edata);
main();
for(;;);
}
/* Main function */
int main() {
/* Enable clock on GPIOA peripheral */
*RCC_APB1ENR = 0x1;
/* Configure the PA5 as output pull-up */
*GPIOA_MODER |= 0x400; // Sets MODER[11:10] = 0x1
while(dataVar == 0x3f) { // Always true
*GPIOA_ODR = 0x20;
delay(200000);
*GPIOA_ODR = 0x0;
delay(200000);
}
}
void delay(uint32_t count) {
while(count--);
}
I've added the initialization code just above the main(..) function. The linkerscript has also some modification:
/* ------------------------------------------------------------ */
/* Linkerscript */
/* for NUCLEO-F401RE */
/* ------------------------------------------------------------ */
/* Memory layout for STM32F401RE */
MEMORY
{
FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 512K
SRAM (xrw) : ORIGIN = 0x20000000, LENGTH = 96K
}
/* The ENTRY(..) directive overrides the default entry point symbol _start.
* In fact, the ENTRY(..) directive is meaningless for embedded chips,
* but it is informative for debuggers. */
ENTRY(_start)
SECTIONS
{
/* Program code into FLASH */
.text : ALIGN(4)
{
*(.isr_vector) /* Vector table */
*(.text) /* Program code */
*(.text*) /* Merge all .text.* sections inside the .text section */
KEEP(*(.isr_vector)) /* Don't allow other tools to strip this off */
} >FLASH
_sidata = LOADADDR(.data); /* Used by startup code to initialize data */
.data : ALIGN(4)
{
. = ALIGN(4);
_sdata = .; /* Create a global symbol at data start */
*(.data)
*(.data*)
. = ALIGN(4);
_edata = .; /* Define a global symbol at data end */
} >SRAM AT >FLASH
}
The little application doesn't compile anymore. Actually, the compilation from main.c to main.o is still okay. But the linking process gets stuck:
C:\Users\Kristof\myProject>build
----------------------------------------------------------------
)\ ***************************
( =_=_=_=< | * build NUCLEO-F401RE *
)( ***************************
""
Call the compiler on main.c
Call the linker
c:/gnu_arm_embedded_toolchain/bin/../lib/gcc/arm-none-eabi/6.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m/fpv4-sp/hard/crt0.o: In function `_start':
(.text+0x64): undefined reference to `__bss_start__'
c:/gnu_arm_embedded_toolchain/bin/../lib/gcc/arm-none-eabi/6.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m/fpv4-sp/hard/crt0.o: In function `_start':
(.text+0x68): undefined reference to `__bss_end__'
collect2.exe: error: ld returned 1 exit status
Post build
arm-none-eabi-objcopy: 'myApp.elf': No such file
arm-none-eabi-size: 'myApp.elf': No such file
----------------------------------------------------------------
3. What I've tried
I've omitted this part, otherwise this question gets too long ;-)
4. Solution
#berendi provided the solution. Thank you #berendi! Apparently I need to add the flags -nostdlib and -ffreestanding to gcc and the linker. The build.bat file now looks like this:
#echo off
setlocal EnableDelayedExpansion
echo.
echo ----------------------------------------------------------------
echo. )\ ***************************
echo. ( =_=_=_=^< ^| * build NUCLEO-F401RE *
echo. )( ***************************
echo. ""
echo.
echo.
echo. Call the compiler on main.c
echo.
#arm-none-eabi-gcc main.c -o main.o -c -MMD -mcpu=cortex-m4 -mthumb -mfloat-abi=hard -mfpu=fpv4-sp-d16 -O0 -g3 -Wall -fmessage-length=0 -Werror-implicit-function-declaration -Wno-comment -Wno-unused-function -ffunction-sections -fdata-sections -ffreestanding -nostdlib
echo.
echo. Call the linker
echo.
#arm-none-eabi-gcc main.o -o myApp.elf -mcpu=cortex-m4 -mthumb -mfloat-abi=hard -mfpu=fpv4-sp-d16 -specs=nosys.specs -specs=nano.specs -T linkerscript.ld -Wl,-Map=output.map -Wl,--gc-sections -ffreestanding -nostdlib
echo.
echo. Post build
echo.
#arm-none-eabi-objcopy -O binary myApp.elf myApp.bin
arm-none-eabi-size myApp.elf
echo.
echo ----------------------------------------------------------------
Now it works!
In his answer, #berendi also gives a few interesting remarks about the main.c file. I've applied most of them:
Missing volatile keyword
Empty loop
Missing Memory Barrier (did I put the memory barrier in the correct place?)
Missing delay after RCC enable
Misleading symbolic name (apparently it should be RCC_AHB1ENR instead of RCC_APB1ENR).
The vector table: this part I've skipped. Right now I don't really need a HardFault_Handler, MemManage_Handler, ... as this is just a tiny test for educational purposes.
Nevertheless, I did notice that #berendi put a few interesting modifications in the way he declares the vector table. But I'm not entirely grasping what he's doing exactly.
The main.c file now looks like this:
/* ------------------------------------------------------------ */
/* Minimal application */
/* for NUCLEO-F401RE */
/* ------------------------------------------------------------ */
typedef unsigned long uint32_t;
/**
\brief Data Synchronization Barrier
\details Acts as a special kind of Data Memory Barrier.
It completes when all explicit memory accesses before this instruction complete.
*/
__attribute__((always_inline)) static inline void __DSB(void)
{
__asm volatile ("dsb 0xF":::"memory");
}
/* Memory and peripheral start addresses (common to all STM32 MCUs) */
#define FLASH_BASE 0x08000000
#define SRAM_BASE 0x20000000
#define PERIPH_BASE 0x40000000
/* Work out end of RAM address as initial stack pointer
* (specific of a given STM32 MCU) */
#define SRAM_SIZE 96*1024 //STM32F401RE has 96 KB of RAM
#define SRAM_END (SRAM_BASE + SRAM_SIZE)
/* RCC peripheral addresses applicable to GPIOA
* (specific of a given STM32 MCU) */
#define RCC_BASE (PERIPH_BASE + 0x23800)
#define RCC_AHB1ENR ((volatile uint32_t*)(RCC_BASE + 0x30))
/* GPIOA peripheral addresses
* (specific of a given STM32 MCU) */
#define GPIOA_BASE (PERIPH_BASE + 0x20000)
#define GPIOA_MODER ((volatile uint32_t*)(GPIOA_BASE + 0x00))
#define GPIOA_ODR ((volatile uint32_t*)(GPIOA_BASE + 0x14))
/* Function headers */
void __initialize_data(uint32_t*, uint32_t*, uint32_t*);
void _start (void);
int main(void);
void delay(uint32_t count);
/* Minimal vector table */
uint32_t *vector_table[] __attribute__((section(".isr_vector"))) = {
(uint32_t*)SRAM_END, // initial stack pointer (MSP)
(uint32_t*)_start // _start as Reset_Handler
};
/* Variables defined in linkerscript */
extern uint32_t _sidata;
extern uint32_t _sdata;
extern uint32_t _edata;
volatile uint32_t dataVar = 0x3f;
/* Data initialization */
inline void __initialize_data(uint32_t* flash_begin, uint32_t* data_begin, uint32_t* data_end) {
uint32_t *p = data_begin;
while(p < data_end)
*p++ = *flash_begin++;
}
/* Entry point */
void __attribute__((noreturn,weak)) _start (void) {
__initialize_data(&_sidata, &_sdata, &_edata);
asm volatile("":::"memory"); // <- Did I put this instruction at the right spot?
main();
for(;;);
}
/* Main function */
int main() {
/* Enable clock on GPIOA peripheral */
*RCC_AHB1ENR = 0x1;
__DSB();
/* Configure the PA5 as output pull-up */
*GPIOA_MODER |= 0x400; // Sets MODER[11:10] = 0x1
while(dataVar == 0x3f) { // Always true
*GPIOA_ODR = 0x20;
delay(200000);
*GPIOA_ODR = 0x0;
delay(200000);
}
}
void delay(uint32_t count) {
while(count--){
asm volatile("");
}
}
PS: The book "Mastering STM32" from Carmine Noviello is an absolute masterpiece. You should read it! => https://leanpub.com/mastering-stm32
You can tell gcc not to use the library.
The Compiler
By default, gcc assumes that you are using a standard C library, and can emit code that calls some functions. For example, when optimizations are enabled, it detects loops that copy a piece of memory, and may substitute them with a call to memcpy(). Disable it with -ffreestanding.
The Linker
The linker assumes as well that you want to link your program with the C library and startup code. The library startup code is responsible for initializing the library and the program execution environment. It has a function named _start() which has to be called after reset. One of its functions is to fill the .bss segment (see below) with zero. If the symbols that delimit .bss are not defined, then _startup() cannot be linked. Had you named your startup function anything else but _startup(), then the library startup would have been siletly dropped by the linker as an unused function, and the code could have been linked.
You can tell the linker not to link any standard library or startup code with -nostdlib, then the library supplied startup function name would not conflict with yours, and you would get a linker error every time you accidentally invoked a library function.
Missing volatile
Your register definitions are missing the volatile qualifier. Without it, subsequent writes to *GPIOA_ODR will be optimized out. The compiler will move this "invariant code" out of the loop. Changing the type in the register definitions to (volatile uint32_t*) would fix that.
Empty loop
The optimizer can recognize that the delay loop does nothing, and eliminate it completely to speed up execution. Add an empty but non-removable asm volatile(""); instruction to the delay loop.
Missing Memory Barrier
You are initializing the .data section that holds dataVar in a C function. The *p in __initialize_data() is effectively an alias for dataVar, and the compiler has no way to know it. The optimizer could theoretically rearrange the test of dataVar before __initialize_data(). Even if dataVar is volatile, *p is not, therefore ordering is not guaranteed.
After the data initialization loop, you should tell the compiler that program variables are changed by a mechanism unknown to the compiler:
asm volatile("":::"memory");
It's an old-fashioned gcc extension, the latest C standards might have defined a portable way to do this (which is not recognized by older gcc versions).
Missing delay after RCC enable
The Errata is saying,
A delay between an RCC peripheral clock enable and the effective peripheral enabling should be taken into account in order to manage the peripheral read/write to registers.
This delay depends on the peripheral mapping:
• If the peripheral is mapped on AHB: the delay should be equal to 2 AHB cycles.
• If the peripheral is mapped on APB: the delay should be equal to 1 + (AHB/APB prescaler) cycles.
Workarounds
Use the DSB instruction to stall the Cortex®-M4 CPU pipeline until the instruction is completed.
Therefore, insert a
__DSB();
after *RCC_APB1ENR = 0x1; (which should be called something else)
Misleading symbolic name
Although the address for enabling GPIOA in RCC seems to be correct, the register is called RCC_AHB1ENR in the documentation. It will confuse people trying to understand your code.
The Vector Table
Although technically you can get away with having only a stack pinter and a reset handler in it, I'd too recommend having a few more entries, at least the fault handlers for simple troubleshooting.
__attribute__ ((section(".isr_vector"),used))
void (* const _vectors[]) (void) = {
(void (*const)(void))(&__stack),
Reset_Handler,
NMI_Handler,
HardFault_Handler,
MemManage_Handler,
BusFault_Handler,
UsageFault_Handler
}
The Linker Script
At the bare minimum, it must define a section for your vector table, and the code. A program must have a start address and some code, static data is optional. The rest depends on what kind of data your program is using. You could technically omit them from the linker script if there are no data of a particular type.
.rodata: read-only data, const arrays and structs go here. They remain in flash. (simple const variables are usually put in the code)
.data: initialized variables, everything you declare with an = sign, and without const.
.bss: variables that should be zero-initialized in C, i.e. global and static ones.
As you don't need .rodata or .bss now, it's fine.
Linker scripts in general are an artform, they are their own programming language and gnu's are certainly a bit of a nightmare. Divide the task into figuring out the linker script from making a working binary, once you can see the linker script is doing what you want then make the bootstrap code to use it. Take advantage of the toolchain.
The example the author used was derived from code written specifically to be used as baremetal examples that maximize success. Avoided common language and toolchain issues, yet be portable across many versions of the toolchain and to be easily ported to other toolchains (minimal reliance on the toolchain, in particular the linker script which leads to the bootstrap). The author of the book used that code but added risk to it to not be as reliable of an example.
Avoiding .data specifically and not relying on .bss to be zeroed when you write baremetal code goes a very long way toward long term success.
It was also modified such that optimization would prevent that code from working (well blinking at a rate you can see).
An example somewhat minimal linker script for binutils that you can modify to work toward .data and .bss initialization looks generically like this
test.ld
MEMORY
{
bob : ORIGIN = 0x8000, LENGTH = 0x1000
ted : ORIGIN = 0xA000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > bob
__data_rom_start__ = .;
.data : {
__data_start__ = .;
*(.data*)
} > ted AT > bob
__data_end__ = .;
__data_size__ = __data_end__ - __data_start__;
.bss : {
__bss_start__ = .;
*(.bss*)
} > ted
__bss_end__ = .;
__bss_size__ = __bss_end__ - __bss_start__;
}
(note memory names dont have to be rom or ram or flash or data or whatever bob is program space and ted is memory btw, change the addresses as desired)
How you see what is going on is you can link with a simple example or with your code, you need some .data and some .bss (and some .text).
vectors.s
.thumb
.globl _start
_start:
.word 0x20001000
.word reset
.thumb_func
reset:
bl notmain
b .
.globl bss_start
bss_start: .word __bss_start__
.globl bss_end
bss_end: .word __bss_end__
.word __bss_size__
.globl data_rom_start
data_rom_start:
.word __data_rom_start__
.globl data_start
data_start:
.word __data_start__
.globl data_end
data_end:
.word __data_end__
.word __data_size__
so.c
unsigned int a=1;
unsigned int b=2;
unsigned int c;
unsigned int d;
unsigned int e;
unsigned int notmain ( void )
{
return(a+b+c+d+e);
}
build
arm-none-eabi-as vectors.s -o vectors.o
arm-none-eabi-gcc -O2 -c -mthumb so.c -o so.o
arm-none-eabi-ld -T test.ld vectors.o so.o -o vectors.elf
arm-none-eabi-objdump -D vectors.elf
The code so far is not specific to arm-none-whatever or arm-linux-whatever versions of the toolchain. If/when you need gcclib items you can use gcc instead of ld but you have to be careful when doing that...or provide the path to libgcc and use ld.
What we get from this code is linker script debugging on the cheap:
Disassembly of section .text:
00008000 <_start>:
8000: 20001000 andcs r1, r0, r0
8004: 00008009 andeq r8, r0, r9
00008008 <reset>:
8008: f000 f810 bl 802c <notmain>
800c: e7fe b.n 800c <reset+0x4>
0000800e <bss_start>:
800e: 0000a008 andeq sl, r0, r8
00008012 <bss_end>:
8012: 0000a014 andeq sl, r0, r4, lsl r0
8016: 0000000c andeq r0, r0, ip
0000801a <data_rom_start>:
801a: 00008058 andeq r8, r0, r8, asr r0
0000801e <data_start>:
801e: 0000a000 andeq sl, r0, r0
00008022 <data_end>:
8022: 0000a008 andeq sl, r0, r8
8026: 00000008 andeq r0, r0, r8
...
We care about the 32 bit values being created the andeq disassembly is because the disassembler is trying to disassemble those values as instructions which they are not. The reset instructions are real the rest is 32 bit values we are generating. might be able to use readelf, but getting used to disassembling, insuring the vector table is correct as step one, which is easy to see in the disassembly. Using the disassembler as a habit can then lead to using it as above to show you what the linker generated.
If you dont get the linker script variables right you wont be able to write a successful bootstrap, if you dont have a good way to see what the linker is producing you will fail on a regular basis.
Yes, you could have exposed them in C and not assembly, the toolchain would still help you there.
You can work toward this now that you can see what the linker is doing:
.thumb
.globl _start
_start:
.word 0x20001000
.word reset
.thumb_func
reset:
ldr r0,=__bss_start__
ldr r1,=__bss_size__
# zero this
ldr r0,=__data_rom_start__
ldr r1,=__data_start__
ldr r2,=__data_size__
# copy this
bl notmain
b .
giving something like this
00008000 <_start>:
8000: 20001000 andcs r1, r0, r0
8004: 00008009 andeq r8, r0, r9
00008008 <reset>:
8008: 4803 ldr r0, [pc, #12] ; (8018 <reset+0x10>)
800a: 4904 ldr r1, [pc, #16] ; (801c <reset+0x14>)
800c: 4804 ldr r0, [pc, #16] ; (8020 <reset+0x18>)
800e: 4905 ldr r1, [pc, #20] ; (8024 <reset+0x1c>)
8010: 4a05 ldr r2, [pc, #20] ; (8028 <reset+0x20>)
8012: f000 f80b bl 802c <notmain>
8016: e7fe b.n 8016 <reset+0xe>
8018: 0000a008 andeq sl, r0, r8
801c: 0000000c andeq r0, r0, ip
8020: 00008058 andeq r8, r0, r8, asr r0
8024: 0000a000 andeq sl, r0, r0
8028: 00000008 andeq r0, r0, r8
0000802c <notmain>:
802c: 4b06 ldr r3, [pc, #24] ; (8048 <notmain+0x1c>)
802e: 6818 ldr r0, [r3, #0]
8030: 685b ldr r3, [r3, #4]
8032: 18c0 adds r0, r0, r3
If you then align the items in the linker script the copy/zero code gets even simpler you can stick to 1 to some number N whole registers rather than dealing with bytes or halfwords, can use ldr/str, ldrd/strd (if available) or ldm/stm (and not need ldrb/strb nor ldrh/strh), tight simple few line loops to complete the job.
I highly recommend you do not use C for your bootstrap.
Note that the ld linker script variables are very sensitive to position (inside or outside curly braces)
The above linker script is somewhat typical of what you will find in stock linker scripts a defined start and end, sometimes the size is computed in the linker script sometimes the bootstrap code computes the size or the bootstrap code can just loop until the address equals the end value, depends on the overall system design between the two.
Your specific issue BTW is you linked in two bootstraps, at the time I wrote this I dont see your command line(s) in the question so that would tell us more. That is why you are seeing the bss_start, etc, things that you didnt put in your linker script but are often found in stock ones that come with a pre-built toolchain (similar to the above but more complicated)
It could be by using gcc instead of ld and without the the various -nostartfiles options (that it pulled in crt0.o), just try ld instead of gcc and see what changes. You would have failed with the original example had it been something like this though so I dont think that is the issue here. If you used the same command lines the failure should have been on both examples not just the latter.
The book you're reading has led you astray. Discard it and start learning from another source.
I see at least four major problems with what it has told you to do:
The linker script and _start function you included is missing a number of important sections, and will either malfunction or fail to link many executables. Most notably, it lacks any handling for BSS (zero-filled) sections.
The vector table in main.c is beyond "minimal"; it lacks the required definitions for even the standard ARM interrupt vectors. Without these, debugging hardfaults will become very difficult, as the microcontroller will treat random code following the vector table as an interrupt vector when a fault occurs, which will probably lead to a secondary fault as it fails to load code from that "address".
The startup functions given by your book bypass the libc startup functions. This will cause some portions of the standard C library, as well as any C++ code, to fail to work correctly.
You are defining peripheral addresses yourself in main.c. These addresses are all defined in standard ST header files (e.g. <stm32f4xx.h>), so there is no need to define them yourself.
As a starter, I would recommend that you refer to the startup code provided by ST in any of their examples. These will all include a complete linker script and startup code.
As old_timer hinted in the comments, using gcc to link is a problem.
If you change the linker call in your batch file to use ld, it links without error. Try the following:
echo.
echo. Call the linker
echo.
#arm-none-eabi-ld main.o -o myApp.elf -T linkerscript.ld
I am doing an operating system implementation work.
Here's the code first :
//generate software interrupt
void generate_interrupt(int n) {
asm("mov al, byte ptr [n]");
asm("mov byte ptr [genint+1], al");
asm("jmp genint");
asm("genint:");
asm("int 0");
}
I am compiling above code with -masm=intel option in gcc. Also,
this is not complete code to generate software interrupt.
My problem is I am getting error as n undefined, how do I resolve it, please help?
Also it promts error at link time not at compile time, below is an image
When you are using GCC, you must use GCC-style extended asm to access variables declared in C, even if you are using Intel assembly syntax. The ability to write C variable names directly into an assembly insert is a feature of MSVC, which GCC does not copy.
For constructs like this, it is also important to use a single assembly insert, not several in a row; GCC can and will rearrange assembly inserts relative to the surrounding code, including relative to other assembly inserts, unless you take specific steps to prevent it.
This particular construct should be written
void generate_interrupt(unsigned char n)
{
asm ("mov byte ptr [1f+1], %0\n\t"
"jmp 1f\n"
"1:\n\t"
"int 0"
: /* no outputs */ : "r" (n));
}
Note that I have removed the initial mov and any insistence on involving the A register, instead telling GCC to load n into any convenient register for me with the "r" input constraint. It is best to do as little as possible in an assembly insert, and to leave the choice of registers to the compiler as much as possible.
I have also changed the type of n to unsigned char to match the actual requirements of the INT instruction, and I am using the 1f local label syntax so that this works correctly if generate_interrupt is made an inline function.
Having said all that, I implore you to find an implementation strategy for your operating system that does not involve self-modifying code. Well, unless you plan to get a whole lot more use out of the self-modifications, anyway.
This isn't an answer to your specific question about passing parameters into inline assembly (see #zwol's answer). This addresses using self modifying code unnecessarily for this particular task.
Macro Method if Interrupt Numbers are Known at Compile-time
An alternative to using self modifying code is to create a C macro that generates the specific interrupt you want. One trick is you need to a macro that converts a number to a string. Stringize macros are quite common and documented in the GCC documentation.
You could create a macro GENERATE_INTERRUPT that looks like this:
#define STRINGIZE_INTERNAL(s) #s
#define STRINGIZE(s) STRINGIZE_INTERNAL(s)
#define GENERATE_INTERRUPT(n) asm ("int " STRINGIZE(n));
STRINGIZE will take a numeric value and convert it into a string. GENERATE_INTERRUPT simply takes the number, converts it to a string and appends it to the end of the of the INT instruction.
You use it like this:
GENERATE_INTERRUPT(0);
GENERATE_INTERRUPT(3);
GENERATE_INTERRUPT(255);
The generated instructions should look like:
int 0x0
int3
int 0xff
Jump Table Method if Interrupt Numbers are Known Only at Run-time
If you need to call interrupts only known at run-time then one can create a table of interrupt calls (using int instruction) followed by a ret. generate_interrupt would then simply retrieve the interrupt number off the stack, compute the position in the table where the specific int can be found and jmp to it.
In the following code I get GNU assembler to generate the table of 256 interrupt call each followed by a ret using the .rept directive. Each code fragment fits in 4 bytes. The result code generation and the generate_interrupt function could look like:
/* We use GNU assembly to create a table of interrupt calls followed by a ret
* using the .rept directive. 256 entries (0 to 255) are generated.
* generate_interrupt is a simple function that takes the interrupt number
* as a parameter, computes the offset in the interrupt table and jumps to it.
* The specific interrupted needed will be called followed by a RET to return
* back from the function */
extern void generate_interrupt(unsigned char int_no);
asm (".pushsection .text\n\t"
/* Generate the table of interrupt calls */
".align 4\n"
"int_jmp_table:\n\t"
"intno=0\n\t"
".rept 256\n\t"
"\tint intno\n\t"
"\tret\n\t"
"\t.align 4\n\t"
"\tintno=intno+1\n\t"
".endr\n\t"
/* generate_interrupt function */
".global generate_interrupt\n" /* Give this function global visibility */
"generate_interrupt:\n\t"
#ifdef __x86_64__
"movzx edi, dil\n\t" /* Zero extend int_no (in DIL) across RDI */
"lea rax, int_jmp_table[rip]\n\t" /* Get base of interrupt jmp table */
"lea rax, [rax+rdi*4]\n\t" /* Add table base to offset = jmp address */
"jmp rax\n\t" /* Do sepcified interrupt */
#else
"movzx eax, byte ptr 4[esp]\n\t" /* Get Zero extend int_no (arg1 on stack) */
"lea eax, int_jmp_table[eax*4]\n\t" /* Compute jump address */
"jmp eax\n\t" /* Do specified interrupt */
#endif
".popsection");
int main()
{
generate_interrupt (0);
generate_interrupt (3);
generate_interrupt (255);
}
If you were to look at the generated code in the object file you'd find the interrupt call table (int_jmp_table) looks similar to this:
00000000 <int_jmp_table>:
0: cd 00 int 0x0
2: c3 ret
3: 90 nop
4: cd 01 int 0x1
6: c3 ret
7: 90 nop
8: cd 02 int 0x2
a: c3 ret
b: 90 nop
c: cc int3
d: c3 ret
e: 66 90 xchg ax,ax
10: cd 04 int 0x4
12: c3 ret
13: 90 nop
...
[snip]
Because I used .align 4 each entry is padded out to 4 bytes. This makes the address calculation for the jmp easier.
I - an embedded beginner - am fighting my way through the black magic world of embedded programming. So far I won already a bunch of fights, but this new bug seems to be a hard one.
First, my embedded setup:
Olimex STM32-P207 (STM32F207)
Olimex ARM-USB-OCD-H JTAG
OpenOCD
Eclipse (with CDT and GDB hardware debugging)
Codesourcery Toolchain
Startup file and linker script (adapted memory map for the STM32F207) for RIDE (what uses GCC)
STM32F2xx_StdPeriph_Lib_V1.1.0
Using the many tutorials and Q&As out there I was able to set-up makefile, linker and startup code and got some simple examples running using STM's standard library (classic blinky, using buttons and interrupts etc.). However, once I started playing around with the SysTick interrupts, things got messy.
If add the SysTick_Config() call to my code (even with an empty SysTick_Handler), ...
int main(void)
{
/*!< At this stage the microcontroller clock setting is already configured,
[...]
*/
if (SysTick_Config(SystemCoreClock / 120000))
{
// Catch config error
while(1);
}
[...]
... then my debugger starts at the (inline) function NVIC_SetPriority() and once I hit "run" I end up in the HardFault_Handler().
This only happens, when using the debugger. Otherwise the code runs normal (telling from the blinking LEDs).
I already read a lot and tried a lot (modifying compiler options, linker, startup, trying other code with SysTick_Config() calls), but nothing solved this problem.
One thing could be a hint:
The compiler starts in both cases (with and without the SysTick_Config call) at 0x00000184. In the case without the SysTick_Config call this points at the beginnig of main(). Using SysTick_Config this pionts at NVIC_SetPriority().
Does anybody have a clue what's going on? Any hint about where I could continue my search for a solution?
I don't know what further information would be helpful to solve this riddle. Please let me know and I will be happy to provide the missing pieces.
Thanks a lot!
/edit1: Added results of arm-none-eabi-readelf, -objdump and -size.
/edit2: I removed the code info to make space for the actual code. With this new simplified version debugging starts at
08000184: stmdaeq r0, {r4, r6, r8, r9, r10, r11, sp}
readelf:
[ 2] .text PROGBITS 08000184 008184 002dcc 00 AX 0 0 4
...
2: 08000184 0 SECTION LOCAL DEFAULT 2
...
46: 08000184 0 NOTYPE LOCAL DEFAULT 2 $d
main.c
/* Includes ------------------------------------------------------------------*/
#include "stm32f2xx.h"
/* Private typedef -----------------------------------------------------------*/
/* Private define ------------------------------------------------------------*/
/* Private macro -------------------------------------------------------------*/
/* Private variables ---------------------------------------------------------*/
uint16_t counter = 0;
/* Private function prototypes -----------------------------------------------*/
/* Private functions ---------------------------------------------------------*/
void assert_failed(uint8_t* file, uint32_t line);
void Delay(__IO uint32_t nCount);
void SysTick_Handler(void);
/**
* #brief Main program
* #param None
* #retval None
*/
int main(void)
{
if (SysTick_Config(SystemCoreClock / 12000))
{
// Catch config error
while(1);
}
/*
* Configure the LEDs
*/
RCC_AHB1PeriphClockCmd(RCC_AHB1Periph_GPIOF, ENABLE); // LEDs
GPIO_InitTypeDef GPIO_InitStructure;
GPIO_InitStructure.GPIO_Pin = GPIO_Pin_6 | GPIO_Pin_7 | GPIO_Pin_8 | GPIO_Pin_9;
GPIO_InitStructure.GPIO_Mode = GPIO_Mode_OUT;
GPIO_InitStructure.GPIO_OType = GPIO_OType_PP;
GPIO_InitStructure.GPIO_Speed = GPIO_Speed_2MHz;
GPIO_InitStructure.GPIO_PuPd = GPIO_PuPd_NOPULL;
GPIO_Init(GPIOF, &GPIO_InitStructure);
GPIOF->BSRRL = GPIO_Pin_6;
while (1)
{
if (GPIO_ReadInputDataBit(GPIOF, GPIO_Pin_9) == SET)
{
GPIOF->BSRRH = GPIO_Pin_9;
}
else
{
GPIOF->BSRRL = GPIO_Pin_9;
}
Delay(500000);
}
return 0;
}
#ifdef USE_FULL_ASSERT
/**
* #brief Reports the name of the source file and the source line number
* where the assert_param error has occurred.
* #param file: pointer to the source file name
* #param line: assert_param error line source number
* #retval None
*/
void assert_failed(uint8_t* file, uint32_t line)
{
/* User can add his own implementation to report the file name and line number,
ex: printf("Wrong parameters value: file %s on line %d\r\n", file, line) */
/* Infinite loop */
while (1)
{
}
}
#endif
/**
* #brief Delay Function.
* #param nCount:specifies the Delay time length.
* #retval None
*/
void Delay(__IO uint32_t nCount)
{
while (nCount > 0)
{
nCount--;
}
}
/**
* #brief This function handles Hard Fault exception.
* #param None
* #retval None
*/
void HardFault_Handler(void)
{
/* Go to infinite loop when Hard Fault exception occurs */
while (1)
{
}
}
/**
* #brief This function handles SysTick Handler.
* #param None
* #retval None
*/
void SysTick_Handler(void)
{
if (counter > 10000 )
{
if (GPIO_ReadInputDataBit(GPIOF, GPIO_Pin_8) == SET)
{
GPIOF->BSRRH = GPIO_Pin_8;
}
else
{
GPIOF->BSRRL = GPIO_Pin_8;
}
counter = 0;
}
else
{
counter++;
}
}
/edit3:
Soultion:
Because the solution is burrowed in a comment, I put it up here:
My linker file was missing ENTRY(your_function_of_choice); (e.g. Reset_Handler). Adding this made my debugger work again (it now starts at the right point).
Thanks everybody!
I have a strong feeling that the entry point isn't specified correctly in the compiler options or linker script...
And whatever code comes first at link time in the ".text" section, it gets to have the entry point.
The entry point, however, should point to a special "init" function that would set up the stack, initialize the ".bss" section (and maybe some other sections as well), initialize any hardware that's necessary for basic operation (e.g. the interrupt controller and maybe some system timer) and any remaining portions of the and standard libraries before actually transferring control to main().
I'm not familiar with the tools and your hardware, so I can't say exactly what that special "init" function is, but that's pretty much the problem. It's not being pointed to by the entry point of the compiled program. NVIC_SetPriority() doesn't make any sense there.
How to write this assembly code as inline assembly? Compiler: gcc(i586-elf-gcc). The GAS syntax confuses me. Please give tell me how to write this as inline assembly that works for gcc.
.set_video_mode:
mov ah,00h
mov al,13h
int 10h
.init_mouse:
mov ax,0
int 33h
Similar one I have in assembly. I wrote them separate as assembly routines to call them from my C program. I need to call these and some more interrupts from C itself.
Also I need to put some values in some registers depending on which interrupt routine I'm calling. Please tell me how to do it.
All that I want to do is call interrupt routines from C. It's OK for me even to do it using int86() but i don't have source code of that function.
I want int86() so that i can call interrupts from C.
I am developing my own tiny OS so i got no restrictions for calling interrupts or for any direct hardware access.
I've not tested this, but it should get you started:
void set_video_mode (int x, int y) {
register int ah asm ("ah") = x;
register int al asm ("al") = y;
asm volatile ("int $0x10"
: /* no outputs */
: /* no inputs */
: /* clobbers */ "ah", "al");
}
I've put in two 'clobbers' as an example, but you'll need to set the correct list of clobbers so that the compiler knows you've overwritten register values (maybe none).
First, keep in mind GCC doesn't support 16-bit code yet, so you'll end up compiling 32-bit code in 16-bit mode, which is very inefficient but doable (it is used, for example, by Linux and SeaBIOS). It can be done with the following at the begging of each file:
__asm__ (".code16gcc");
Newer GCC versions (since 4.9 IIRC) support the -m16 flag that does the same thing.
Also, there's no mouse driver available unless you load it previous to your kernel running init_mouse.
You seem to be using an API commonly available in several x86 DOS.
asm can take care of the register assignments, so the code can be reduced to:
void set_video_mode(int mode)
{
mode &= 255;
__asm__ __volatile__ (
"int $0x10"
: "+a" (mode) /* %eax = mode & 255 => %ah = 0, %al = mode */
);
}
void init_mouse(void)
{
/* XXX it is really important to check the IDT entry isn't 0 */
int tmp = 0;
__asm__ __volatile__ (
"int $0x33"
: "+a" (tmp) /* %eax = 0*/
:: "ebx" /* %ebx is also clobbered by DOS mouse drivers */
);
}
The asm statement is documented in the GCC manual, although perhaps not in enough depth and lacks x86 examples. The outputs (after first colon) have a distinctively obscure syntax, while the rest is far easier to understand (the second colon specifies the inputs and the third the clobbered registers, flags and/or memory).
The outputs must be prefixed with =, meaning you don't care the previous value it may have had, or +, meaning you want to use it as an input too. In this context we use that instead of an input because the value is modified by the interrupt and you're not allowed to specify input registers in the clobbered list (because the compiler is forbidden from using them).