Shared Memory's atomicAdd with int and float have different SASS

While profiling with nv-nsight-cu-cli, I ran into a performance issue: atomicAdd on a float in shared memory is much more expensive than atomicAdd on an int.
After checking the generated SASS, I found that the code generated for the shared-memory atomicAdd on float and on int is not similar at all.
Here is a minimal CUDA example:
$ cat test.cu
__global__ void testAtomicInt() {
__shared__ int SM_INT;
SM_INT = 0;
__syncthreads();
atomicAdd(&(SM_INT), ((int)1));
}
__global__ void testAtomicFloat() {
__shared__ float SM_FLOAT;
SM_FLOAT = 0.0;
__syncthreads();
atomicAdd(&(SM_FLOAT), ((float)1.1));
}
$ nvcc -arch=sm_86 -c test.cu
$ cuobjdump -sass test.o
Fatbin elf code:
================
arch = sm_86
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_86
Function : _Z15testAtomicFloatv
.headerflags @"EF_CUDA_SM86 EF_CUDA_PTX_SM(EF_CUDA_SM86)"
/*0000*/ MOV R1, c[0x0][0x28] ; /* 0x00000a0000017a02 */
/* 0x000fc40000000f00 */
/*0010*/ STS [RZ], RZ ; /* 0x000000ffff007388 */
/* 0x000fe80000000800 */
/*0020*/ BAR.SYNC 0x0 ; /* 0x0000000000007b1d */
/* 0x000fec0000000000 */
/*0030*/ LDS R2, [RZ] ; /* 0x00000000ff027984 */
/* 0x000e240000000800 */
/*0040*/ FADD R3, R2, 1.1000000238418579102 ; /* 0x3f8ccccd02037421 */
/* 0x001fcc0000000000 */
/*0050*/ ATOMS.CAST.SPIN R3, [RZ], R2, R3 ; /* 0x00000002ff03738d */
/* 0x000e240001800003 */
/*0060*/ ISETP.EQ.U32.AND P0, PT, R3, 0x1, PT ; /* 0x000000010300780c */
/* 0x001fda0003f02070 */
/*0070*/ @!P0 BRA 0x30 ; /* 0xffffffb000008947 */
/* 0x000fea000383ffff */
/*0080*/ EXIT ; /* 0x000000000000794d */
/* 0x000fea0003800000 */
/*0090*/ BRA 0x90; /* 0xfffffff000007947 */
/* 0x000fc0000383ffff */
/*00a0*/ NOP; /* 0x0000000000007918 */
..........
Function : _Z13testAtomicIntv
.headerflags @"EF_CUDA_SM86 EF_CUDA_PTX_SM(EF_CUDA_SM86)"
/*0000*/ MOV R1, c[0x0][0x28] ; /* 0x00000a0000017a02 */
/* 0x000fc40000000f00 */
/*0010*/ STS [RZ], RZ ; /* 0x000000ffff007388 */
/* 0x000fe80000000800 */
/*0020*/ BAR.SYNC 0x0 ; /* 0x0000000000007b1d */
/* 0x000fec0000000000 */
/*0030*/ ATOMS.POPC.INC.32 RZ, [URZ] ; /* 0x00000000ffff7f8c */
/* 0x000fe2000d00003f */
/*0040*/ EXIT ; /* 0x000000000000794d */
/* 0x000fea0003800000 */
/*0050*/ BRA 0x50; /* 0xfffffff000007947 */
/* 0x000fc0000383ffff */
/*0060*/ NOP; /* 0x0000000000007918 */
..........
Fatbin ptx code:
================
arch = sm_86
code version = [7,5]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
From the generated SASS above, it is clear that the shared-memory atomicAdd on int compiles to a single lightweight ATOMS.POPC.INC.32 RZ, [URZ], while the float version compiles to a sequence of instructions around a heavyweight ATOMS.CAST.SPIN R3, [RZ], R2, R3.
The CUDA Binary Utilities documentation doesn't explain what CAST or SPIN means. My guess is that it is an exclusive spin lock on a shared memory address. (Correct me if my guess is wrong.)
In my real code, none of the SASS for the int atomicAdd shows up as a hotspot. However, ATOMS.CAST.SPIN is significantly hotter than the other SASS instructions generated for the float atomicAdd.
In addition, I tested with the compiler flags -arch=sm_86, -arch=sm_80 and -arch=sm_75. For those compute capabilities, the generated SASS for the float atomicAdd is very similar. Unsurprisingly, atomicAdd on double generates SASS similar to the float version.
This observation left me more confused than informed. I'll start with some simple questions from my profiling experience and hope we can have a good discussion.
What exactly does ATOMS.CAST.SPIN do? The only SASS document I am aware of is the CUDA Binary Utilities.
Why does atomicAdd on float generate more SASS code and do more work than atomicAdd on int? I know this is a general question and hard to answer. Maybe ATOMS.POPC.INC simply doesn't apply to the float or double data types?
Is atomicAdd on float more prone to shared memory load/store conflicts, and thus to more stall time, than atomicAdd on int? The former clearly has more instructions to execute, plus divergent branches. I have the following code snippet in my project, where both calls are executed the same number of times. However, the float atomicAdd becomes a runtime bottleneck while the int one doesn't.
atomicAdd(&(SM_INT), ((int)1)); // no hotspot
atomicAdd(&(SM_FLOAT), ((float)1.1)); // a hotspot

I probably won't be able to provide an answer addressing every possible question. CUDA SASS is really not documented to the level to explain these things.
What exactly does ATOMS.CAST.SPIN do? The only SASS document I am aware of is the CUDA Binary Utilities.
ATOMS.CAST.SPIN
^^^^^ ^^^
|   | |
|   | compare and swap
|   shared
atomic
The programming guide gives an indication of how one can implement an "arbitrary" atomic operation, using atomic CAS (Compare And Swap). You should first familiarize yourself with how atomic CAS works.
Regarding the "arbitrary atomic" example, the thing to note is that it can evidently be used to provide atomic operations for e.g. datatypes that are not supported by a "native" atomic instruction, such as atomic add. Another thing to note is that it is essentially a loop around an atomic CAS instruction, with the loop checking to see if the operation was "successful" or not. If it was "unsuccessful", the loop continues. If it was "successful", the loop exits.
This is effectively what we see depicted in SASS code in your float example:
/*0030*/ LDS R2, [RZ] ; // get the current value in the location
FADD R3, R2, 1.1000000238418579102 ; // perform ordinary floating-point add
ATOMS.CAST.SPIN R3, [RZ], R2, R3 ; // attempt to atomically replace the result in the location
ISETP.EQ.U32.AND P0, PT, R3, 0x1, PT ; // check if replacement was successful
@!P0 BRA 0x30 // if not, loop and try again
These are essentially the steps that are outlined in the "arbitrary atomic" example in the programming guide. Based on this I would conclude the following:
the architecture you compiled for does not actually have a "native" atomic operation of the type you are requesting
the atomic operation you are requesting can be done using the looping method
the compiler tool chain (typically ptxas, but could also be the JIT system), as a convenience feature, is automatically implementing this looping method for you, rather than throwing a compile error
Why does atomicAdd on float generate more SASS code and do more work than atomicAdd on int?
Evidently, the architecture you are compiling for does not have a "native" implementation of atomic add for float, and so the compiler tool chain has chosen to implement this looping method for you. Since the loop effectively involves the possibility of success/failure which will determine whether this loop continues, and success/failure depends on other threads' behavior (contention to perform the atomic), the looping method may do considerably more "work" than a native single instruction will.
Is atomicAdd on float more prone to shared memory load/store conflicts, and thus to more stall time, than atomicAdd on int?
Yes, I personally would conclude that the native atomic method is more efficient, and the looping method may be less efficient, which could be expressed in a variety of ways in the profiler, such as warp stalls.
EDIT:
It's possible for things to be implemented/available in one GPU architecture but not another. This is certainly applicable to atomics, and you can see examples of this if you read the previously linked section on atomics in the programming guide. I don't know of any architectures today that are "newer" than cc8.0 or cc8.6 (Ampere) but it is certainly possible that the behavior of a future (or any other) GPU could be different here.
This loop-around-atomicCAS method is distinct from a previous methodology (lock/update/unlock, which also involves a loop for lock negotiation) the compiler toolchain used on Kepler and prior architectures to provide atomics on shared memory when no formal SASS instructions existed to do so.

what type of code can trigger unaligned data access sigbus trap dynamically?

I am investigating a SIGBUS on unaligned data access. I am tracking down one of these errors and would like to know how it can happen on a Sitara AM335x. Can someone please give example code that demonstrates this or reliably triggers it?
Adding code snippet:
int Read( void *value, uint32_t *size, const uint32_t baseAddress )
{
uint8_t *userDataAddress = (uint8_t *)( baseAddress + sizeof( DBANode ));
memcpy( value, userDataAddress, ourDataSize );
*size = ourDataSize;
return 0;
}
DBANode is a class object of 20 bytes.
baseAddress is the address of an mmap of a shared memory file, which also holds an object of type DBANode; it is cast to uint32_t so that the arithmetic can be done.
This is the disassembly of the section:
91a8: e51b3010 ldr r3, [fp, #-16]
91ac: e5933000 ldr r3, [r3]
91b0: e51b0014 ldr r0, [fp, #-20] ; 0xffffffec
91b4: e51b1008 ldr r1, [fp, #-8]
91b8: e1a02003 mov r2, r3
91bc: ebffe72b bl 2e70 <memcpy@plt>
91c0: e51b3010 ldr r3, [fp, #-16]
91c4: e5932000 ldr r2, [r3]
91c8: e51b3018 ldr r3, [fp, #-24] ; 0xffffffe8
91cc: e5832000 str r2, [r3]
00002e70 <memcpy@plt>:
2e70: e28fc600 add ip, pc, #0, 12
2e74: e28cca08 add ip, ip, #8, 20 ; 0x8000
2e78: e5bcf868 ldr pc, [ip, #2152]! ; 0x868
When the exact same code base was rebuilt, the problem just disappeared. Can gcc generate two different instruction sequences from the same code with the same -O0 optimization level?
I also diffed the objdump output of the library .so files from both builds. They are exactly the same. The API is used quite often. However, the crash only happens after prolonged use over a few days. I am reading the same node every 500 ms. So this is not consistent.
Should I be looking at pointer corruption?

It turns out that baseAddress is the issue. As I mentioned, it is an mmap of a shared memory location, and the mmap can fail. A failed mmap returns MAP_FAILED, i.e. (void *)-1, but the code was checking for NULL and then proceeding to write to -1, i.e. 0xFFFFFFFF, causing a SIGBUS.
SIGBUS code 1 is seen when we use memcpy. Any other access, such as direct byte addressing, gives SIGBUS code 3.
I am still not sure why it triggers SIGBUS instead of SIGSEGV. Shouldn't this be a memory violation instead?
Here is an example:
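#include <cstdint>
#include <cstdio>
#include <cstring>
#include <iostream>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
int main(int argc, char **argv)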
{
// Shared memory example
const char *NAME = "SharedMemory";
const int SIZE = 10 * sizeof(uint8_t);
uint8_t src[]={0x11,0x22,0x33,0x44,0x55,0x66,0x77,0x88,0x99,0x00};
int shm_fd = -1;
shm_fd = shm_open(NAME, O_CREAT | O_RDONLY, 0666);
ftruncate(shm_fd, SIZE);
// Map shared memory segment to address space
uint8_t *ptr = (uint8_t *) mmap(0, SIZE, PROT_READ | PROT_WRITE | _NOCACHE, MAP_SHARED, shm_fd, 0);
if(ptr == MAP_FAILED)
{
std::cerr << "ERROR in mmap()" << std::endl;
// return -1;
}
printf("ptr = %p\n", (void *)ptr);
std::cout << "Now storing data to mmap() memory" << std::endl;
#if 0
ptr[0] = 0x11;
ptr[1] = 0x22;
ptr[2] = 0x33;
ptr[3] = 0x44;
ptr[4] = 0x55;
ptr[5] = 0x66;
ptr[6] = 0x77;
ptr[7] = 0x88;
ptr[8] = 0x99;
ptr[9] = 0x00;
#endif
memcpy(ptr,src,SIZE); //causes sigbus code 1
shm_unlink(NAME);
}
I still do not know why mmap is failing on a shared memory object, even though I have 100 MB of RAM available, all my resource limits are set to unlimited, and over 400 file descriptors are still available out of a limit of 1000.

From the Cortex-A8 Technical Reference Manual:
The processor supports loads and stores of unaligned words and
halfwords. The processor makes the required number of memory accesses
and transfers adjacent bytes transparently.
Note Data accesses that cross a word boundary can add to the access time.
Setting the A bit in the CP15 c1 Control Register enables alignment
checking. When the A bit is set to 1, two types of memory access
generate a Data Abort signal and an Alignment fault status code:
a 16-bit access that is not halfword-aligned
a 32-bit load or store that is not word-aligned
Alignment fault detection is a mandatory address-generation function
rather than an optionally supported function of external memory
management hardware. See the ARM Architecture Reference Manual for
more information on unaligned data access support.
From the ARM ARM, instructions which always generate an alignment fault if not aligned to the transfer size:
LDREX, STREX, LDREXD, STREXD, LDM, STM, LDRD, RFE, SRS, STRD, SWP, LDC, LDC2, STC, STC2, VLDM, VLDR, VPOP, VPUSH, VSTM, VSTR.
Also, most PUSH, POP and VLDx where :align: is specified.
Further,
In an implementation that includes the Virtualization Extensions, an
unaligned access to Device or Strongly-ordered memory always causes an
Alignment fault Data Abort exception
As in the linked question, structs are the most obvious way to cause 'intended' unaligned accesses, but any corruption of the stack pointer or other variable pointer would also give the same result. How the core is configured determines whether normal single word/halfword accesses are just slow or trigger a fault.
If you have access to the ETM trace, you would be able to identify the exact accesses. It seems that part has ETM/ETB (so no fancy trace capture device is required), but I've no idea how easy it will be to get tools to work with it.
As regards what code can trigger this, yes, even memcpy() could be a problem. Since the ARM instruction set has optimisations for transferring multiple registers (or register pairs in AA64), the optimised library functions will prefer to 'stream' data rather than perform byte-by-byte loads and stores. Depending on the data structures and compilation target, it is perfectly possible to end up with an illegal LDM to an unaligned address.

gcc with intel x86-32 bit assembly : accessing C function arguments

I am working on an operating system implementation.
Here's the code first:
//generate software interrupt
void generate_interrupt(int n) {
asm("mov al, byte ptr [n]");
asm("mov byte ptr [genint+1], al");
asm("jmp genint");
asm("genint:");
asm("int 0");
}
I am compiling the above code with the -masm=intel option in gcc. Also, this is not the complete code to generate a software interrupt.
My problem is that I am getting an error saying n is undefined. How do I resolve it?
Also, the error is reported at link time, not at compile time.

When you are using GCC, you must use GCC-style extended asm to access variables declared in C, even if you are using Intel assembly syntax. The ability to write C variable names directly into an assembly insert is a feature of MSVC, which GCC does not copy.
For constructs like this, it is also important to use a single assembly insert, not several in a row; GCC can and will rearrange assembly inserts relative to the surrounding code, including relative to other assembly inserts, unless you take specific steps to prevent it.
This particular construct should be written
void generate_interrupt(unsigned char n)
{
asm ("mov byte ptr [1f+1], %0\n\t"
"jmp 1f\n"
"1:\n\t"
"int 0"
: /* no outputs */ : "r" (n));
}
Note that I have removed the initial mov and any insistence on involving the A register, instead telling GCC to load n into any convenient register for me with the "r" input constraint. It is best to do as little as possible in an assembly insert, and to leave the choice of registers to the compiler as much as possible.
I have also changed the type of n to unsigned char to match the actual requirements of the INT instruction, and I am using the 1f local label syntax so that this works correctly if generate_interrupt is made an inline function.
Having said all that, I implore you to find an implementation strategy for your operating system that does not involve self-modifying code. Well, unless you plan to get a whole lot more use out of the self-modifications, anyway.

This isn't an answer to your specific question about passing parameters into inline assembly (see @zwol's answer). This addresses using self-modifying code unnecessarily for this particular task.
Macro Method if Interrupt Numbers are Known at Compile-time
An alternative to using self-modifying code is to create a C macro that generates the specific interrupt you want. One trick is that you need a macro that converts a number to a string. Stringize macros are quite common and are described in the GCC documentation.
You could create a macro GENERATE_INTERRUPT that looks like this:
#define STRINGIZE_INTERNAL(s) #s
#define STRINGIZE(s) STRINGIZE_INTERNAL(s)
#define GENERATE_INTERRUPT(n) asm ("int " STRINGIZE(n));
STRINGIZE will take a numeric value and convert it into a string. GENERATE_INTERRUPT simply takes the number, converts it to a string and appends it to the end of the INT instruction.
You use it like this:
GENERATE_INTERRUPT(0);
GENERATE_INTERRUPT(3);
GENERATE_INTERRUPT(255);
The generated instructions should look like:
int 0x0
int3
int 0xff
Jump Table Method if Interrupt Numbers are Known Only at Run-time
If you need to call interrupts whose numbers are only known at run-time, you can create a table of interrupt calls (using the int instruction), each followed by a ret. generate_interrupt would then simply retrieve the interrupt number off the stack, compute the position in the table where the specific int can be found, and jmp to it.
In the following code I get the GNU assembler to generate the table of 256 interrupt calls, each followed by a ret, using the .rept directive. Each code fragment fits in 4 bytes. The resulting code and the generate_interrupt function could look like:
/* We use GNU assembly to create a table of interrupt calls followed by a ret
* using the .rept directive. 256 entries (0 to 255) are generated.
* generate_interrupt is a simple function that takes the interrupt number
* as a parameter, computes the offset in the interrupt table and jumps to it.
* The specific interrupt needed will be called followed by a RET to return
* back from the function */
extern void generate_interrupt(unsigned char int_no);
asm (".pushsection .text\n\t"
/* Generate the table of interrupt calls */
".align 4\n"
"int_jmp_table:\n\t"
"intno=0\n\t"
".rept 256\n\t"
"\tint intno\n\t"
"\tret\n\t"
"\t.align 4\n\t"
"\tintno=intno+1\n\t"
".endr\n\t"
/* generate_interrupt function */
".global generate_interrupt\n" /* Give this function global visibility */
"generate_interrupt:\n\t"
#ifdef __x86_64__
"movzx edi, dil\n\t" /* Zero extend int_no (in DIL) across RDI */
"lea rax, int_jmp_table[rip]\n\t" /* Get base of interrupt jmp table */
"lea rax, [rax+rdi*4]\n\t" /* Add table base to offset = jmp address */
"jmp rax\n\t" /* Do specified interrupt */
#else
"movzx eax, byte ptr 4[esp]\n\t" /* Zero extend int_no (arg1 on stack) */
"lea eax, int_jmp_table[eax*4]\n\t" /* Compute jump address */
"jmp eax\n\t" /* Do specified interrupt */
#endif
".popsection");
int main()
{
generate_interrupt (0);
generate_interrupt (3);
generate_interrupt (255);
}
If you were to look at the generated code in the object file you'd find the interrupt call table (int_jmp_table) looks similar to this:
00000000 <int_jmp_table>:
0: cd 00 int 0x0
2: c3 ret
3: 90 nop
4: cd 01 int 0x1
6: c3 ret
7: 90 nop
8: cd 02 int 0x2
a: c3 ret
b: 90 nop
c: cc int3
d: c3 ret
e: 66 90 xchg ax,ax
10: cd 04 int 0x4
12: c3 ret
13: 90 nop
...
[snip]
Because I used .align 4 each entry is padded out to 4 bytes. This makes the address calculation for the jmp easier.

STM32F1 RTC_EnterConfigMode not always setting Config Mode

Preface: This is all being evaluated when attached to the target using an ST-Link and in debug mode in IAR Embedded Workbench IDE.
The Real Time Clock in the STM32F1 is supported in the Standard Peripheral Libraries provided by STM. I'm trying to set the RTC to 107301722, or "Sat, 26 May 2013 22:02:02 GMT", using RTC_SetCounter().
void RTC_SetCounter(uint32_t CounterValue) /*From Std Periph Lib */
{
RTC_EnterConfigMode();
/* Set RTC COUNTER MSB word */
RTC->CNTH = CounterValue >> 16;
/* Set RTC COUNTER LSB word */
RTC->CNTL = (CounterValue & RTC_LSB_MASK);
RTC_ExitConfigMode();
}
Note that it calls RTC_EnterConfigMode(), which is a requirement for modifying RTC register values: "To write in the RTC_PRL, RTC_CNT, RTC_ALR registers, the peripheral must enter Configuration Mode. This is done by setting the CNF bit in the RTC_CRL register."
void RTC_EnterConfigMode(void) /*From Std Periph Lib */
{
/* Set the CNF flag to enter in the Configuration Mode */
RTC->CRL |= RTC_CRL_CNF;
}
This is the code for entering config mode. Simple enough. And here's the disassembly (no optimizations are enabled). The 0x10 is the bit mask of the CNF flag (bit 4).
// RTC->CRL |= RTC_CRL_CNF;
RTC_EnterConfigMode:
0x8053ed6: 0x4829 LDR.N R0, ??DataTable13_1 ; RTC_CRL
0x8053ed8: 0x8800 LDRH R0, [R0]
0x8053eda: 0xf050 0x0010 ORRS.W R0, R0, #16 ; 0x10
0x8053ede: 0x4927 LDR.N R1, ??DataTable13_1 ; RTC_CRL
0x8053ee0: 0x8008 STRH R0, [R1]
//}
0x8053ee2: 0x4770 BX LR
What I've found is if I break anywhere from the call to RTC_SetCounter() to the disassembly at line 0x8053ee0, Config Mode gets enabled, but if I move the breakpoint to the disassembly at line 0x8053ee2 or later, Config Mode does not get set, and therefore the RTC does not get set.
I have not tried anything in the realm of trying to analyze what happens in a non-debug setting simply because part of what I'm working toward is a unit test involving setting the time. The unit test will require debugger attachment.
Is this strictly a debugger problem? Are there any rational reasons to explain this behavior that could lead to a workable solution?

It turns out that I had overlooked a very important function, provided to let pending RTC register operations complete: RTC_WaitForLastTask().
/**
* @brief Waits until last write operation on RTC registers has finished.
* @note This function must be called before any write to RTC registers.
* @param None
* @retval None
*/
void RTC_WaitForLastTask(void)
{
/* Loop until RTOFF flag is set */
while ((RTC->CRL & RTC_FLAG_RTOFF) == (uint16_t)RESET)
{
}
}
If I had paid more attention to the other register flags that were set in RTC_CRL, I might have noticed that RTOFF was an issue.

atomic_inc and atomic_xchg in gcc assembly

I have written the following user-level code snippet to test two helper functions, atomic inc and xchg (modelled on the Linux code).
I only need to operate on 32-bit integers, which is why I explicitly use int32_t.
I assume global_counter will be contended by different threads, while tmp_counter is fine.
#include <stdio.h>
#include <stdint.h>
int32_t global_counter = 10;
/* Increment the value pointed by ptr */
void atomic_inc(int32_t *ptr)
{
__asm__("incl %0;\n"
: "+m"(*ptr));
}
/*
* Atomically exchange the val with *ptr.
* Return the value previously stored in *ptr before the exchange
*/
int32_t atomic_xchg(uint32_t *ptr, uint32_t val)
{
uint32_t tmp = val;
__asm__(
"xchgl %0, %1;\n"
: "=r"(tmp), "+m"(*ptr)
: "0"(tmp)
:"memory");
return tmp;
}
int main()
{
int32_t tmp_counter = 0;
printf("Init global=%d, tmp=%d\n", global_counter, tmp_counter);
atomic_inc(&tmp_counter);
atomic_inc(&global_counter);
printf("After inc, global=%d, tmp=%d\n", global_counter, tmp_counter);
tmp_counter = atomic_xchg(&global_counter, tmp_counter);
printf("After xchg, global=%d, tmp=%d\n", global_counter, tmp_counter);
return 0;
}
My two questions are:
Are these two helper functions written properly?
Will this behave the same when I compile it on a 32-bit or a 64-bit platform? For example, could the pointer address have a different length, or could incl and xchgl conflict with the operand size?

My understanding of this question is below; please correct me if I'm wrong.
Read-modify-write instructions (e.g. incl, add) need a lock prefix to be atomic with respect to other CPUs. The lock prefix keeps other CPUs from accessing the memory during the operation by asserting the LOCK# signal on the memory bus.
The __xchg function in the Linux kernel uses no "lock" prefix because xchg with a memory operand always implies lock anyway. http://lxr.linux.no/linux+v2.6.38/arch/x86/include/asm/cmpxchg_64.h#L15
However, the incl used in atomic_inc has no such implicit locking, so a lock prefix is needed.
http://lxr.linux.no/linux+v2.6.38/arch/x86/include/asm/atomic.h#L105
By the way, I think you need to copy *ptr to a volatile variable to avoid gcc optimizing the access away.
William

Rewrite Intel-style assembly code into GCC inline assembly

How do I write this assembly code as inline assembly? Compiler: gcc (i586-elf-gcc). The GAS syntax confuses me. Please tell me how to write this as inline assembly that works with gcc.
.set_video_mode:
mov ah,00h
mov al,13h
int 10h
.init_mouse:
mov ax,0
int 33h
I have similar routines in assembly. I wrote them as separate assembly routines to call from my C program. I need to call these and some more interrupts from C itself.
I also need to put values in certain registers depending on which interrupt routine I'm calling. Please tell me how to do it.
All I want to do is call interrupt routines from C. It would even be OK to do it using int86(), but I don't have the source code of that function.
I want int86() so that I can call interrupts from C.
I am developing my own tiny OS, so I have no restrictions on calling interrupts or on direct hardware access.

I've not tested this, but it should get you started:
void set_video_mode (int x, int y) {
register int ah asm ("ah") = x;
register int al asm ("al") = y;
asm volatile ("int $0x10"
: /* no outputs */
: /* no inputs */
: /* clobbers */ "ah", "al");
}
I've put in two 'clobbers' as an example, but you'll need to set the correct list of clobbers so that the compiler knows you've overwritten register values (maybe none).

First, keep in mind GCC doesn't support 16-bit code yet, so you'll end up compiling 32-bit code in 16-bit mode, which is very inefficient but doable (it is used, for example, by Linux and SeaBIOS). It can be done with the following at the beginning of each file:
__asm__ (".code16gcc");
Newer GCC versions (since 4.9 IIRC) support the -m16 flag that does the same thing.
Also, there's no mouse driver available unless you load one before your kernel runs init_mouse.
You seem to be using an API commonly available in several x86 DOS systems.
asm can take care of the register assignments, so the code can be reduced to:
void set_video_mode(int mode)
{
mode &= 255;
__asm__ __volatile__ (
"int $0x10"
: "+a" (mode) /* %eax = mode & 255 => %ah = 0, %al = mode */
);
}
void init_mouse(void)
{
/* XXX it is really important to check the IDT entry isn't 0 */
int tmp = 0;
__asm__ __volatile__ (
"int $0x33"
: "+a" (tmp) /* %eax = 0*/
:: "ebx" /* %ebx is also clobbered by DOS mouse drivers */
);
}
The asm statement is documented in the GCC manual, although perhaps not in enough depth, and it lacks x86 examples. The outputs (after the first colon) have a distinctively obscure syntax, while the rest is far easier to understand (the second colon specifies the inputs and the third the clobbered registers, flags and/or memory).
The outputs must be prefixed with =, meaning you don't care about the previous value it may have had, or +, meaning you want to use it as an input too. In this context we use that instead of an input because the value is modified by the interrupt and you're not allowed to specify input registers in the clobbered list (because the compiler is forbidden from using them).
