Inline assembly using insl instruction - passing variables in C to registers


Recently I have been reading the source code of xv6, a simple Unix-like teaching operating system. It contains one example of inline assembly whose implementation I cannot work out. The Stack Overflow question "How to use ins instruction with GNU assembler" does not answer my question either.
The following snippet is the part I have trouble understanding: how does the inline assembly for the insl instruction work?
0461 static inline void
0462 insl(int port, void *addr, int cnt)
0463 {
0464 asm volatile("cld; rep insl" :
0465 "=D" (addr), "=c" (cnt) :
0466 "d" (port), "0" (addr), "1" (cnt) :
0467 "memory", "cc");
0468 }
One call site looks like this:
8573 insl(0x1F0, dst, SECTSIZE/4);
My understanding is that the insl instruction transfers a doubleword (4 bytes) from the I/O port specified in the DX register to the memory pointed to by ES:EDI, the destination index. With the rep prefix, insl performs a block transfer of ECX doublewords. My question is: how does GCC pass the values of the variables port, addr, and cnt to the registers DX, EDI, and ECX? I tried to find more information in the GCC manual, without success. (Please forgive me as a GCC novice. Maybe I have not exhausted all the ways to find it yet.)
I guess this is related to line 0466: "d" (port), "0" (addr), and "1" (cnt). If so, I still don't understand why the first operand is listed as "d" while the next ones start from position 0 with "0" (addr) and "1" (cnt). Why not "0" (port), "1" (addr), "2" (cnt)? And what rule maps these operands to the registers DX, EDI, and ECX?
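A minimal sketch of how the operand list on line 0466 is read, assuming GCC's documented x86 constraint letters. Operands are numbered in order of appearance, outputs first, so "=D" (addr) is operand 0 (EDI) and "=c" (cnt) is operand 1 (ECX). "d" (port) is an ordinary register constraint that places port in EDX, while "0" and "1" are matching constraints meaning "use the same register as operand 0 / operand 1". Because insl needs I/O privileges, the hypothetical helper fill32 below applies the same pattern to rep stosl, which runs in user space; the plain-C fallback for non-x86 targets is likewise an assumption added only for portability.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of xv6's x86.h as quoted above):
 * stores `val` into `cnt` consecutive 32-bit words starting at `addr`. */
static inline void fill32(void *addr, uint32_t val, size_t cnt)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ volatile("cld; rep stosl"
                     : "=D"(addr), "=c"(cnt)          /* operand 0 -> (E)DI, operand 1 -> (E)CX */
                     : "a"(val), "0"(addr), "1"(cnt)  /* operand 2 -> EAX; "0"/"1" reuse the registers of operands 0/1 */
                     : "memory", "cc");               /* the asm writes memory and changes flags */
#else
    /* Portable fallback with the same effect, for non-x86 targets. */
    uint32_t *p = addr;
    while (cnt--)
        *p++ = val;
#endif
}
```

The outputs exist because rep advances EDI and decrements ECX; declaring addr and cnt as outputs tells GCC that those registers no longer hold their original values after the asm.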

Related

Intel Processor : "If CPUID.06H:EAX.[7] = 1" Meaning?

What does the below comment mean in Intel Processor data sheet? How can I check that in my driver's code in Linux?
If CPUID.06H:EAX.[7] = 1
I come across this type of statement in the comments column of the register description tables in the software developer's manual for Intel processors.
Ref: https://software.intel.com/sites/default/files/managed/22/0d/335592-sdm-vol-4.pdf
Please help me to understand the processor descriptions.
Thanks.

CPUID.06H:EAX.[7] enables the discovery of the HWP (HARDWARE-CONTROLLED PERFORMANCE STATES) support in an Intel processor.
If that feature bit isn't set, the feature does not exist, and accessing the HWP MSRs will result in a #GP exception. For other feature bits, the result may be worse: no fault, but hard-to-debug problems later on.
It means that you should check a bit in the eax register (specifically eax & (1<<7)) after running the cpuid instruction with eax=0x6 before the instruction.
To read the CPUID value in the Linux kernel you can use one of the cpuid functions, with op=6:
#include <asm/processor.h> // defines the following:
void cpuid(unsigned int op,
unsigned int *eax, unsigned int *ebx,
unsigned int *ecx, unsigned int *edx);
void cpuid_count(unsigned int op, int count,
unsigned int *eax, unsigned int *ebx,
unsigned int *ecx, unsigned int *edx);
Or for cases where you only want one register result:
unsigned int cpuid_eax(unsigned int op);
And mask the EAX returned value.
One small note: when quoting the SDM it is better to specify a section - it is a big book after all...

Question 1
What does the below comment mean in Intel Processor data sheet?
If CPUID.06H:EAX.[7] = 1
cpuid is a special processor instruction used to discover details of the processor (for example, to check the availability of some uncommon features). It implicitly uses the EAX register as its parameter and returns the result in EAX, EBX, ECX, and EDX.
In Intel manual, the general format is:
CPUID.EAX_VALUE:RETURN_REGISTER.[BIT_NUMBER] = 1 means: if you execute the CPUID instruction with the EAX register set to EAX_VALUE, the result is returned in RETURN_REGISTER; if bit BIT_NUMBER of that result is set, the feature described in the manual is present.
CPUID.06H:EAX.[7]=1 means: if you execute the CPUID instruction with EAX = 06H (6 in hexadecimal) as input, the result is returned in EAX; if bit 7 of that result is 1, the corresponding feature is available. For example, I read this in the manual:
Availability of HWP baseline resource and capability,
CPUID.06H:EAX[bit 7]: If this bit is set (that is if
CPUID.06H:EAX.[7]=1), HWP provides several new architectural MSRs:
IA32_PM_ENABLE, IA32_HWP_CAPABILITIES,
IA32_HWP_REQUEST,IA32_HWP_STATUS.
Question 2
How can I check that in my driver's code in Linux?
if(cpuid_eax(0x06) & (1<<7)){
// Good news: feature available, do your job
}else{
// Bad luck: feature not supported
return;
}
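For experimenting outside the kernel, the same check can be sketched in user space. This assumes a GCC or Clang toolchain, whose <cpuid.h> header provides __get_cpuid(); the helper names reg_bit and hwp_supported are hypothetical, and on non-x86 targets the sketch simply reports the feature as absent.

```c
#include <stdint.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang header providing __get_cpuid() */
#endif

/* Hypothetical helper: 1 if bit `bit` of `reg` is set, 0 otherwise. */
static int reg_bit(uint32_t reg, unsigned bit)
{
    return (int)((reg >> bit) & 1u);
}

/* Hypothetical helper: 1 if CPUID.06H:EAX.[7] = 1, 0 otherwise. */
static int hwp_supported(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0x06, &eax, &ebx, &ecx, &edx))
        return 0;              /* leaf 06H is not supported on this CPU */
    return reg_bit(eax, 7);    /* same test as eax & (1<<7) */
#else
    return 0;                  /* no CPUID instruction on this target */
#endif
}
```

Reading the HWP MSRs themselves still requires kernel privileges; this only covers the discovery step.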

NEON inline assembly - store query

I am trying to learn how to utilize NEON using gcc and inline assembly.
While it is confusing and slow going, I am making some progress (it's been 10 years since I last tried writing assembly).
My simple program loads a (small) vector, saturation sums it, and stores it. The problem I am having is that I cannot seem to store the result in the place I want.
When I use an unused array pointer (r) in my output list, I get an error "impossible constraint in asm". If I then create a second pointer to it (rptr), it assembles, but it re-uses an input register r2 which is a, effectively overwriting the input.
(I know my arrays are 32 elements in size and that I'm only processing one element, I plan to try to loop, or try load more registers for parallel processing next)
void vecSum()
{
//two input arrays of 32 bit types, one output
int32_t a[32];
int32_t b[32];
int32_t r[32];
//initialize
for(int cnt = 0; cnt < 32; cnt++)
{
a[cnt] = 0x33333333;
b[cnt] = 0x11111111;
r[cnt] = 0;
}
void *rptr = r;
__asm__ volatile(
"vld1.32 {d0},[%[ina]]!\n" //load the neon register with our data at a, post increment the reg
"vld1.32 {d1},[%[inb]]!\n"
"vqadd.s32 d0,d1\n" //perform the sat
"vst1.32 d0,[%[result]]\n" //store the answer
: [result]"=r" (rptr) /*r*/
: [ina] "r" (a), [inb] "r" (b)
: /*"d0", "d1", "d2"*/);
for(int g=0; g < 32; g++)
{
printf("0x[%d]%x ",g,a[g]);
}
}
Objdump:
for(int cnt = 0; cnt < 32; cnt++)
780: e3530080 cmp r3, #128 ; 0x80
784: 1afffff7 bne 768 <_Z8vecSum32v+0x28>
"vld1.32 {d1},[%[inb]]!\n"
"vqadd.s32 d0,d1\n" //perform the sat
"vst1.32 d0,[%[result]]\n"
: [result]"=r" (rptr)
: [ina] "r" (a), [inb] "r" (b)
: /*"d0", "d1", "d2"*/);
788: f422078f vld1.32 {d0}, [r2]
78c: f421178d vld1.32 {d1}, [r1]!
790: f2200011 vqadd.s32 d0, d0, d1
794: f402078f vst1.32 {d0}, [r2]
In summary, if I try vst1.32 d0,[%[result]] where result is the array r itself, I get a compilation error. If I use rptr (another pointer to r), it compiles, but it uses r2 (the array a) as the output.
Can anybody explain why I get the error outputting to r? And why the ptr to r is a?

rptr is declared as an output when it should be an input and "memory" is missing from the clobber list.
Alternatively you may put the arrays in structs and use the structs (rather than pointers) as arguments to the asm statement.

Consider if the asm contained add %[result], %[ina], %[inb]. There's no harm whatsoever in allocating r2 for both result and ina there. Since GCC doesn't go analysing the contents of the asm statement, its default assumption is that it contains a single instruction like that, so if yours is more complicated then you need to say so in order for things to work as expected.
Specifically, to prevent the problematic overlapping register allocation here, you need to be honest about the fact that your asm modifies the input registers, most simply via the + modifier (which then actually makes them outputs as far as GCC is concerned). Another unpleasant side effect of not doing that is that the compiler would assume that e.g. r1 still holds the address of b afterwards, and may generate later code relying on that, which will then go horribly wrong thanks to what the asm actually did.
Furthermore, you don't modify the result pointer, and only use its value as an input, so saying it's a write-only output operand is very wrong.
As for the issue with r, well, by specifying it as an output operand, you're saying that the asm writes a value back to that variable. Except you can't do that with an array variable in C (<languagelawyer> arrays are not modifiable lvalues) - you need to give the asm a variable which holds the address of the array and can be assigned back to, i.e. a pointer variable. The reason you can use the arrays directly as input operands, is because input operands are expressions, not variables, and an expression that evaluates to an array is automatically converted to a pointer to first element of that array (but is still not an lvalue </languagelawyer>).
All in all then, with appropriate pointer variables for a and b, suitable operands and constraints for this code as-is would look more like this:
: [ina] "+r" (aptr), [inb] "+r" (bptr)
: [result] "r" (r)
: "d0", "d1", "memory" /* getting clobbers right is also important */
Side note: if you just want to get to grips with NEON instructions rather than fighting with GCC, intrinsics are an alternative to consider.
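Putting those operand lists into a complete statement might look like the sketch below. The function name sat_add2, the local pointers aptr and bptr, and the plain-C fallback for targets without 32-bit ARM NEON are assumptions added for illustration; only the constraint layout comes from the answer above.

```c
#include <stdint.h>

/* Hypothetical helper: saturating add of the first two 32-bit lanes
 * of a and b, stored to r. */
static void sat_add2(const int32_t *a, const int32_t *b, int32_t *r)
{
#if defined(__arm__) && defined(__ARM_NEON)
    const int32_t *aptr = a, *bptr = b;    /* writable copies for post-increment */
    __asm__ volatile(
        "vld1.32 {d0}, [%[ina]]!\n\t"      /* load two lanes of a, advance the pointer */
        "vld1.32 {d1}, [%[inb]]!\n\t"      /* load two lanes of b, advance the pointer */
        "vqadd.s32 d0, d0, d1\n\t"         /* lane-wise saturating add */
        "vst1.32 {d0}, [%[result]]\n\t"    /* store both lanes to r */
        : [ina] "+r"(aptr), [inb] "+r"(bptr)   /* read and modified by the asm */
        : [result] "r"(r)                      /* pointer value is only read */
        : "d0", "d1", "memory");
#else
    /* Portable fallback with the same result on other targets. */
    for (int i = 0; i < 2; i++) {
        int64_t s = (int64_t)a[i] + b[i];
        if (s > INT32_MAX) s = INT32_MAX;
        if (s < INT32_MIN) s = INT32_MIN;
        r[i] = (int32_t)s;
    }
#endif
}
```

Because aptr and bptr are read-write operands, GCC cannot hand either of their registers to result, which removes the overlap seen in the objdump.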

Can pretty variable names be used for registers in GCC inline assembly?

I have some inline assembly. I want GCC to have total freedom in choosing GP registers to allocate. I also want to use pretty names for registers inside the assembly for ease of comprehension for future maintainers. I think I did this previously (10+ years ago) for ARM 5te but am now scratching my head while writing some AArch64 code.
In a simpler example, this is what I want:
uint32_t arg1 = 1, arg2 = 2, result;
asm volatile(
"add %result, %arg1, %arg2\n"
// Outputs:
: ???
// Inputs:
: ???
// Clobbered:
: ???
);
I figure I need the right voodoo to go where I have written "???" above.
Is it possible?

Yes.
[arg1] "r" (arg1)
is one example. The two names ([arg1] and (arg1) above) can be different.
Inside the assembly code, you'd use:
add %[result], %[arg1], %[arg2]
Documentation link.
Here's your whole example reworked (case changed for the assembly variables just to illustrate that they needn't be the same):
uint32_t arg1 = 1, arg2 = 2, result;
asm volatile(
"add %[RESULT], %[ARG1], %[ARG2]\n"
: [RESULT]"=r"(result) /* output */
: [ARG1]"r"(arg1), [ARG2]"r"(arg2) /* inputs */
: /* no clobbers */
);

Retrieving the ZF in GCC inline assembly

I need to use some x86 instructions that have no GCC intrinsics, such as BSF and BSR.
With GCC inline assembly, I can write something like the following
__INTRIN_INLINE unsigned char bsf64(unsigned long* const index, const uint64_t mask)
{
__asm__("bsf %[mask], %[index]" : [index] "=r" (*index) : [mask] "mr" (mask));
return mask ? 1 : 0;
}
Code like if (bsf64(x, y)) { /* use x */ } is translated by GCC to something like
0x000000010001bf04 <bsf64+0>: bsf %rax,%rdx
0x000000010001bf08 <bsf64+4>: test %rax,%rax
0x000000010001bf0b <bsf64+7>: jne 0x10001bf44 <...>
However if mask is zero, BSF already sets the ZF flag, so the test after bsf is redundant.
Instead of returning mask ? 1 : 0, is it possible to retrieve the ZF flag and return it, so that GCC does not generate the test?
EDIT: made the if example more clear
EDIT: In response to Damon, __builtin_ffsl generates even less optimal code. If I use the following code
int b = __builtin_ffsl(mask);
if (b) {
*index = b - 1;
return true;
} else {
return false;
}
GCC generates this assembly
0x000000000044736d <+1101>: bsf %r14,%r14
0x0000000000447371 <+1105>: cmove %r12,%r14
0x0000000000447375 <+1109>: add $0x1,%r14d
0x0000000000447379 <+1113>: je 0x4471c0 <...>
0x000000000044737f <+1119>: lea -0x1(%r14),%ecx
So the test is gone, but redundant conditional move, increment and decrement are generated.

A couple of remarks:
This is an "anti-optimization". You're trying to do a micro-optimization on something that the compiler already supports.
Your code does not generate the bsf instruction at all with my version of gcc with all optimization switches turned on. Looking at the code, that is not surprising, because you return mask, which is the source operand, not the destination operand (gcc uses AT&T syntax!). The compiler is intelligent enough to figure this out and drops the assembler code (which doesn't do anything) altogether.
There is an intrinsic function __builtin_ffsl which does exactly the same as your inline assembly (though, correctly). An intrinsic is no less portable than inline assembler, but easier for the compiler to optimize.
Using the intrinsic function results in a bsf cmov sequence on my compiler (assuming the calling code forces it to actually emit the instruction), which shows that the compiler uses the zero-flag just fine without an additional test instruction.
Returning a char when you want a bool is not the best possible hint for the compiler, though it will probably figure it out anyway most of the time. However, telling the compiler to use a bitscan instruction when you are really only interested in "zero or not zero" is certainly sub-optimal. if(x) and if(!x) work perfectly well for that matter. It would be different if you returned the result as reference, so you could reuse it in another place, but as it is, your code is only a very complicated way of writing if(x).
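For completeness, newer compilers offer a way to hand a condition flag back to C directly. The sketch below assumes GCC 6 or later (or a Clang that defines __GCC_ASM_FLAG_OUTPUTS__), where a flag output constraint such as "=@ccz" binds ZF to a C variable. It is not what the question's code does; the fallback using __builtin_ctzll is an assumption added so the function also builds elsewhere.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true and stores the index of the lowest set bit if mask != 0;
 * returns false and leaves *index untouched if mask == 0. */
static inline bool bsf64(unsigned long *index, uint64_t mask)
{
#if defined(__x86_64__) && defined(__GCC_ASM_FLAG_OUTPUTS__)
    uint64_t idx;
    bool zero;                               /* receives ZF after bsf */
    __asm__("bsfq %[mask], %[idx]"
            : [idx] "=r"(idx), [zf] "=@ccz"(zero)
            : [mask] "rm"(mask));
    if (zero)
        return false;                        /* idx is undefined when mask == 0 */
    *index = (unsigned long)idx;
    return true;
#else
    if (mask == 0)
        return false;
    *index = (unsigned long)__builtin_ctzll(mask);   /* GCC/Clang builtin */
    return true;
#endif
}
```

With the flag output, the compiler can branch on ZF directly rather than emitting a separate test of mask.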

Working with double-precision numbers in inline assembly (GCC, IA-32)

I'm just starting to learn assembly in my computer science class, and I have an assignment to round a floating-point value using a specified rounding mode. I've tried to implement this using fstcw, fldcw, and frndint. I modify the rounding control bits, round the number, and then restore the previous control bits (a requirement of the assignment).
The current outstanding problem is that the instruction fld %1 seems to load the wrong value into the st(0) floating-point register (for example, if I call the function with a value of 2.6207, the number -1.9427(...)e-29 gets loaded into the register). This may be due to a misuse of gcc's inline asm(), or something else, but I'm not sure why it happens.
Here's what I have:
double roundD (double n, RoundingMode roundingMode)
{
// control word storage (2 bytes for previous, 2 for current)
char *cw = malloc(4*sizeof(char));
char *cw2 = cw + 2;
asm("fstcw %3;" // store control word in cw
"mov %3,%4;" // copy control word into cw2
"and $0xF3FF,%4;" // zero out rounding control bits
"or %2,%4;" // put new mode into rounding control bits
"fldcw %5;" // load the modified control word
"fld %1;" // load n into st(0)
"frndint;" // round n
"fstp %0;" // load st(0) back into n
"fldcw %3;" // load the old control word from cw
: "=m" (n)
: "m" (n), "m" (roundingMode),
"m" (cw), "r" (cw2), "m" (cw2) // mov requires one argument in a register
);
free(cw);
return n;
}
I'd appreciate any pointers to what's wrong with that code, specifically relating to the fld %1 line and the asm inputs/outputs. (Of course, if you can find other problems, feel free to let me know about them as well.) I don't want anyone to do my homework for me, just point me in the right direction. Thanks!
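The control-word arithmetic described above can be checked in plain C before any assembly is involved. This is a minimal sketch, assuming the standard x87 layout in which the rounding-control field occupies bits 10 and 11 of the control word; the enum and the function name cw_with_rounding are hypothetical and unrelated to the assignment's RoundingMode type.

```c
#include <stdint.h>

/* x87 rounding-control encodings (bits 10-11 of the control word). */
enum x87_rc {
    RC_NEAREST = 0,  /* round to nearest (even) */
    RC_DOWN    = 1,  /* round toward -infinity  */
    RC_UP      = 2,  /* round toward +infinity  */
    RC_TRUNC   = 3   /* round toward zero       */
};

/* Hypothetical helper: returns cw with its rounding-control field
 * replaced by rc and all other bits left unchanged. */
static uint16_t cw_with_rounding(uint16_t cw, enum x87_rc rc)
{
    return (uint16_t)((cw & 0xF3FFu) | ((unsigned)rc << 10));
}
```

For instance, starting from the usual default control word 0x037F, selecting RC_DOWN gives 0x077F.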

At least one issue with your current code is it is using the single precision floating point versions of fld and fstp. If you replace them with fldl and fstpl it will probably work.

Here's what I've got. It's not tested, but hopefully would be less gnarly for you to work with. :-)
double
roundd(double n, short mode)
{
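short cw, newcw;
__asm__ volatile("fstcw %w0" : "=m" (cw)); // save the current control word
newcw = (cw & 0xf3ff) | mode; // mode must already be shifted into bits 10-11
__asm__ volatile("fldcw %w0" : : "m" (newcw)); // switch to the requested rounding mode
__asm__ volatile("frndint" : "+t" (n)); // round st(0); volatile keeps it between the two fldcw
__asm__ volatile("fldcw %w0" : : "m" (cw)); // restore the saved control word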
return n;
}
Although, if you're not required to use assembly to achieve your rounding mode, think about using the functions in <fenv.h> instead. :-)

Since the sign changes, the sign bit (the most significant bit) is wrong.
That suggests to me that the address behind %1 is misaligned. A single byte can begin at address 0, 1, 2, ..., but a two-byte access should begin at 0, 2, 4, ..., and a double should begin at an address evenly divisible by 8: 0, 8, 16, ...
So check whether the address you load the value from is divisible by 8. Assemblers provide an align directive to guarantee that your data is correctly aligned.
