Working with double-precision numbers in inline assembly (GCC, IA-32) - gcc

I'm just starting to learn assembly in my computer science class, and I have an assignment to round a floating-point value using a specified rounding mode. I've tried to implement this using fstcw, fldcw, and frndint. I modify the rounding control bits, round the number, and then restore the previous control bits (a requirement of the assignment).
The current outstanding problem is that the instruction fld %1 seems to load the wrong value into the st(0) floating-point register (for example, if I call the function with a value of 2.6207, the number -1.9427(...)e-29 gets loaded into the register). This may be due to a misuse of gcc's inline asm(), or something else, but I'm not sure why it happens.
Here's what I have:
double roundD (double n, RoundingMode roundingMode)
{
// control word storage (2 bytes for previous, 2 for current)
char *cw = malloc(4*sizeof(char));
char *cw2 = cw + 2;
asm("fstcw %3;" // store control word in cw
"mov %3,%4;" // copy control word into cw2
"and $0xF3FF,%4;" // zero out rounding control bits
"or %2,%4;" // put new mode into rounding control bits
"fldcw %5;" // load the modified control word
"fld %1;" // load n into st(0)
"frndint;" // round n
"fstp %0;" // load st(0) back into n
"fldcw %3;" // load the old control word from cw
: "=m" (n)
: "m" (n), "m" (roundingMode),
"m" (cw), "r" (cw2), "m" (cw2) // mov requires one argument in a register
);
free(cw);
return n;
}
I'd appreciate any pointers to what's wrong with that code, specifically relating to the fld %1 line and the asm inputs/outputs. (Of course, if you can find other problems, feel free to let me know about them as well.) I don't want anyone to do my homework for me, just point me in the right direction. Thanks!

One issue with your current code is that fld and fstp, written without a size suffix, are assembled as the single-precision forms. Since n is a double, replace them with fldl and fstpl and it will probably work.

Here's what I've got. It's not tested, but hopefully would be less gnarly for you to work with. :-)
double
roundd(double n, short mode)
{
short cw, newcw;
__asm__("fstcw %w0" : "=m" (cw));
newcw = (cw & 0xf3ff) | mode; // clear the rounding-control bits, then set the new mode
__asm__("fldcw %w0" : : "m" (newcw));
__asm__("frndint" : "+t" (n));
__asm__("fldcw %w0" : : "m" (cw));
return n;
}
Although, if you're not required to use assembly to achieve your rounding mode, think about using the functions in <fenv.h> instead. :-)

Since the sign changes, the sign bit (the most significant bit) is not correct.
That suggests to me that the pointer %1 is misaligned. A single byte can start at any address (0, 1, 2, ...), but a two-byte access must start at an even address (0, 2, 4, ...), and a double must start at an address divisible by 8 (0, 8, 16, ...).
So check whether the address you load the value from is divisible by 8. Assembly has the align keyword to guarantee that your data is correctly aligned.


How to change a boost::multiprecision::cpp_int from big endian to little endian

I have a boost::multiprecision::cpp_int in big endian and have to change it to little endian. How can I do that? I tried with boost::endian::conversion but that did not work.
boost::multiprecision::cpp_int bigEndianInt("0xe35fa931a0000");
boost::multiprecision::cpp_int littleEndianInt;
littleEndianInt = boost::endian::endian_reverse(bigEndianInt);
The memory layout of Boost multiprecision types is an implementation detail, so you cannot assume much about it anyway (they're not supposed to be bitwise serializable).
Just read a random section of the docs:
MinBits
Determines the number of Bits to store directly within the object before resorting to dynamic memory allocation. When zero, this field is determined automatically based on how many bits can be stored in union with the dynamic storage header: setting a larger value may improve performance as larger integer values will be stored internally before memory allocation is required.
It's not immediately clear that you have any chance at some level of "normal int behaviour" in memory layout. The only exception would be when MinBits==MaxBits.
Indeed, we can static_assert that the size of cpp_int with such backend configs match the corresponding byte-sizes.
It turns out that there's even a promising tag in the backend base-class to indicate "triviality" (this is truly promising): trivial_tag, so let's use it:
#include <boost/multiprecision/cpp_int.hpp>
namespace mp = boost::multiprecision;
template <int bits> using simple_be =
mp::cpp_int_backend<bits, bits, mp::unsigned_magnitude>;
template <int bits> using my_int =
mp::number<simple_be<bits>, mp::et_off>;
using my_int8_t = my_int<8>;
using my_int16_t = my_int<16>;
using my_int32_t = my_int<32>;
using my_int64_t = my_int<64>;
using my_int128_t = my_int<128>;
using my_int192_t = my_int<192>;
using my_int256_t = my_int<256>;
template <typename Num>
constexpr bool is_trivial_v = Num::backend_type::trivial_tag::value;
int main() {
static_assert(sizeof(my_int8_t) == 1);
static_assert(sizeof(my_int16_t) == 2);
static_assert(sizeof(my_int32_t) == 4);
static_assert(sizeof(my_int64_t) == 8);
static_assert(sizeof(my_int128_t) == 16);
static_assert(is_trivial_v<my_int8_t>);
static_assert(is_trivial_v<my_int16_t>);
static_assert(is_trivial_v<my_int32_t>);
static_assert(is_trivial_v<my_int64_t>);
static_assert(is_trivial_v<my_int128_t>);
// however it doesn't scale
static_assert(sizeof(my_int192_t) != 24);
static_assert(sizeof(my_int256_t) != 32);
static_assert(not is_trivial_v<my_int192_t>);
static_assert(not is_trivial_v<my_int256_t>);
}
Concluding: you can have a trivial int representation up to a certain point, after which you get the allocator-based dynamic-limb implementation no matter what.
Note that using unsigned_packed instead of the unsigned_magnitude representation never leads to a trivial backend implementation.
Note that triviality might depend on compiler/platform choices (it's likely that the 128-bit type uses some builtin compiler/standard library support on GCC, for example).
Given this, you MIGHT be able to pull off what you wanted to do with hacks IF your backend configuration supports triviality. Sadly I think it requires you to manually overload endian_reverse for the 128-bit case, because the GCC builtins do not have __builtin_bswap128, nor does Boost Endian define things.
I'd suggest working off the information here How to make GCC generate bswap instruction for big endian store without builtins?
Final Demo (not complete)
#include <boost/multiprecision/cpp_int.hpp>
#include <boost/endian/buffers.hpp>
#include <iostream>
namespace mp = boost::multiprecision;
namespace be = boost::endian;
template <int bits> void check() {
using T = mp::number<mp::cpp_int_backend<bits, bits, mp::unsigned_magnitude>, mp::et_off>;
static_assert(sizeof(T) == bits/8);
static_assert(T::backend_type::trivial_tag::value);
be::endian_buffer<be::order::big, T, bits, be::align::no> buf;
buf = T("0x0102030405060708090a0b0c0d0e0f00");
std::cout << std::hex << buf.value() << "\n";
}
int main() {
check<128>();
}
(Changing be::order::big to be::order::native obviously makes it compile. The other way to complete it would be to have an ADL accessible overload for endian_reverse for your int type.)
This is both trivial and, in the general case, unanswerable. Let me explain:
For a general N-bit integer, where N is a large number, there is unlikely to be any well defined byte order, indeed even for 64 and 128 bit integers there are more than 2 possible orders in use: https://en.wikipedia.org/wiki/Endianness#Middle-endian.
On any platform, with any native endianness you can always extract the bytes of a cpp_int, the first example here: https://www.boost.org/doc/libs/1_73_0/libs/multiprecision/doc/html/boost_multiprecision/tut/import_export.html#boost_multiprecision.tut.import_export.examples shows you how. When exporting bytes like this, they are always most significant byte first, so you can subsequently rearrange them how you wish. You should not however, rearrange them and load them back into a cpp_int as the class won't know what to do with the result!
If you know that the value is small enough to fit into a native integer type, then you can simply cast to the native integer and use a system API on the result. As in endian_reverse(static_cast<int64_t>(my_cpp_int)). Again, don't assign the result back into a cpp_int as it requires native byte order.
If you wish to check whether a value is small enough to fit in an N-bit integer for the approach above, you can use the msb function, which returns the index of the most significant bit in the cpp_int. Add one to that to obtain the number of bits used, filter out the zero case, and the code looks like:
unsigned bits_used = my_cpp_int.is_zero() ? 0 : msb(my_cpp_int) + 1;
Note that all of the above use completely portable code - no hacking of the underlying implementation is required.

Analyze store instruction consisting inttoptr in LLVM

I am trying to analyze bitcode containing a store instruction with inttoptr. I am having trouble detecting whether a store instruction has an inttoptr value as its value operand (the third instruction in the entry block below). My IR looks like the following:
define dso_local i32 @test(i32* %p) #0 {
entry:
%p.addr = alloca i32*, align 8
store i32* %p, i32** %p.addr, align 8
store i32* inttoptr (i64 1000 to i32*), i32** %p.addr, align 8
%0 = load i32*, i32** %p.addr, align 8
%1 = load i32, i32* %0, align 4
ret i32 %1
}
I am analyzing the store instruction and trying to find out whether its value is an inttoptr, using both the classof method and dyn_cast, as in the following code:
StoreInst *store = dyn_cast<StoreInst>(I);
Value *vv = store->getValueOperand();
Value *vp = store->getPointerOperand();
if(IntToPtrInst::classof(vv)){
outs() << "Inttoptr found\n";
}
if(Instruction *inp = dyn_cast<IntToPtrInst>(vv)){
outs() << "Inttoptr found\n";
}
I am not able to detect the inttoptr with either method. I know the IR does not create a separate instruction for the inttoptr but merges it into the store instruction. It would be really nice if anyone could point out what I am missing and how I can detect the inttoptr in a store instruction.
The cast you're interested in is not an instruction, but rather a constant expression: a constant cast from the constant integer 1000 to a pointer. That is why neither IntToPtrInst::classof nor dyn_cast<IntToPtrInst> matches it. Note also that it is the value operand of the store, not the pointer operand. You can detect it using a test like isa<ConstantExpr>(store->getValueOperand()) && cast<ConstantExpr>(store->getValueOperand())->getOpcode() == Instruction::IntToPtr, but I typed that from memory, so check it against the headers.
When you read IR, instructions are always on their own line, while constants are inline as arguments or initialisers, even the quite complex constants produced using ConstantExpr.

Too large const on Arduino UNO

I'm trying to run an algorithm on an Arduino UNO. It needs a const table with some large numbers, and sometimes I get overflow values. This is the case for this number: 628331966747.0
Okay, this is a big one, but its type is float (32 bit), whose maximum is 3.4028235e38. So it should work, theoretically?
What can I do about this? Do you know a solution?
EDIT: On the Arduino UNO, double is exactly the same type as float (32 bits).
Here is code that reproduces the error:
float A;
void setup() {
A = 628331966747.0;
Serial.begin(9600);
}
void loop() {
Serial.println(A);
delay(1000);
}
It prints "ovf, ovf, ..., ovf"
There is nothing wrong with the constant itself (except for its rather optimistic number of significant figures), but the problem is with the implementation of the Arduino's library support for printing floating point values. Print::printFloat() contains the following pre-condition tests:
if (isnan(number)) return print("nan");
if (isinf(number)) return print("inf");
if (number > 4294967040.0) return print ("ovf"); // constant determined empirically
if (number <-4294967040.0) return print ("ovf"); // constant determined empirically
It seems that the range of printable values is deliberately restricted in order presumably to reduce complexity and code size. The subsequent code reveals why:
// Extract the integer part of the number and print it
unsigned long int_part = (unsigned long)number;
double remainder = number - (double)int_part;
n += print(int_part);
The somewhat simplistic implementation requires that the absolute value of the integer part fits in a 32-bit unsigned integer.
The worrying thing perhaps is the comment "constant determined empirically", which rather suggests that the values were arrived at by trial and error rather than by an understanding of the mathematics! (4294967040 is the largest float value below 2^32.) One has to wonder why these values are not defined in terms of ULONG_MAX.
There is a proposed "fix" described here, but it will not work, at least because it applies the integer abs() function to the double parameter number, which will only work if the integer part is less than the even more restrictive INT_MAX. The author has posted a link to a zip file containing a fix that looks more likely to work (there is evidence at least of testing!).

NEON inline assembly - store query

I am trying to learn how to utilize NEON using gcc and inline assembly.
While it is confusing and slow going, I am making some progress (it's been 10 years since I last tried writing assembly).
My simple program loads a (small) vector, saturation sums it, and stores it. The problem I am having is that I cannot seem to store the result in the place I want.
When I use an unused array pointer (r) in my output list, I get an error "impossible constraint in asm". If I then create a second pointer to it (rptr), it assembles, but it re-uses an input register r2 which is a, effectively overwriting the input.
(I know my arrays are 32 elements in size and that I'm only processing one element, I plan to try to loop, or try load more registers for parallel processing next)
void vecSum()
{
//two input arrays of 32 bit types, one output
int32_t a[32];
int32_t b[32];
int32_t r[32];
//initialize
for(int cnt = 0; cnt < 32; cnt++)
{
a[cnt] = 0x33333333;
b[cnt] = 0x11111111;
r[cnt] = 0;
}
void *rptr = r;
__asm__ volatile(
"vld1.32 {d0},[%[ina]]!\n" //load the neon register with our data at a, post increment the reg
"vld1.32 {d1},[%[inb]]!\n"
"vqadd.s32 d0,d1\n" //perform the sat
"vst1.32 d0,[%[result]]\n" //store the answer
: [result]"=r" (rptr) /*r*/
: [ina] "r" (a), [inb] "r" (b)
: /*"d0", "d1", "d2"*/);
for(int g=0; g < 32; g++)
{
printf("0x[%d]%x ",g,a[g]);
}
}
Objdump:
for(int cnt = 0; cnt < 32; cnt++)
780: e3530080 cmp r3, #128 ; 0x80
784: 1afffff7 bne 768 <_Z8vecSum32v+0x28>
"vld1.32 {d1},[%[inb]]!\n"
"vqadd.s32 d0,d1\n" //perform the sat
"vst1.32 d0,[%[result]]\n"
: [result]"=r" (rptr)
: [ina] "r" (a), [inb] "r" (b)
: /*"d0", "d1", "d2"*/);
788: f422078f vld1.32 {d0}, [r2]
78c: f421178d vld1.32 {d1}, [r1]!
790: f2200011 vqadd.s32 d0, d0, d1
794: f402078f vst1.32 {d0}, [r2]
In summary, if I try vst1.32 d0,[%[result]] where result is the array r itself, I get a compilation error. If I use rptr (another pointer to r) it compiles, but uses r2 (the array a) as the output.
Can anybody explain why I get the error when outputting to r? And why does the pointer to r end up being a?
rptr is declared as an output when it should be an input and "memory" is missing from the clobber list.
Alternatively you may put the arrays in structs and use the structs (rather than pointers) as arguments to the asm statement.
Consider if the asm contained add %[result], %[ina], %[inb]. There's no harm whatsoever in allocating r2 for both result and ina there. Since GCC doesn't go analysing the contents of the asm statement, its default assumption is that it contains a single instruction like that, so if yours is more complicated then you need to say so in order for things to work as expected.
Specifically, to prevent the problematic overlapping register allocation here, you need to be honest about the fact that your asm modifies the input registers, most simply via the + modifier (which then actually makes them outputs as far as GCC is concerned). Another unpleasant side effect of not doing that is that the compiler would assume that e.g. r1 still holds the address of b afterwards, and may generate later code relying on that, which will then go horribly wrong thanks to what the asm actually did.
Furthermore, you don't modify the result pointer, and only use its value as an input, so saying it's a write-only output operand is very wrong.
As for the issue with r, well, by specifying it as an output operand, you're saying that the asm writes a value back to that variable. Except you can't do that with an array variable in C (<languagelawyer> arrays are not modifiable lvalues) - you need to give the asm a variable which holds the address of the array and can be assigned back to, i.e. a pointer variable. The reason you can use the arrays directly as input operands, is because input operands are expressions, not variables, and an expression that evaluates to an array is automatically converted to a pointer to first element of that array (but is still not an lvalue </languagelawyer>).
All in all then, with appropriate pointer variables for a and b, suitable operands and constraints for this code as-is would look more like this:
: [ina] "+r" (aptr), [inb] "+r" (bptr)
: [result] "r" (r)
: "d0", "d1", "memory" /* getting clobbers right is also important */
Side note: if you just want to get to grips with NEON instructions rather than fighting with GCC, intrinsics are an alternative to consider.

What is the meaning of f in "js 2f\n\t"?

The code:
extern inline int strncmp(const char * cs, const char * ct, int count)
{
register int __res;
__asm__("cld\n"
"1:\tdecl %3\n\t"
"js 2f\n\t"
"lodsb\n\t"
"scasb\n\t"
"jne 3f\n\t"
"testb %%al, %%al\n\t"
"jne 1b\n"
"2:\txorl %%eax,%%eax\n\t"
"jmp 4f\n"
"3:\tmovl $1,%%eax\n\t"
"jl 4f\n\t"
"negl %%eax\n"
"4:"
:"=a" (__res):"D" (cs), "S" (ct), "c" (count):"si","di","cx");
return __res;
}
I don't understand the f in "js 2f\n\t" and the b in "jne 1b\n". How should I read these, and which book should I look at? Thank you.
In this context f means forward and b means backward. So js 2f means jump forward to label 2, if sign set.
You'll want to look into gcc inline assembly. I can't seem to find any reference online to include this bit, but I know you can find it in Professional Assembly Language.
Why can't we use named labels ? To quote from the book:
If you have another asm section in your C code, you cannot use the
same labels again, or an error message will result due to duplicate
use of labels.
So what can we do ?
The solution is to use local labels. Both conditional and
unconditional branches allow you to specify a number as a label, along
with a directional flag to indicate which way the processor should
look for the numerical label. The first occurrence of the label found
will be taken.
About modifiers:
Use the f modifier to indicate the label is forward from the jump
instruction. To move backward, you must use the b modifier.
This is documented in the manual for the assembler.
