How to interact with RISC-V CSRs by using GCC C code? - gcc

this is my first question to be asked here on stackoverflow, so please be kind with me ;)
I'm new to RISC-V and low level C coding and I'm wondering how to manipulate the RISC-V CSRs using GCC C code.
A read of a specific CSR (e.g. MISA) looks easy: csrr rd, 0x301 which is short for csrrs rd, 0x301, x0 can be done e.g. with
int result;
asm("csrr %0, 0x301" : "=r"(result) : );
How can I convert the code above into some kind of function / callable unit with the following interface: int read_csr(int csr_number)?
Since the CSR number must be an immediate value in machine code, is it possible without generating the code on the fly (self modifying code)?
Thanks for your replies and discussion.
Joachim

You can use the I constraint for an immediate constant argument:
inline int read_csr(int csr_num) __attribute__((always_inline)) {
int result;
asm("csrr %0, %1" : "=r"(result) : "I"(csr_num));
return result; }
this will fail if you ever call it with an argument that is not a constant expression, as the I constraint requires a constant.

Related

Halide external function call from generator

I want to implement simple image processing routine quite similar to Auto Levels, so need to precalculate thresholds, make LUT and then make histogram stretching/normalization applying LUT.
But my question is not about algorithm side, it is about using extern defined functions, because i need a couple of while cycles for LUT calculation and i think using extern is good for it.
I tried following examples from Halide sources and checked this question too
I use AOT compilation currently testing on PC(winx64), aiming for arm in future, and have the following generator code:
Var x("x"), y("y");
Func make_a_root{ "make_a_root" };
Buffer<bitType> Lut{256, "lut"};
make_a_root(x, y) = inputY(x, y);
ExternFuncArgument arg = make_a_root;
Func g;
g.define_extern("generateAutoLevelsLut", { arg }, UInt(8), 2, Halide::NameMangling::CPlusPlus);
g.compute_root();
inputY has Input<Buffer<uint8_t>> inputY{ "input_y", 2 }; type
First i just want to make it run the call, so function body makes nothing but print (can i define function in same cpp file as generator?)
int generateAutoLevelsLut(halide_buffer_t * input, halide_buffer_t * out)
{
printf("\nextern call\n");
return 0;
}
I tried default mangling with extern "C" too.
Never succeeded getting print message though, so my question is, why this happenin. Is it just misunderstanding on some syntax or are there any problem with calling extern function from generator code?
EDIT:
Added usage of extern like 'out(x,y) = g(x,y)' (lvalue should be actually used!) , and it started to make a call. Now struggling with host == NULL. Digging into bounds inference stuff.
EDIT 2:
I added basic bounds inference checks, now it does not crash.. The next problem i have now, is: Is it possible to make call to external function, without actually influencing output result in direct manner?
Let me concretise what i mean.
The generator code looks like following:
Buffer<bitType> lut{256, "lut"};
args[0] = inputY;
args[1] = lut;
g.define_extern("generateAutoLevelsLut", args, { UInt(8) }, 2, Halide::NameMangling::C);
outputY(x, y) = g(x, y); // Call line
g.compute_root();
outputY.compute_root();
Extern functon code fills second input 'lut' with some dummy LUT:
Halide::Runtime::Buffer<uint16_t> im2Buffer(*input2);
Mat im2Mat(Size(im2Buffer.width(), im2Buffer.height()), CVC_8U, im2Buffer.data(), im2Buffer.stride(1));
for (int i = 0; i < 256; i++)
im2Mat.at<uchar>(i) = i;
And if i comment the 'Call line' in generator, it optimizes away the call to extern at all.
I want to make something like:
Func lutRoot;
lutRoot(x) = lut(x); // to convert from Buffer
outputY(x, y) = autoLevelsPrecalcLut(inputY, lutRoot)(x, y);
And here lut is implicitly passed into extern and filled there. But it doesn't work, as well as other variants which ignore the modification of 'output'... or maybe this whole approach is wrong?
Any suggestions? Thanks
EDIT 3:
Solved task avoiding extern calls, replacing while cycles with argmin and RDom combo, but original question about extern still remains
That should work (or fail with a linker error if it wasn't going to). It's possible the Halide pipeline doesn't think it needs to call your extern function. E.g. does something use the result?
Alternatively, try stderr instead, just in case it's an output stream buffering issue. That extern function definition is likely to cause Halide to error out (because it doesn't reply to the bounds inference query), and erroring out calls abort by default, which would swallow things printed to stdout.

Can evaluation of functions happen during compile time?

Consider the below function,
public static int foo(int x){
return x + 5;
}
Now, let us call it,
int in = /*Input taken from the user*/;
int x = foo(10); // ... (1)
int y = foo(in); // ... (2)
Here, can the compiler change
int x = foo(10); // ... (1)
to
int x = 15; // ... (1)
by evaluating the function call during compile time since the input to the function is available during compile time ?
I understand this is not possible during the call marked (2) because the input is available only during run time.
I do not want to know a way of doing it in any specific language. I would like to know why this can or can not be a feature of a compiler itself.
C++ does have a method for this:
Have a read up on the 'constexpr' keyword in C++11, it allows compile time evaluation of functions.
They have a limitation: the function must be a return statement (not multiple lines of code), but can call other constexpr functions (C++14 does not have this limitation AFAIK).
static constexpr int foo(int x){
return x + 5;
}
EDIT:
Why a compiler might not evaluate a function (just my guess):
It might not be appropriate to remove a function by evaluating it without being told.
The function could be used in different compilation units, and with static/dynamic inputs: thus evaluating it in some circumstances and adding a call in other places.
This use would provide inconsistent execution times (especially on a deterministic platform like AVR) where timing may be important, or at least need to be predictable.
Also interrupts (and how the compiler interacts with them) may come into play here.
EDIT:
constexpr is actually stronger -- it requires that the compiler do this. The compiler is free to fold away functions without constexpr, but the programmer can't rely on it doing so.
Can you give an example in the case where the user would have benefited from this but the compiler chose not to do it ?
inline functions may, or may not resolve to constant expressions which could be optimized into the end result.
However, a constexpr guarantees it. An inline function cannot be used as a compile time constant whereas constexpr can allow you to formulate compile time functions and more so, objects.
A basic example where constexpr makes a guarantee that inline cannot.
constexpr int foo( int a, int b, int c ){
return a+b+c;
}
int array[ foo(1, 2, 3) ];
And the same as a simple object.
struct Foo{
constexpr Foo( int a, int b, int c ) : val(a+b+c){}
int val;
};
constexpr Foo foo( 1,2,4 );
int array[ foo.val ];
Unless foo.val is a compile time constant, the code above will not compile.
Even as just a function, an inline function has no guarantee. And the linker can also do inlining over multiple compilation units, after the syntax has been compiled (array bounds checked for integer constants).
This is kind of like meta-programming, but without the templates. Of course these examples do not do the topic justice, however very complex solutions would benefit from the ability to use objects and functional programming to achieve a result.
Yes, evaluation can happen during compile time. This comes under the heading of constant folding and function inlining, both of which are common optimizations for optimizing compilers.
Many languages do not have strong distinction between "compile time" and "run time", but the general rule is that the language defines an "execution model" which defines the behavior of any particular program with any particular input (or specifies that it is undefined). The compiler must produce an executable that can read any input and produce the corresponding output as defined by the execution model. What happens inside the executable doesn't matter -- as long as the externally viewed behavior is correct.
Here "input", "output" and "behavior" includes all possible interactions with the environment that are defined in the execution model, including timing effects.

NEON inline assembly - store query

I am trying to learn how to utilize NEON using gcc and inline assembly.
While it is confusing and slow going, I making some progress (It's been 10 years since I last tried writing assembly).
My simple program loads a (small) vector, saturation sums it, and stores it. The problem I am having is that I cannot seem to store the result in the place I want.
When I use an unused array pointer (r) in my output list, I get an error "impossible constraint in asm". If I then create a second pointer to it (rptr), it assembles, but it re-uses an input register r2 which is a, effectively overwriting the input.
(I know my arrays are 32 elements in size and that I'm only processing one element, I plan to try to loop, or try load more registers for parallel processing next)
void vecSum()
{
//two input arrays of 32 bit types, one output
int32_t a[32];
int32_t b[32];
int32_t r[32];
//initialize
for(int cnt = 0; cnt < 32; cnt++)
{
a[cnt] = 0x33333333;
b[cnt] = 0x11111111;
r[cnt] = 0;
}
void *rptr = r;
__asm__ volatile(
"vld1.32 {d0},[%[ina]]!\n" //load the neon register with our data at a, post increment the reg
"vld1.32 {d1},[%[inb]]!\n"
"vqadd.s32 d0,d1\n" //perform the sat
"vst1.32 d0,[%[result]]\n" //store the answer
: [result]"=r" (rptr) /*r*/
: [ina] "r" (a), [inb] "r" (b)
: /*"d0", "d1", "d2"*/);
for(int g=0; g < 32; g++)
{
printf("0x[%d]%x ",g,a[g]);
}
}
Objdump:
for(int cnt = 0; cnt < 32; cnt++)
780: e3530080 cmp r3, #128 ; 0x80
784: 1afffff7 bne 768 <_Z8vecSum32v+0x28>
"vld1.32 {d1},[%[inb]]!\n"
"vqadd.s32 d0,d1\n" //perform the sat
"vst1.32 d0,[%[result]]\n"
: [result]"=r" (rptr)
: [ina] "r" (a), [inb] "r" (b)
: /*"d0", "d1", "d2"*/);
788: f422078f vld1.32 {d0}, [r2]
78c: f421178d vld1.32 {d1}, [r1]!
790: f2200011 vqadd.s32 d0, d0, d1
794: f402078f vst1.32 {d0}, [r2]
In summary, if I try vst1.32 d0,[%[result]] where result is the array pointer r, I get a compilation error. If I rptr ( another pointer to r) it comiles, but uses r2 (the array a) as the output.
Can anybody explain why I get the error outputting to r? And why the ptr to r is a?
rptr is declared as an output when it should be an input and "memory" is missing from the clobber list.
Alternatively you may put the arrays in structs and use the structs (rather than pointers) as arguments to the asm statement.
Consider if the asm contained add %[result], %[ina], %[inb]. There's no harm whatsoever in allocating r2 for both result and ina there. Since GCC doesn't go analysing the contents of the asm statement, its default assumption is that it contains a single instruction like that, so if yours is more complicated then you need to say so in order for things to work as expected.
Specifically, to prevent the problematic overlapping register allocation here, you need to be honest about the fact that you that your asm modifies the input registers - most simply via the + modifier (which then actually makes them outputs as far as GCC is concerned). Another unpleasant side effect of not doing that, is that the compiler would assume that e.g. r1 still holds the address of b afterwards, and may generate later code relying on that which will then go horribly wrong thanks to what the asm actually did.
Furthermore, you don't modify the result pointer, and only use its value as an input, so saying it's a write-only output operand is very wrong.
As for the issue with r, well, by specifying it as an output operand, you're saying that the asm writes a value back to that variable. Except you can't do that with an array variable in C (<languagelawyer> arrays are not modifiable lvalues) - you need to give the asm a variable which holds the address of the array and can be assigned back to, i.e. a pointer variable. The reason you can use the arrays directly as input operands, is because input operands are expressions, not variables, and an expression that evaluates to an array is automatically converted to a pointer to first element of that array (but is still not an lvalue </languagelawyer>).
All in all then, with appropriate pointer variables for a and b, suitable operands and constraints for this code as-is would look more like this:
: [ina] "+r" (aptr), [inb] "+r" (bptr)
: [result] "r" (r)
: "d0", "d1", "memory" /* getting clobbers right is also important */
Side note: if you just want to get to grips with NEON instructions rather than fighting with GCC, intrinsics are an alternative to consider.

Can pretty variable names be used for registers in GCC inline assembly?

I have some inline assembly. I want GCC to have total freedom in choosing GP registers to allocate. I also want to use pretty names for registers inside the assembly for ease of comprehension for future maintainers. I think I did this previously (10+ years ago) for ARM 5te but am now scratching my head while writing some AArch64 code.
In a simpler example, this is what I want:
uint32_t arg1 = 1, arg2 = 2, result;
asm volatile(
"add %result, %arg1, %arg2\n"
// Outputs:
: ???
// Inputs:
: ???
// Clobbered:
: ???
);
I figure I need the right voodoo to go where I have written "???" above.
Is it possible?
Yes.
[arg1] "r" (arg1)
For example. The two names([arg1] and (arg1) above) can be different.
Inside the assembly code, you'd use:
add %[result], %[arg1], %[arg2]
Documentation link.
Here's your whole example reworked (case changed for the assembly variables just to illustrate that they needn't be the same):
uint32_t arg1 = 1, arg2 = 2, result;
asm volatile(
"add %[RESULT], %[ARG1], %[ARG2]\n"
: [RESULT]"=r"(result) /* output */
: [ARG1]"r"(arg1), [ARG2]"r"(arg2) /* inputs */
: /* no clobbers */
);

Retrieving the ZF in GCC inline assembly

I need to use some x86 instructions that have no GCC intrinsics, such as BSF and BSR.
With GCC inline assembly, I can write something like the following
__INTRIN_INLINE unsigned char bsf64(unsigned long* const index, const uint64_t mask)
{
__asm__("bsf %[mask], %[index]" : [index] "=r" (*index) : [mask] "mr" (mask));
return mask ? 1 : 0;
}
Code like if (bsf64(x, y)) { /* use x */ } is translated by GCC to something like
0x000000010001bf04 <bsf64+0>: bsf %rax,%rdx
0x000000010001bf08 <bsf64+4>: test %rax,%rax
0x000000010001bf0b <bsf64+7>: jne 0x10001bf44 <...>
However if mask is zero, BSF already sets the ZF flag, so the test after bsf is redundant.
Instead of returning mask ? 1 : 0, is it possible to retrieve the ZF flag and returning it, making GCC not generate the test?
EDIT: made the if example more clear
EDIT: In response to Damon, __builtin_ffsl generates even less optimal code. If I use the following code
int b = __builtin_ffsl(mask);
if (b) {
*index = b - 1;
return true;
} else {
return false;
}
GCC generates this assembly
0x000000000044736d <+1101>: bsf %r14,%r14
0x0000000000447371 <+1105>: cmove %r12,%r14
0x0000000000447375 <+1109>: add $0x1,%r14d
0x0000000000447379 <+1113>: je 0x4471c0 <...>
0x000000000044737f <+1119>: lea -0x1(%r14),%ecx
So the test is gone, but redundant conditional move, increment and decrement are generated.
A couple of remarks:
This is an "anti-optimization". You're trying to do a micro-optimization on something that the compiler already supports.
Your code does not generate the bsf instruction at all with my version of gcc with all optimization switches turned on. Looking at the code, that is not surprising, because you return mask, which is the source operand, not the destination operand (gcc uses AT&T syntax!). The compiler is intelligent enough to figure this out and drops the assembler code (which doesn't do anything) alltogether.
There is an intrinsic function __builtin_ffsl which does exactly the same as your inline assembly (though, correctly). An intrinsic is no less portable than inline assembler, but easier for the compiler to optimize.
Using the intrinsic function results in a bsf cmov sequence on my compiler (assuming the calling code forces it to actually emit the instruction), which shows that the compiler uses the zero-flag just fine without an additional test instruction.
Returning a char when you want a bool is not the best possible hint for the compiler, though it will probably figure it out anyway most of the time. However, telling the compiler to use a bitscan instruction when you are really only interested in "zero or not zero" is certainly sub-optimal. if(x) and if(!x) work perfectly well for that matter. It would be different if you returned the result as reference, so you could reuse it in another place, but as it is, your code is only a very complicated way of writing if(x).

Resources