Linker error while calling c function from Arm thumb2 - gcc

I am having trouble calling a C function from ARM assembly; calling assembly from C works fine. The architecture is Cortex-M3, the board is an Arduino Due, and the compiler is gcc.
Here's the assembly code:
.syntax unified
.section .text
.thumb_func
.cpu cortex-m3
.extern my_c_add
.global call_my_c_add
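call_my_c_add: @ r0 = x, r1 = y
push {r4, lr} @ bl overwrites lr, so save it (r4 keeps the stack 8-byte aligned)
bl my_c_add
pop {r4, pc} @ return to the caller, result in r0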
And here's the c code:
#include <Arduino.h>
#include <SPI.h>
#include <Ethernet.h>
extern "C" unsigned int call_my_c_add (unsigned int, unsigned int);
unsigned int my_c_add(unsigned int, unsigned int);
unsigned int x=20;
unsigned int y = 15;
void setup()
{
Serial.begin(115200);
Serial.println("exiting setup");
}
void loop()
{
unsigned int z = 0;
z = call_my_c_add (x, y);
Serial.print("c calling asm calling c, addition is - ");
Serial.println(z);
}
unsigned int my_c_add(unsigned int x, unsigned int y)
{
return (x+y);
}
The error I get is -
small_sample.S.o: In function `call_my_c_add':
small_sample.S:12: undefined reference to `my_c_add'
collect2: ld returned 1 exit status
Here's the command I use for linking -
arm-none-eabi-g++ -O3 -Wl,--gc-sections -mcpu=cortex-m3 -T flash.ld -Wl,-Map,mapfile -o elffile -L somefile -lm -lgcc -mthumb -Wl,--cref -Wl,--check-sections -Wl,--gc-sections -Wl,--entry=Reset_Handler -Wl,--unresolved-symbols=report-all -Wl,--warn-common -Wl,--warn-section-align -Wl,--start-group some.c.o some2.cpp.o assembly.S.o somelib.a -Wl,--end-group

g++ mangles the names of C++ functions, so the symbol emitted for my_c_add is not the plain my_c_add that the assembly refers to. You probably need to add extern "C" to the declaration of my_c_add as well, to disable mangling for that function.
Run arm-none-eabi-nm on the two object files and check that the symbol defined in the object compiled from C/C++ has the same name as the symbol referenced in the object compiled from assembly.
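A minimal sketch of the usual guard pattern, assuming the declaration lives in a file that may be compiled either as C or as C++ (in an Arduino sketch it is always C++). Only my_c_add comes from the question; the layout is an illustration, not the poster's actual file:

```c
/* Declaration shared by the C/C++ side and the assembly caller.
 * The guard gives my_c_add C linkage when compiled as C++, so the
 * object file exports the unmangled symbol "my_c_add". */
#ifdef __cplusplus
extern "C" {
#endif

unsigned int my_c_add(unsigned int x, unsigned int y);

#ifdef __cplusplus
}
#endif

/* Definition: picks up the C linkage from the declaration above. */
unsigned int my_c_add(unsigned int x, unsigned int y)
{
    return x + y;
}
```

With this in place, arm-none-eabi-nm on the C++ object should list my_c_add under its plain name rather than a mangled one.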

gcc why does -Woverflow not catch int8_t i = 128;?

Edit: it's probably a gcc bug. Reported: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108798
int8_t has a range of -128 to 127, so when compiling the code
#include <stdint.h>
int main(){
int8_t i = 128;
(void)i;
}
with
gcc t.c -Woverflow
why does it not catch the overflow of 128? FWIW, gcc catches the overflow only when the value goes over 255, for example
#include <stdint.h>
int main(){
int8_t i = 256;
(void)i;
}
is caught. So why isn't 128 caught?
Interestingly, it catches the 128 overflow with -Woverflow -Wpedantic, resulting in a
prog.c:3:15: warning: overflow in conversion from 'int' to 'int8_t' {aka 'signed char'} changes value from '128' to '-128' [-Woverflow]
3 | int8_t ii=128;
tested with gcc 12.1.0 (released 2022-05-06)
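For reference, the range check that the question expects the warning to reflect can be written out by hand. This only illustrates the arithmetic (-128 to 127), not how GCC implements -Woverflow; fits_in_int8 is a hypothetical helper name:

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 when v can be stored in an int8_t
 * without changing its value, 0 otherwise. */
static int fits_in_int8(int v)
{
    return v >= INT8_MIN && v <= INT8_MAX;
}
```

Both 128 and 256 fail this check, which is why one would expect both initializers to be diagnosed.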

Can gcc or clang inline functions that are not in the same compilation unit?

Say I have a function with the inline keyword in a compilation unit.
If I have
// math.h
inline int sum(int x, int y);
and
// math.c
inline int sum(int x, int y)
{
return x + y;
}
and
// main.c
#include "math.h"
int main(int argc, char **argv)
{
return sum(argc,argc);
}
And building with
gcc -O3 -c math.c -o math.o
gcc -O3 -c main.c -o main.o
gcc math.o main.o
Will an optimizing compiler inline sum? Can gcc or clang inline functions from other compilation units?
GCC can (and often will) inline functions from different TUs when you compile with LTO enabled. For this you need to add -flto to CFLAGS/CXXFLAGS and LDFLAGS.
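If LTO is not an option, a common alternative (a different technique from -flto, named here only for comparison) is to put the whole definition in the header as static inline, so that every translation unit that includes the header sees the body. A minimal sketch, assuming the header can be changed:

```c
/* Contents that would live in the shared header.
 * static inline gives each translation unit its own copy of the body,
 * so the optimizer can inline sum without seeing any other object file. */
static inline int sum(int x, int y)
{
    return x + y;
}
```

The trade-off is that the body is compiled into every file that uses it, and there is no single external sum symbol to link against.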

OpenMP pragma translation to runtime calls

I wrote a short program in C with OpenMP pragma, and I need to know to which libGOMP function a pragma is translated by GCC.
Here is my marvelous code:
#include <stdio.h>
#include "omp.h"
int main(int argc, char** argv)
{
int k = 0;
#pragma omp parallel private(k) num_threads(4)
{
k = omp_get_thread_num();
printf("Hello World from %d !\n", k);
}
return 0;
}
In order to generate intermediate language from GCC v8.2.0, I compiled this program with the following command:
gcc -fopenmp -o hello.exe hello.c -fdump-tree-ompexp
And the result is given by:
;; Function main (main, funcdef_no=0, decl_uid=2694, cgraph_uid=0, symbol_order=0)
OMP region tree
bb 2: gimple_omp_parallel
bb 3: GIMPLE_OMP_RETURN
Added new low gimple function main._omp_fn.0 to callgraph
Introduced new external node (omp_get_thread_num/2).
Introduced new external node (printf/3).
;; Function main._omp_fn.0 (main._omp_fn.0, funcdef_no=1, decl_uid=2700, cgraph_uid=1, symbol_order=1)
main._omp_fn.0 (void * .omp_data_i)
{
int k;
<bb 6> :
<bb 3> :
k = omp_get_thread_num ();
printf ("Hello World from %d !\n", k);
return;
}
;; Function main (main, funcdef_no=0, decl_uid=2694, cgraph_uid=0, symbol_order=0)
Merging blocks 2 and 7
Merging blocks 2 and 4
main (int argc, char * * argv)
{
int k;
int D.2698;
<bb 2> :
k = 0;
__builtin_GOMP_parallel (main._omp_fn.0, 0B, 4, 0);
D.2698 = 0;
<bb 3> :
<L0>:
return D.2698;
}
The call to "__builtin_GOMP_parallel" is what interests me. So, I looked at the libGOMP source code in GCC.
However, the only related functions I found were (in the parallel.c file):
GOMP_parallel_start (void (*fn) (void *), void *data, unsigned num_threads)
GOMP_parallel_end (void)
So, I can imagine that, in some way, the call to "__builtin_GOMP_parallel" is transformed into GOMP_parallel_start and GOMP_parallel_end.
How can I verify this assumption? How can I find the translation from the builtin function to the two other functions I found in the source code?
Thank you
You almost got it. __builtin_GOMP_parallel is just a compiler alias for GOMP_parallel (defined in omp-builtins.def). The alias is resolved very late in compilation; you can see the actual call in the assembly produced by gcc -S.
GOMP_parallel is similar to
GOMP_parallel_start(...);
fn(...);
GOMP_parallel_end();
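The start / fn / end shape above can be sketched with single-threaded stand-ins. Everything prefixed mock_ is hypothetical and is not libgomp code; the real runtime would start worker threads in the start step, which this sketch does not do:

```c
#include <stdio.h>

static int team_active = 0; /* 1 while the mock "team" exists */
static int calls_made = 0;  /* how many times the outlined body ran */

/* Stand-in for GOMP_parallel_start: a real runtime would launch
 * the worker threads here, each running fn(data). */
static void mock_parallel_start(void (*fn)(void *), void *data,
                                unsigned num_threads)
{
    (void)fn; (void)data; (void)num_threads;
    team_active = 1;
}

/* Stand-in for GOMP_parallel_end: a real runtime would wait for
 * the workers and dissolve the team here. */
static void mock_parallel_end(void)
{
    team_active = 0;
}

/* Stand-in for GOMP_parallel: start the team, run the body on the
 * calling thread, then end the team. */
static void mock_parallel(void (*fn)(void *), void *data,
                          unsigned num_threads, unsigned flags)
{
    (void)flags;
    mock_parallel_start(fn, data, num_threads);
    fn(data);
    mock_parallel_end();
}

/* Plays the role of main._omp_fn.0 from the dump in the question. */
static void outlined_body(void *data)
{
    (void)data;
    calls_made++;
    printf("Hello World from the outlined body\n");
}
```

Calling mock_parallel(outlined_body, NULL, 4, 0) mirrors the __builtin_GOMP_parallel (main._omp_fn.0, 0B, 4, 0) line in the dump.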

c - Dynamically linking a PGI OpenACC-enabled library with gcc

Previously, I asked a question about creating a static library with PGI and linking it into a program built with gcc: c - Linking a PGI OpenACC-enabled library with gcc
Now I have the same question, but for dynamic linking. How can I build a program with gcc when my library is built as a shared object with PGI?
I also need the following:
Both should recognize the same OpenMP pragmas and routines. For instance, when I use an OpenMP critical region in the library, the whole program should be serialized in that section.
OpenACC pragmas are used in the library that is built with PGI.
The library is loaded fully dynamically by my application, that is, using dlopen to open the library and dlsym to find the functions.
My threads should be able to access the GPU simultaneously for data transfer and/or computation. For more details, see the following code snippets.
For instance, building the following library and main code results in this error: call to cuMemcpyHtoDAsync returned error 1: Invalid value
Note: When building the following code, I intentionally used libgomp (-lgomp) instead of PGI's OpenMP library (-lpgmp) in both cases, library and main.
Lib code:
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
#include <omp.h>
double calculate_sum(int n, double *a) {
double sum = 0;
int i;
#pragma omp critical
{
printf("Num devices: %d\n", acc_get_num_devices(acc_device_nvidia));
#pragma acc enter data copyin(a[0:n])
#pragma acc parallel
#pragma acc loop
for(i=0;i<n;i++) {
sum += a[i];
}
#pragma acc exit data delete(a[0:n])
}
return sum;
}
int ret_num_dev(int index) {
int dev = acc_get_num_devices(acc_device_nvidia);
if(dev == acc_device_nvidia)
printf("Num devices: %d - Current device: %d\n", dev, acc_get_device());
return dev;
}
I built the library with the following commands:
pgcc -acc -ta=nvidia:nordc -fPIC -c libmyacc.c
pgcc -shared -Wl,-soname,libctest.so.1 -o libmyacc.so -L/opt/pgi/linux86-64/16.5/lib -L/usr/lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -laccapi -laccg -laccn -laccg2 -ldl -lcudadevice -lgomp -lnuma -lpthread -lnspgc -lpgc -lm -lgcc -lc -lgcc libmyacc.o
Main code:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <dlfcn.h>
#define N 1000
// to make sure library is loaded just once for whole program
static void *lib_handle = NULL;
static int lib_loaded = 0;
static double (*calculate_sum2)(int , double *);
void call_lib_so() {
// load library just once and init the function pointer
// to function in the library.
if(lib_loaded == 0) {
lib_loaded = 1;
char *error;
lib_handle = dlopen("/home/millad/temp/gcc-pgi/libmyacc.so", RTLD_NOW);
if (!lib_handle) {
fprintf(stderr, "%s\n", dlerror());
exit(1);
}
calculate_sum2 = (double (*)(int , double *)) dlsym(lib_handle, "calculate_sum");
if ((error = dlerror()) != NULL) {
fprintf(stderr, "%s\n", error);
exit(1);
}
}
// execute the function per call
int n = N, i;
double *a = (double *) malloc(sizeof(double) * n);
for(i=0;i<n;i++)
a[i] = 1.0 * i;
double sum = (*calculate_sum2)(n, a);
free(a);
printf("-------- SUM: %.3f\n", sum);
// dlclose(lib_handle);
}
extern double calculate_sum(int n, double *a);
int main() {
// allocation and initialization of an array
double *a = (double*) malloc(sizeof(double) * N);
int i;
for(i=0;i<N;i++) {
a[i] = (i+1) * 1.0;
}
// access and run OpenACC region with all threads
#pragma omp parallel
call_lib_so();
return 0;
}
I built my main code with the following gcc command, as described by Mat in my previous question:
gcc f1.c -L/opt/pgi/linux86-64/16.5/lib -L/usr/lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -L. -laccapi -laccg -laccn -laccg2 -ldl -lcudadevice -lgomp -lnuma -lpthread -lnspgc -lpgc -lm -lgcc -lc -lgcc -lmyacc
Am I doing something wrong? Are the above steps correct?
Your code works correctly for me. I tried to use what you listed, but needed to remove the "-Wl,-soname,libctest.so.1" option, change the path that dlopen loads the .so from, and add "-DN=1024" to the gcc compilation line. After that, it compiled and ran fine.
% pgcc -acc -ta=nvidia:nordc -fPIC -c libmyacc.c -V16.5
% pgcc -shared -o libmyacc.so -L/opt/pgi/linux86-64/16.5/lib -L/usr/lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -laccapi -laccg -laccn -laccg2 -ldl -lcudadevice -lgomp -lnuma -lpthread -lnspgc -lpgc -lm -lgcc -lc -lgcc libmyacc.o -V16.5
% gcc f1.c -L/proj/pgi/linux86-64/16.5/lib -L/usr/lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -L. -laccapi -laccg -laccn -laccg2 -ldl -lcudadevice -lgomp -lnuma -lpthread -lnspgc -lpgc -lm -lgcc -lc -lgcc -lmyacc -DN=1024
% ./a.out
Num devices: 8
-------- SUM: 523776.000
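One detail worth checking in the question's loader: lib_loaded is set to 1 before calculate_sum2 is assigned, and call_lib_so runs inside an omp parallel region, so nothing stops a second thread from using the pointer before it is set. A minimal sketch of guarding the one-time load with a named critical section follows. The names sum_fn_t, local_sum, get_sum_fn and load_count are hypothetical, and local_sum stands in for the symbol that dlsym would return; whether this is related to the reported error is not established here:

```c
#include <stddef.h>

typedef double (*sum_fn_t)(int, double *);

/* Hypothetical stand-in for the function dlsym() would return. */
static double local_sum(int n, double *a)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

static sum_fn_t sum_fn = NULL; /* written once, inside the critical section */
static int load_count = 0;     /* how many times the load actually ran */

/* Returns the function pointer, doing the one-time load under a lock.
 * Without -fopenmp the pragma is ignored and the code runs serially. */
static sum_fn_t get_sum_fn(void)
{
    #pragma omp critical(lib_load)
    {
        if (sum_fn == NULL) {
            /* Real code would call dlopen()/dlsym() here (assumption). */
            sum_fn = local_sum;
            load_count++;
        }
    }
    return sum_fn;
}
```

The name on the critical section keeps it independent of the unnamed critical region used inside the library.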

GCC Vector Extensions Sqrt

I am currently experimenting with the GCC vector extensions. However, I am wondering how to go about getting sqrt(vec) to work as expected.
As in:
typedef double v4d __attribute__ ((vector_size (16)));
v4d myfunc(v4d in)
{
return some_sqrt(in);
}
and at least on a recent x86 system have it emit a call to the relevant intrinsic sqrtpd. Is there a GCC builtin for sqrt that works on vector types or does one need to drop down to the intrinsic level to accomplish this?
Looks like it's a bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54408 I don't know of any workaround other than doing it component-wise. The vector extensions were never meant to replace platform-specific intrinsics anyway.
Some funky code to this effect:
#include <cmath>
#include <cstddef>          // std::size_t
#include <initializer_list> // std::initializer_list
#include <utility>
template <::std::size_t...> struct indices { };
template <::std::size_t M, ::std::size_t... Is>
struct make_indices : make_indices<M - 1, M - 1, Is...> {};
template <::std::size_t... Is>
struct make_indices<0, Is...> : indices<Is...> {};
typedef float vec_type __attribute__ ((vector_size(4 * sizeof(float))));
template <::std::size_t ...Is>
vec_type sqrt_(vec_type const& v, indices<Is...> const)
{
vec_type r;
::std::initializer_list<int>{(r[Is] = ::std::sqrt(v[Is]), 0)...};
return r;
}
vec_type sqrt(vec_type const& v)
{
return sqrt_(v, make_indices<4>());
}
int main()
{
vec_type v = {1.0f, 4.0f, 9.0f, 16.0f}; // initialized so the read below is well-defined
return sqrt(v)[0];
}
You could also try your luck with auto-vectorization, which is separate from the vector extension.
You can loop over the vectors directly
#include <math.h>
typedef double v2d __attribute__ ((vector_size (16)));
v2d myfunc(v2d in) {
v2d out;
for(int i=0; i<2; i++) out[i] = sqrt(in[i]);
return out;
}
The sqrt function has to handle special cases such as negative inputs (where it sets errno) and NaN, but if you drop those requirements with -Ofast, both Clang and GCC produce just sqrtpd.
https://godbolt.org/g/aCuovX
GCC might have a bug because I had to loop to 4 even though there are only 2 elements to get optimal code.
With AVX and AVX512, however, both GCC and Clang generate ideal code.
AVX
https://godbolt.org/g/qdTxyp
AVX512
https://godbolt.org/g/MJP1n7
My reading of the question is that you want the square root of 4 packed double precision values... that's 32 bytes. Use the appropriate AVX intrinsic:
#include <x86intrin.h>
typedef double v4d __attribute__ ((vector_size (32)));
v4d myfunc (v4d v) {
return _mm256_sqrt_pd(v);
}
x86-64 gcc 10.2 and x86-64 clang 10.0.1
using -O3 -march=skylake :
myfunc:
vsqrtpd %ymm0, %ymm0 # (or just `ymm0` for Intel syntax)
ret
ymm0 is the return value register.
That said, it just so happens there is a builtin: __builtin_ia32_sqrtpd256, which doesn't require the intrinsics header. I would definitely discourage its use however.
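The size arithmetic behind the "that's 32 bytes" remark can be checked directly. This is a small sketch that relies on the GCC/Clang vector_size attribute; the helper names are hypothetical:

```c
#include <stddef.h>

typedef double v2d __attribute__((vector_size(16))); /* 16 bytes of double */
typedef double v4d __attribute__((vector_size(32))); /* 32 bytes of double */

/* Number of double lanes in a 16-byte vector. */
static size_t lanes_in_16_bytes(void)
{
    return sizeof(v2d) / sizeof(double);
}

/* Number of double lanes in a 32-byte vector. */
static size_t lanes_in_32_bytes(void)
{
    return sizeof(v4d) / sizeof(double);
}
```

So the vector_size (16) typedef in the question holds two doubles, and four doubles need vector_size (32).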
