Raspberry PI4 Segmentation Fault - gcc

I have a short program that causes a segmentation fault on an RPi4 after it has been run several times (e.g., 10 times in a loop).
I am using Raspbian GNU/Linux 10 (buster) and the default gcc compiler (sudo apt install build-essential):
gcc --version
gcc (Raspbian 8.3.0-6+rpi1) 8.3.0
Do you think this is a gcc compiler problem? Maybe I am missing some special settings for RPi4.
I am using this to build:
gcc threads.c -o threads -l pthread
The output is sometimes (not always) something like this:
...
in thread_dummy, loop: 003
Segmentation fault
The code is here:
#include <stdio.h> /* for puts() */
#include <unistd.h> /* for sleep() */
#include <stdlib.h> /* for EXIT_SUCCESS */
#include <pthread.h>
#define PTR_SIZE (0xFFFFFF)
#define PTR_CNT (10)
void* thread_dummy(void* param)
{
void* ptr = malloc(PTR_SIZE);
//fprintf(stderr, "thread num: %03i, stack: %08X, heap: %08X - %08X\n", (int)param, (unsigned int)&param, (unsigned int)ptr, (unsigned int)((unsigned char*)ptr + PTR_SIZE));
fprintf(stderr, "in thread_dummy, loop: %03i\n", (int)param);
sleep(1);
free(ptr);
pthread_detach(pthread_self());
return NULL;
}
int main(void)
{
void* ptrs[PTR_CNT];
pthread_t threads[PTR_CNT];
for(int i=0; i<PTR_CNT; ++i)
{
ptrs[i] = malloc(PTR_SIZE);
//fprintf(stderr, "main num: %03i, stack: %08X, heap: %08X - %08X\n", i, (unsigned int)&ptrs, (unsigned int)ptrs[i], (unsigned int)((unsigned char*)ptrs[i] + PTR_SIZE));
fprintf(stderr, "in main, loop: %03i\n", i);
}
fprintf(stderr, "-----------------------------------------------------------\n");
for(int i=0; i<PTR_CNT; ++i)
pthread_create(&threads[i], 0, thread_dummy, (void*)i);
for(int i=0; i<PTR_CNT; ++i)
pthread_join(threads[i], NULL);
for(int i=0; i<PTR_CNT; ++i)
free(ptrs[i]);
return EXIT_SUCCESS;
}
UPDATE:
I also tested it with new gcc, but the problem remains...
gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/arm-linux-gnueabihf/11.1.0/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../configure --enable-languages=c,c++,fortran --with-cpu=cortex-a72 --with-fpu=neon-fp-armv8 --with-float=hard --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 11.1.0 (GCC)

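pthread_create is like malloc, and pthread_detach and pthread_join are like free: each thread's resources must be released exactly once. You are basically doing a "double free", because the thread detaches itself and main joins it as well. Calling pthread_join on a thread that is not joinable is undefined behavior, which is why the crash shows up only sometimes. Either detach or join the thread, not both.
You could remove pthread_join from main, but then main may return and end the process before the threads have finished. The more logical fix is to remove pthread_detach(pthread_self()) from the thread function and keep the joins; the detach is of little use there anyway, because the thread terminates right after it.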

OpenMP pragma translation to runtime calls

I wrote a short program in C with OpenMP pragma, and I need to know to which libGOMP function a pragma is translated by GCC.
Here is my marvelous code:
#include <stdio.h>
#include "omp.h"
int main(int argc, char** argv)
{
int k = 0;
#pragma omp parallel private(k) num_threads(4)
{
k = omp_get_thread_num();
printf("Hello World from %d !\n", k);
}
return 0;
}
In order to generate intermediate language from GCC v8.2.0, I compiled this program with the following command:
gcc -fopenmp -o hello.exe hello.c -fdump-tree-ompexp
And the result is given by:
;; Function main (main, funcdef_no=0, decl_uid=2694, cgraph_uid=0, symbol_order=0)
OMP region tree
bb 2: gimple_omp_parallel
bb 3: GIMPLE_OMP_RETURN
Added new low gimple function main._omp_fn.0 to callgraph
Introduced new external node (omp_get_thread_num/2).
Introduced new external node (printf/3).
;; Function main._omp_fn.0 (main._omp_fn.0, funcdef_no=1, decl_uid=2700, cgraph_uid=1, symbol_order=1)
main._omp_fn.0 (void * .omp_data_i)
{
int k;
<bb 6> :
<bb 3> :
k = omp_get_thread_num ();
printf ("Hello World from %d !\n", k);
return;
}
;; Function main (main, funcdef_no=0, decl_uid=2694, cgraph_uid=0, symbol_order=0)
Merging blocks 2 and 7
Merging blocks 2 and 4
main (int argc, char * * argv)
{
int k;
int D.2698;
<bb 2> :
k = 0;
__builtin_GOMP_parallel (main._omp_fn.0, 0B, 4, 0);
D.2698 = 0;
<bb 3> :
<L0>:
return D.2698;
}
The call to "__builtin_GOMP_parallel" is what interests me, so I looked at the libGOMP source code in GCC.
However, the only functions I found were these (in parallel.c):
GOMP_parallel_start (void (*fn) (void *), void *data, unsigned num_threads)
GOMP_parallel_end (void)
So I can imagine that, in some way, the call to "__builtin_GOMP_parallel" is transformed into GOMP_parallel_start and GOMP_parallel_end.
How can I verify this assumption? How can I find the translation from the builtin function to the two functions I found in the source code?
Thank you
You almost got it. __builtin_GOMP_parallel is just a compiler alias for GOMP_parallel (defined in omp-builtins.def). It is translated very late in compilation, so you can see the actual call to GOMP_parallel in the assembly produced with gcc -S.
GOMP_parallel is similar to
GOMP_parallel_start(...);
fn(...);
GOMP_parallel_end();

__seg_fs on GCC. Is it possible to emulate it just in a program?

I've just read about support for %fs and %gs segment prefixes on the Intel platforms in GCC.
It was mentioned that "The way you obtain %gs-based pointers, or control the
value of %gs itself, is out of the scope of gcc;"
I'm looking for a way to set the value of %fs manually (I'm on IA32, RH Linux) and work with it. When I just set %fs=%ds, the test below works fine, as expected. But I cannot change the test to use a different value of %fs without getting a segmentation fault. I am starting to think that changing the value of %fs is not the only thing required. So I'm looking for advice on how to make a region of memory addressable through %fs when %fs is not equal to %ds.
#include <stddef.h>
#include <stdio.h> /* for puts() */
typedef char __seg_fs fs_ptr;
fs_ptr p[] = {'h','e','l','l','o','\0'};
void fs_puts(fs_ptr *s)
{
char buf[100];
buf[0] = s[0];
buf[1] = s[1];
buf[2] = s[2];
buf[3] = '\0';
puts(buf);
}
void __attribute__((constructor)) set_fs()
{
__asm__("mov %ds, %bx\n\t"
"add $0, %bx\n\t" //<---- if fs=ds then the program executes as expected. If not $0 here, then segmentation fault happens.
"mov %bx, %fs\n\t");
}
int main()
{
fs_puts(p);
return 0;
}
I talked with Armin, who implemented __seg_gs/__seg_fs in GCC (thanks, Armin!).
Basically, I cannot use these keywords for globals. The aim of introducing __seg_gs/__seg_fs was to make it possible to dynamically allocate regions of memory that are thread-local.
We cannot use __thread for a pointer and allocate memory for it with malloc, but __seg_gs/__seg_fs make that possible.
The test below illustrates this.
Note that arch_prctl() is used; it exists only on 64-bit.
Also note that on 64-bit, %fs is used for __thread and %gs is free.
#include <stddef.h>
#include <string.h>
#include <stdio.h>
#include <asm/ldt.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <asm/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
typedef __seg_gs char gs_str;
void gs_puts(gs_str *ptr)
{
int i;
char buf[100];
for(i = 0; i < 99; i++)
buf[i] = ptr[i];
buf[99] = '\0'; /* terminate: the source buffer carries no trailing NUL */
puts(buf);
}
int main()
{
int i;
void *buffer = malloc(100 * sizeof(char));
syscall(SYS_arch_prctl, ARCH_SET_GS, buffer); /* invoked via syscall(): glibc headers declare no arch_prctl() prototype */
gs_str *gsobj = (gs_str *)0;
for (i = 0; i < 100; i++)
gsobj[i] = 'a'; /* in the %gs space */
gs_puts(gsobj);
return 0;
}

Knowing what SIMD instructions OpenMP 4.0 will produce?

Short of checking the actual assembly produced, is there any way to determine what platform-specific instructions will be utilised by OpenMP, for a given use case?
For example, I've identified pcmpeqq i.e. 64-bit integer word equality (SSE 4.1) as the desirable instruction rather than pcmpeqd i.e. 32-bit word equality (SSE 2). Is there any way to know that OpenMP 4.0 will produce the former and not the latter? (spec does not address such specifics.)
The only way to ever guarantee that any compiler will ever emit a particular assembly instruction is to hardcode it. There's no spec in the world that constrains the compiler to generate specific instructions for a given language feature.
Having said that, if support for SSE4.1 or better is specified implicitly or explicitly on the command line, it would greatly surprise me if many compilers emitted SSE2 instructions in situations where the later instructions would work.
Checking the assembly isn't difficult:
$ cat foo.c
#include <stdio.h>
int main(int argc, char **argv) {
const int n=128;
long x[n];
long y[n];
for (int i=0; i<n/2; i++) {
x[i] = y[i] = 1;
x[i+n/2] = 2;
y[i+n/2] = 2;
}
#pragma omp simd
for (int i=0; i<n; i++)
x[i] = (x[i] == y[i]);
for (int i=0; i<n; i++)
printf("%d: %ld\n", i, x[i]);
return 0;
}
$ icc -openmp -msse4.1 -o foo41.s foo.c -S -std=c99 -qopt-report-phase=vec -qopt-report=2
icc: remark #10397: optimization reports are generated in *.optrpt files in the output location
$ icc -openmp -msse2 -o foo2.s foo.c -S -std=c99 -qopt-report-phase=vec -qopt-report=2 -o foo2.s
icc: remark #10397: optimization reports are generated in *.optrpt files in the output location
And sure enough:
$ grep pcmp foo41.s
pcmpeqq (%rax,%rsi,8), %xmm0 #18.25
$ grep pcmp foo2.s
pcmpeqd (%rax,%rsi,8), %xmm2 #18.25

Failing to link c code to lapack: undefined reference

I am trying to use lapack functions from C.
Here is some test code, copied from this question
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include "clapack.h"
#include "cblas.h"
void invertMatrix(float *a, unsigned int height){
int info, ipiv[height];
info = clapack_sgetrf(CblasColMajor, height, height, a, height, ipiv);
info = clapack_sgetri(CblasColMajor, height, a, height, ipiv);
}
void displayMatrix(float *a, unsigned int height, unsigned int width)
{
int i, j;
for(i = 0; i < height; i++){
for(j = 0; j < width; j++)
{
printf("%1.3f ", a[height*j + i]);
}
printf("\n");
}
printf("\n");
}
int main(int argc, char *argv[])
{
int i;
float a[9], b[9], c[9];
srand(time(NULL));
for(i = 0; i < 9; i++)
{
a[i] = 1.0f*rand()/RAND_MAX;
b[i] = a[i];
}
displayMatrix(a, 3, 3);
return 0;
}
I compile this with gcc:
gcc -o test test.c \
-lblas -llapack -lf2c
N.B.: I've tried those libraries in various orders, and I've also tried other libs such as -latlas, -lcblas, -lgfortran, etc.
The error message is:
/tmp//cc8JMnRT.o: In function `invertMatrix':
test.c:(.text+0x94): undefined reference to `clapack_sgetrf'
test.c:(.text+0xb4): undefined reference to `clapack_sgetri'
collect2: error: ld returned 1 exit status
clapack.h is found and included (installed as part of atlas). clapack.h includes the offending functions --- so how can they not be found?
The symbols are actually in the library libalapack (found using strings). However, adding -lalapack to the gcc command seems to require adding -lcblas (lots of undefined cblas_* references). Installing cblas automatically uninstalls atlas, which removes clapack.h.
So, this feels like some kind of dependency hell.
I am on FreeBSD 10 amd64, all the relevant libraries seem to be installed and on the right paths.
Any help much appreciated.
Thanks
Ivan
I uninstalled everything remotely relevant --- blas, cblas, lapack, atlas, etc. --- then reinstalled atlas (from ports) alone, and then the lapack and blas packages.
This time around, /usr/local/lib contained a new lib file: libcblas.so --- previous random installations must have deleted it.
The gcc line that compiles is now:
gcc -o test test.c \
-llapack -lblas -lalapack -lcblas
Changing the order of the -l arguments doesn't seem to make any difference.

openMP is not creating threads in visual studio

The OpenMP version of my program did not give any speed boost. I have a dual-core machine and the CPU usage is always 50%, so I tried the sample program given in Wiki. It looks like the OpenMP compiler (Visual Studio 2008) is not creating more than one thread.
This is the program:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char *argv[]) {
int th_id, nthreads;
#pragma omp parallel private(th_id)
{
th_id = omp_get_thread_num();
printf("Hello World from thread %d\n", th_id);
#pragma omp barrier
if ( th_id == 0 ) {
nthreads = omp_get_num_threads();
printf("There are %d threads\n",nthreads);
}
}
return EXIT_SUCCESS;
}
This is the output that I get:
Hello World from thread 0
There are 1 threads
Press any key to continue . . .
There's nothing wrong with the program, so presumably there's some issue with how it's being compiled or run. Is this VS2008 Pro? A quick search suggests OpenMP is not enabled in Standard. Is OpenMP enabled in Properties -> C/C++ -> Language -> OpenMP (i.e., are you compiling with /openmp)? Is the environment variable OMP_NUM_THREADS being set to 1 somewhere when you run this?
If you want to test out your program with more than one thread, there are several constructs for specifying the number of threads in an OpenMP parallel region. They are, in order of precedence:
Evaluation of the if clause
Setting of the num_threads clause
Use of the omp_set_num_threads() library function
Setting of the OMP_NUM_THREADS environment variable
Implementation default
It sounds like your implementation is defaulting to one thread (assuming you don't have OMP_NUM_THREADS=1 set in your environment).
To test with 4 threads, for instance, you could add num_threads(4) to your #pragma omp parallel directive.
As the other answer noted, you won't really see any "speedup" because you aren't exploiting any parallelism. But it is reasonable to want to run a "hello world" program with several threads to test it out.
As mentioned here (http://docs.oracle.com/cd/E19422-01/819-3694/5_compiling.html), I got it working by setting the environment variable OMP_DYNAMIC to FALSE.
Why would you need more than one thread for that program? It's clearly the case that OpenMP realizes that it doesn't need to create an extra thread to run a program with no loops, no code that could run in parallel whatsoever.
Try running some parallel stuff with OpenMP. Something like this:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define CHUNKSIZE 10
#define N 100
int main (int argc, char *argv[])
{
int nthreads, tid, i, chunk;
float a[N], b[N], c[N];
/* Some initializations */
for (i=0; i < N; i++)
a[i] = b[i] = i * 1.0;
chunk = CHUNKSIZE;
#pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
{
tid = omp_get_thread_num();
if (tid == 0)
{
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
printf("Thread %d starting...\n",tid);
#pragma omp for schedule(dynamic,chunk)
for (i=0; i<N; i++)
{
c[i] = a[i] + b[i];
printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
}
} /* end of parallel section */
}
If you want some hard core stuff, try running one of these.
