Pthreads in Mac OS X - Mutexes issue

I'm trying to learn how to program parallel algorithms in C using POSIX threads. My environment is Mac OS X 10.5.5 with gcc 4.
Compiling:
gcc -Wall -D_REENTRANT -lpthread source.c -o test.o
My problem is that if I compile this on an Ubuntu 9.04 box, it runs smoothly in thread order, but on the Mac it looks like the mutexes don't work and the threads don't wait to get the shared information.
Mac:
#1
#0
#2
#5
#3
#4
Ubuntu:
#0
#1
#2
#3
#4
#5
Any ideas?
The source code follows below:
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#define NUM_THREADS 6
pthread_mutex_t mutexsum;
pthread_t threads[NUM_THREADS];
long Sum;
void *SumThreads(void *threadid){
    int tmp;
    int i, x[10], y[10];

    // For each x and y in the arrays, store the value of i, purely for didactic purposes
    for (i = 0; i < 10; i++){
        x[i] = i;
        y[i] = i;
    }

    tmp = Sum;
    for (i = 0; i < 10; i++){
        tmp += (x[i] * y[i]);
    }

    pthread_mutex_lock(&mutexsum);
    Sum += tmp;
    printf("Im thread #%ld sum until now is: %ld\n", (long)threadid, Sum);
    pthread_mutex_unlock(&mutexsum);

    return NULL;
}
int main(int argc, char *argv[]){
    long i;   /* long, so the cast to void * below is clean on 64-bit */

    Sum = 0;
    pthread_mutex_init(&mutexsum, NULL);
    for (i = 0; i < NUM_THREADS; i++){
        pthread_create(&threads[i], NULL, SumThreads, (void *)i);
    }

    pthread_exit(NULL);
}

There is nothing in your code that will make your threads run in ANY particular order. If it runs in order on Ubuntu, it might be because you are just lucky. Try running it 1000 times on Ubuntu and see if you get the same results over and over again.
The thing is, you can't control the way the scheduler will give your threads access to the processor(s). So, when you iterate through the for loop that creates your threads, you can't assume that the first call to pthread_create will get to run first, or will get to lock the mutex first. It's up to the scheduler, which is at the OS level, and you can't control it, unless you write your own kernel :-).
If you want serial behavior, why would you run your code in separate threads in the first place? If it is just for experimentation, then one solution I can think of is using a condition variable (pthread_cond_signal) to wake a specific thread up and let it run... Then the woken-up thread can wake up the second one, and so on and so forth.
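A minimal sketch of that idea with a mutex, a condition variable, and a turn counter (the names here are illustrative, not from your code):
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 6

static pthread_mutex_t turn_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  turn_cond = PTHREAD_COND_INITIALIZER;
static long turn = 0;                 /* id of the thread allowed to proceed */

static void *worker(void *arg)
{
    long id = (long)arg;

    pthread_mutex_lock(&turn_lock);
    while (turn != id)                /* sleep until it is this thread's turn */
        pthread_cond_wait(&turn_cond, &turn_lock);

    printf("I'm thread #%ld\n", id);

    turn++;                           /* hand the turn to the next id */
    pthread_cond_broadcast(&turn_cond);
    pthread_mutex_unlock(&turn_lock);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    long i;

    for (i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}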
Hope it helps.

To my recollection, the variable you have protected isn't actually being shared among the threads. It exists in its own context inside each of the threads, so it's really just a matter of when each thread gets scheduled that determines what will print.
I don't think one simple mutex will allow you to guarantee correctness, if correctness is defined as printing 0, 1, 2, 3...
What your code is doing is creating multiple execution contexts, using the code in your sum function as their execution code. The variable you are protecting, unless declared static, will be unique to each call of that function.
In the end, it is a coincidence that you are getting one system to print in order, because you have no logical method of blocking threads until it is their proper turn.

I don't do pthreads in C or any other language (though I do thread programming on high-performance computers), so this 'answer' might be useless to you.
What in your code requires the threads to pass the mutex in thread-id order? I see that the threads are created in id order, but what requires them to execute in that order?
If you do require your threads to execute in id order, why? It seems a bit as if you are creating threads, then serializing them. To what end?
When I program with threads and worry about execution order, I often try creating a very large number of threads and seeing what happens to the execution order.
As I say, ignore this if my understanding of C and pthreads is too poor.

Related

What is causing this error: SSE register return with SSE disabled?

I'm new to kernel development, and I need to write a Linux kernel module that performs several matrix multiplications (I'm working on an x86_64 platform). I'm trying to use fixed-point values for these operations; however, during compilation, the compiler reports this error:
error: SSE register return with SSE disabled
I don't know that much about SSE or this issue in particular, but from what I've found, and according to most answers to questions about this problem, it is related to the use of floating-point (FP) arithmetic in kernel space, which is rarely a good idea (hence the use of fixed-point arithmetic). This error seems weird to me because I'm pretty sure I'm not using any FP values or operations, yet it keeps popping up in ways that seem strange to me. For instance, I have this block of code:
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>

const int scale = 16;

#define DOUBLE_TO_FIXED(x) ((x) * (1 << scale))
#define FIXED_TO_DOUBLE(x) ((x) / (1 << scale))
#define MULT(x, y) ((((x) >> 8) * ((y) >> 8)) >> 0)
#define DIV(x, y) (((x) << 8) / (y) << 8)

#define OUTPUT_ROWS 6
#define OUTPUT_COLUMNS 2

struct matrix {
    int rows;
    int cols;
    double *data;
};

double outputlayer_weights[OUTPUT_ROWS * OUTPUT_COLUMNS] =
{
     0.7977986, -0.77172316,
    -0.43078753, 0.67738613,
    -1.04312621, 1.0552227,
    -0.32619684, 0.14119884,
    -0.72325027, 0.64673559,
     0.58467862, -0.06229197
};
...
void matmul(struct matrix *A, struct matrix *B, struct matrix *C) {
    int i, j, k, a, b, sum, fixed_prod;

    if (A->cols != B->rows) {
        return;
    }
    for (i = 0; i < A->rows; i++) {
        for (j = 0; j < B->cols; j++) {
            sum = 0;
            for (k = 0; k < A->cols; k++) {
                a = DOUBLE_TO_FIXED(A->data[i * A->rows + k]);
                b = DOUBLE_TO_FIXED(B->data[k * B->rows + j]);
                fixed_prod = MULT(a, b);
                sum += fixed_prod;
            }
            /* Commented out the following line, causes the error */
            //C->data[i * C->rows + j] = sum;
        }
    }
}
...
static int __init insert_matmul_init(void)
{
    printk(KERN_INFO "INSERTING MATMUL");
    return 0;
}

static void __exit insert_matmul_exit(void)
{
    printk(KERN_INFO "REMOVING MATMUL");
}

module_init(insert_matmul_init);
module_exit(insert_matmul_exit);
which compiles with no errors (I left out code that I found irrelevant to the problem). I have made sure to comment out any error-prone lines so that the module compiles cleanly, and I am trying to solve the errors one by one. However, when I uncomment this line:
C->data[i * C->rows + j] = sum;
I get this error message in a previous (unmodified) line of code:
error: SSE register return with SSE disabled
sum += fixed_prod;
~~~~^~~~~~~~~~~~~
From what I understand, there are no FP operations taking place, at least in this section, so I need help figuring out what might be causing this error. Maybe my fixed-point implementation is flawed (I'm no expert in that matter either), or maybe I'm missing something obvious. Just in case, I have tested the same logic in a user-space program (using floating-point values) and it seems to work fine. In any case, any help in solving this issue would be appreciated. Thanks in advance!
Edit: I have included the definition of matrix and an example matrix. I have been using the default kbuild command for building external modules; here is what my Makefile looks like:
obj-m = matrix_mult.o
KVERSION = $(shell uname -r)
all:
	make -C /lib/modules/$(KVERSION)/build M=$(PWD) modules
Linux compiles kernel code with -mgeneral-regs-only on x86, which produces this error in functions that do anything with FP or SIMD. (Except via inline asm, because then the compiler doesn't see the FP instructions, only the assembler does.)
From what I understand, there are no FP operations taking place, at least in this section, so I need help figuring out what might be causing this error.
GCC optimizes whole functions when optimization is enabled, and you are using FP inside that function. You're doing an FP multiply and a truncating conversion to integer with your DOUBLE_TO_FIXED macro, assigning the result to an int, since the MCVE you eventually provided shows struct matrix containing double *data.
If you stop the compiler from using FP instructions (like Linux does by building with -mgeneral-regs-only), it refuses to compile your file instead of doing software floating-point.
The only odd thing is that it pins down the error to an integer +=, instead of one of the statements that compiles to a mulsd and cvttsd2si.
If you disable optimization (-O0 -mgeneral-regs-only) you get a more obvious location for the same error (https://godbolt.org/z/Tv5nG6nd4):
<source>: In function 'void matmul(matrix*, matrix*, matrix*)':
<source>:9:33: error: SSE register return with SSE disabled
9 | #define DOUBLE_TO_FIXED(x) ((x) * (1 << scale))
| ~~~~~^~~~~~~~~~~~~~~
<source>:46:21: note: in expansion of macro 'DOUBLE_TO_FIXED'
46 | a = DOUBLE_TO_FIXED(A->data[i * A->rows + k]);
| ^~~~~~~~~~~~~~~
If you really want to know what's going on with the GCC internals, you could dig into it with -fdump-tree-... options, e.g. on the Godbolt compiler explorer there's a dropdown for GCC Tree / RTL output that would let you look at the GIMPLE or RTL internal representation of your function's logic after various analyzer passes.
But if you just want to know whether there's a way to make this function work: no, obviously not, unless you compile the file without -mgeneral-regs-only. All functions in a file compiled that way must only be called by callers that have used kernel_fpu_begin() before the call (and kernel_fpu_end() after).
You can't safely use kernel_fpu_begin inside a function compiled to allow it to use SSE / x87 registers; it might already have corrupted user-space FPU state before calling the function, after optimization. The symptom of getting this wrong is not a fault, it's corrupting user-space state, so don't assume that happens to work = correct. Also, depending on how GCC optimizes, the code-gen might be fine with your version, but might be broken with earlier or later GCC or clang versions. I somewhat expect that kernel_fpu_begin() at the top of this function would get called before the compiler did anything with FP instructions, but that doesn't mean it would be safe and correct.
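A hedged sketch of that call-site pattern (the file split and the fp_matmul name are assumptions; the header providing kernel_fpu_begin() varies across kernel versions, <asm/fpu/api.h> on recent x86):
/* caller.c -- built with the kernel's normal flags (-mgeneral-regs-only) */
#include <asm/fpu/api.h>

struct matrix;   /* as defined in the question */

/* fp_matmul() lives in a separate file built WITHOUT -mgeneral-regs-only */
extern void fp_matmul(struct matrix *A, struct matrix *B, struct matrix *C);

void do_matmul(struct matrix *A, struct matrix *B, struct matrix *C)
{
    kernel_fpu_begin();   /* saves user-space FPU/SIMD state */
    fp_matmul(A, B, C);   /* FP/SIMD instructions may run only inside this window */
    kernel_fpu_end();     /* restores user-space state */
}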
See also Generate and optimize FP / SIMD code in the Linux Kernel on files which contains kernel_fpu_begin()?
Apparently -msse2 overrides -mgeneral-regs-only, so the latter is probably just an alias for -mno-mmx -mno-sse and whatever option disables x87. So you might be able to use __attribute__((target("sse2"))) on a function without changing build options for it, but that would be x86-specific. Of course, so is -mgeneral-regs-only. And there isn't a -mno-general-regs-only option to override the kernel's normal CFLAGS.
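If that works, usage would look something like this (untested and x86-specific; whether GCC honors the override under the kernel's flags is exactly the open question above):
/* hypothetical per-function override instead of per-file build options */
__attribute__((target("sse2")))
static void fp_helper(void)
{
    /* FP/SIMD code would go here; callers still need
       kernel_fpu_begin()/kernel_fpu_end() around the call */
}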
I don't have a specific suggestion for the best way to set up a build option, if you really do think it's worth using kernel_fpu_begin at all here (rather than using fixed-point the whole way through).
Obviously if you do save/restore the FPU state, you might as well use it for the loop instead of using FP to convert to fixed-point and back.
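A minimal sketch of that all-fixed-point route, assuming the double weights are converted to Q16 integers ahead of time (offline or at init) so that no FP type ever reaches kernel code; the struct and function names here are illustrative:
#define SCALE 16

struct fixed_matrix {
    int rows;
    int cols;
    int *data;   /* Q16 fixed-point values; no FP types anywhere */
};

static void fixed_matmul(const struct fixed_matrix *A,
                         const struct fixed_matrix *B,
                         struct fixed_matrix *C)
{
    int i, j, k;

    if (A->cols != B->rows)
        return;
    for (i = 0; i < A->rows; i++) {
        for (j = 0; j < B->cols; j++) {
            long long sum = 0;   /* wide accumulator to avoid overflow */
            for (k = 0; k < A->cols; k++) {
                long long a = A->data[i * A->cols + k];   /* row-major: stride is cols */
                long long b = B->data[k * B->cols + j];
                sum += (a * b) >> SCALE;   /* Q16 * Q16 -> Q32, shifted back to Q16 */
            }
            C->data[i * C->cols + j] = (int)sum;
        }
    }
}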

Is using -fopenmp necessary when using thread sanitizers in clang or gcc

I am trying to use ThreadSanitizer on a given piece of code (in the ok.c file), as follows:
clang -fsanitize=thread ok.c -w -I../runtime
This works fine and no data race is detected, but when I also give the -fopenmp option, it dumps a possible data-race location in the loop to the terminal.
clang -fsanitize=thread -fopenmp ok.c -w -I../runtime
Terminal output:
WARNING: ThreadSanitizer: data race (pid=7980)
  Atomic read of size 1 at 0x7d680001f700 by thread T2:
    #0 pthread_mutex_lock <null> (a.out+0x000000439b00)
    #1 __kmp_reap_worker <null> (libomp.so.5+0x0000000477a2)
The loop in question is:
int l_3438[10]; // shared
int i;

#pragma omp parallel for
for (i = 0; i < 10; i++){
    l_3438[i] = (-10L);
}
I tried using the shared and private attributes as well, to make things clearer:
int l_3438[10]; // shared
int i;

#pragma omp parallel for shared(l_3438) private(i)
for (i = 0; i < 10; i++){
    l_3438[i] = (-10L);
}
Question: Is -fopenmp flag necessary when using thread sanitizer?
Thanks.
Unless you are concerned about false positives (the sanitizer reporting data races when there are none), I find that the question (as posted) should be reversed. It should have been: should I use ThreadSanitizer for OpenMP programs?
If your aim is to detect data races that might result from using OpenMP constructs, then you should definitely use ThreadSanitizer with such programs.
And if your question is really about avoiding false positives when using ThreadSanitizer with OpenMP programs, that is covered in this post.
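If all you need in the meantime is to quiet reports that originate inside the OpenMP runtime itself, ThreadSanitizer's generic suppressions mechanism is one workaround (the file name is arbitrary; this hides runtime-internal reports rather than fixing anything):
# tsan.supp -- suppress reports whose stacks come from the OpenMP runtime
called_from_lib:libomp.so

$ TSAN_OPTIONS="suppressions=tsan.supp" ./a.out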

OpenACC 2.0 routine: data locality

Take the following code, which illustrates the calling of a simple routine on the accelerator, compiled for the device using OpenACC 2.0's routine directive:
#include <iostream>

#pragma acc routine
int function(int *ARRAY, int multiplier){
    int sum = 0;
    #pragma acc loop reduction(+:sum)
    for(int i = 0; i < 10; ++i){
        sum += multiplier * ARRAY[i];
    }
    return sum;
}

int main(){
    int *ARRAY = new int[10];
    int multiplier = 5;
    int out;

    for(int i = 0; i < 10; i++){
        ARRAY[i] = 1;
    }

    #pragma acc enter data create(out) copyin(ARRAY[0:10],multiplier)
    #pragma acc parallel present(out,ARRAY[0:10],multiplier)
    if (function(ARRAY, multiplier) == 50){
        out = 1;
    } else {
        out = 0;
    }
    #pragma acc exit data copyout(out) delete(ARRAY[0:10],multiplier)

    std::cout << out << std::endl;
}
How does function know to use the device copies of ARRAY[0:10] and multiplier when it is called from within a parallel region? How can we enforce the use of the device copies?
When your routine is called within a device region (the parallel in your code), it is being called by the threads on the device, and those threads only have access to arrays on the device. The compiler may actually choose to inline that function, or it may be a device-side function call. Either way, you can know that when the function is called from the device it will be receiving device copies of the data, because the function essentially inherits the present data clause from the parallel region. If you still want to convince yourself that you're running on the device once inside the function, you could call acc_on_device, but that only tells you that you're running on the accelerator, not that you received a device pointer.
If you want to enforce the use of device copies more than that, you could make the routine nohost so that it would technically not be valid to call from the host, but that doesn't really do what you're asking, which is to do a check on the GPU that the array really is a device array.
Keep in mind though that any code inside a parallel region that is not inside a loop will be run gang-redundantly, so the write to out is likely a race condition, unless you happen to be running with one gang or you write to it using an atomic.
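As a hedged C sketch of the atomic option (the explicit vector level on the routine is my assumption, not something the question's code specifies):
#include <stdio.h>

#pragma acc routine vector
int function(int *ARRAY, int multiplier){
    int sum = 0;
    #pragma acc loop reduction(+:sum)
    for(int i = 0; i < 10; ++i){
        sum += multiplier * ARRAY[i];
    }
    return sum;
}

int main(void){
    int ARRAY[10];
    int multiplier = 5;
    int out;

    for(int i = 0; i < 10; i++){
        ARRAY[i] = 1;
    }

    #pragma acc enter data create(out) copyin(ARRAY[0:10], multiplier)
    #pragma acc parallel present(out, ARRAY[0:10], multiplier)
    {
        int result = function(ARRAY, multiplier);
        #pragma acc atomic write
        out = (result == 50) ? 1 : 0;   /* well-defined even when run gang-redundantly */
    }
    #pragma acc exit data copyout(out) delete(ARRAY[0:10], multiplier)

    printf("%d\n", out);
    return 0;
}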
Basically, when you use a "data" clause, the device will create/copy the data to device memory, and then the block of code marked with "acc routine" will be executed on the device. Note that memory is not shared between the host and the device, unlike in multi-threading (OpenMP). So yes, "function" will be using the device copies of ARRAY and multiplier as long as they are inside the data region. Hope this helps! :)
You should assign the routine one parallelism level, such as gang/worker/vector; that is the more accurate way. The routine will then use the data in device memory.

How to sleep() from kernel init?

I'm debugging some kernel init code with an oscilloscope by setting values on GPIOs. What is the best way to sleep() for a given time very early, i.e., in ddr3_init()?
Thank you
You could use a busy loop that stops after a given time interval. This should sleep for one second (I'm not sure it works; I put it together by looking at the time.h header):
#include <linux/time.h>

struct timespec start_ts = current_kernel_time();
s64 start = timespec_to_ns(&start_ts);
s64 now;   /* declared outside the loop so the condition can see it */

do {
    struct timespec now_ts = current_kernel_time();
    now = timespec_to_ns(&now_ts);
} while (now - start < 1000000000LL);   /* 10^9 ns = 1 second */
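Alternatively, if the standard kernel delay helpers are already usable at that point in init (an assumption: udelay/mdelay rely on a calibrated delay loop, which may not be ready extremely early), linux/delay.h provides ready-made busy-waits:
#include <linux/delay.h>

/* busy-wait delays; they never sleep, so they work where scheduling
   is not available yet */
udelay(50);      /* ~50 microseconds */
mdelay(1000);    /* ~1 second */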

Why is this signal handler called infinitely?

I am using Mac OS 10.6.5 with g++ 4.2.1, and I've run into a problem with the following code:
#include <iostream>
#include <sys/signal.h>

using namespace std;

void segfault_handler(int signum)
{
    cout << "segfault caught!!!\n";
}

int main()
{
    signal(SIGSEGV, segfault_handler);

    int* p = 0;
    *p = 100;

    return 1;
}
It seems the segfault_handler is called infinitely and keeps printing:
segfault caught!!!
segfault caught!!!
segfault caught!!!
...
I am new to Mac development; do you have any idea what is happening?
This is because after your signal handler executes, the EIP points back to the instruction that caused the SIGSEGV, so it executes again and SIGSEGV is raised again.
Usually, ignoring SIGSEGV the way you do is meaningless anyway. Suppose the faulting instruction actually reads some value through a pointer into a register: what would you do? You don't have any 'correct' value to put in the register, so the following code will likely SIGSEGV again or, worse, trigger some logic error.
You should either exit the process when SIGSEGV happens, or return to a known safe point; longjmp should work, if you know that this is indeed a safe point (the only possible example that comes to mind is VM interpreters/JITs).
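A minimal sketch of that longjmp escape, using the sigsetjmp/siglongjmp variants so the signal mask is restored on the jump (the safe_point name is illustrative):
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf safe_point;

static void segfault_handler(int signum)
{
    siglongjmp(safe_point, 1);   /* jump back instead of returning */
}

int main(void)
{
    signal(SIGSEGV, segfault_handler);

    if (sigsetjmp(safe_point, 1) == 0) {
        int *p = 0;
        *p = 100;                /* faults; the handler jumps out */
    } else {
        printf("recovered from segfault\n");
    }
    return 0;
}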
Have you tried returning 0 instead of 1 in your program? Traditionally, values other than 0 indicate error. Also, does removing the two lines dealing with *p resolve it?
