Some gcc/clang compiler optimizations allow reordering the execution of code in the assembly (e.g. for gcc: -freorder-blocks -freorder-blocks-and-partition -freorder-functions). Is it safe to use such optimizations when de-/serializing data structures in a specific order?
For instance:
#include <fstream>

void write(int* data, std::ofstream& outstream)
{
    outstream.write(reinterpret_cast<char*>(data), sizeof(int));
}

void read(int* data, std::ifstream& instream)
{
    instream.read(reinterpret_cast<char*>(data), sizeof(int));
}

void serialize()
{
    std::ofstream ofs("/somePath/to/some.file");
    int i = 1;
    int j = 2;
    int k = 3;
    write(&i, ofs);
    write(&j, ofs);
    write(&k, ofs);
    ofs.close();
}

void deserialize()
{
    std::ifstream ifs("/somePath/to/some.file");
    int i;
    int j;
    int k;
    read(&i, ifs);
    read(&j, ifs);
    read(&k, ifs);
    ifs.close();
}
-freorder-blocks, -freorder-blocks-and-partition, and -freorder-functions all influence the order in which code is laid out in your executable file:
-freorder-blocks allows gcc to rearrange the straight-line blocks of assembly code that combine to form your function to try and minimize the number of branch prediction misses that your CPU might make.
-freorder-functions takes the functions that the compiler deems unlikely to execute and moves them to a far-away part of your executable file. The goal is to pack as much frequently-executed code into as little memory as possible, so as to benefit from the instruction cache as much as possible.
-freorder-blocks-and-partition is like -freorder-functions, but at the assembly block level. If you have an if (unlikely(foo)) { x = 1; } branch in your code, then the compiler might choose to take the code representing x = 1; and move it away from the frequently executed code.
None of these will affect the control flow of your program. These optimizations preserve observable behavior: if you write i to a file and then j, that is still exactly what will be observed after optimization.
I'm new to kernel development, and I need to write a Linux kernel module that performs several matrix multiplications (I'm working on an x86_64 platform). I'm trying to use fixed-point values for these operations; however, during compilation, the compiler reports this error:
error: SSE register return with SSE disabled
I don't know that much about SSE or this issue in particular, but from what I've found, and according to most answers to questions about this problem, it is related to the use of floating-point (FP) arithmetic in kernel space, which is rarely a good idea (hence the use of fixed-point arithmetic). This error seems weird to me because I'm pretty sure I'm not using any FP values or operations, yet it keeps popping up in ways that seem strange to me. For instance, I have this block of code:
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
const int scale = 16;
#define DOUBLE_TO_FIXED(x) ((x) * (1 << scale))
#define FIXED_TO_DOUBLE(x) ((x) / (1 << scale))
#define MULT(x, y) ((((x) >> 8) * ((y) >> 8)) >> 0)
#define DIV(x, y) (((x) << 8) / (y) << 8)
#define OUTPUT_ROWS 6
#define OUTPUT_COLUMNS 2
struct matrix {
int rows;
int cols;
double *data;
};
double outputlayer_weights[OUTPUT_ROWS * OUTPUT_COLUMNS] =
{
0.7977986, -0.77172316,
-0.43078753, 0.67738613,
-1.04312621, 1.0552227 ,
-0.32619684, 0.14119884,
-0.72325027, 0.64673559,
0.58467862, -0.06229197
};
...
void matmul(struct matrix *A, struct matrix *B, struct matrix *C) {
    int i, j, k, a, b, sum, fixed_prod;

    if (A->cols != B->rows) {
        return;
    }

    for (i = 0; i < A->rows; i++) {
        for (j = 0; j < B->cols; j++) {
            sum = 0;
            for (k = 0; k < A->cols; k++) {
                a = DOUBLE_TO_FIXED(A->data[i * A->rows + k]);
                b = DOUBLE_TO_FIXED(B->data[k * B->rows + j]);
                fixed_prod = MULT(a, b);
                sum += fixed_prod;
            }
            /* Commented the following line, causes error */
            //C->data[i * C->rows + j] = sum;
        }
    }
}
...
static int __init insert_matmul_init(void)
{
    printk(KERN_INFO "INSERTING MATMUL");
    return 0;
}

static void __exit insert_matmul_exit(void)
{
    printk(KERN_INFO "REMOVING MATMUL");
}

module_init(insert_matmul_init);
module_exit(insert_matmul_exit);
which compiles with no errors (I left out code that I found irrelevant to the problem). I have made sure to comment out any error-prone lines to get to a point where the program compiles cleanly, and I am trying to solve the errors one by one. However, when I uncomment this line:
C->data[i * C->rows + j] = sum;
I get this error message in a previous (unmodified) line of code:
error: SSE register return with SSE disabled
sum += fixed_prod;
~~~~^~~~~~~~~~~~~
From what I understand, there are no FP operations taking place, at least in this section, so I need help figuring out what might be causing this error. Maybe my fixed-point implementation is flawed (I'm no expert in that matter either), or maybe I'm missing something obvious. Just in case, I have tested the same logic in a user-space program (using floating-point values) and it works fine. In any case, any help in solving this issue would be appreciated. Thanks in advance!
Edit: I have included the definition of matrix and an example matrix. I have been using the default kbuild command for building external modules; here is what my Makefile looks like:
obj-m = matrix_mult.o
KVERSION = $(shell uname -r)
all:
make -C /lib/modules/$(KVERSION)/build M=$(PWD) modules
Linux compiles kernel code with -mgeneral-regs-only on x86, which produces this error in functions that do anything with FP or SIMD. (Except via inline asm, because then the compiler doesn't see the FP instructions, only the assembler does.)
From what I understand, there are no FP operations taking place, at least in this section, so I need help figuring out what might be causing this error.
GCC optimizes whole functions when optimization is enabled, and you are using FP inside that function. Your macro performs an FP multiply and a truncating conversion to integer when its result is assigned to an int, because the MCVE you eventually provided shows struct matrix containing double *data.
If you stop the compiler from using FP instructions (like Linux does by building with -mgeneral-regs-only), it refuses to compile your file instead of doing software floating-point.
The only odd thing is that it pins down the error to an integer += instead of one of the statements that compiles to a mulsd and cvttsd2si.
If you disable optimization (-O0 -mgeneral-regs-only) you get a more obvious location for the same error (https://godbolt.org/z/Tv5nG6nd4):
<source>: In function 'void matmul(matrix*, matrix*, matrix*)':
<source>:9:33: error: SSE register return with SSE disabled
9 | #define DOUBLE_TO_FIXED(x) ((x) * (1 << scale))
| ~~~~~^~~~~~~~~~~~~~~
<source>:46:21: note: in expansion of macro 'DOUBLE_TO_FIXED'
46 | a = DOUBLE_TO_FIXED(A->data[i * A->rows + k]);
| ^~~~~~~~~~~~~~~
If you really want to know what's going on with the GCC internals, you could dig into it with -fdump-tree-... options, e.g. on the Godbolt compiler explorer there's a dropdown for GCC Tree / RTL output that would let you look at the GIMPLE or RTL internal representation of your function's logic after various analyzer passes.
But if you just want to know whether there's a way to make this function work: no, obviously not, unless you compile a file without -mgeneral-regs-only. All functions in a file compiled that way must only be called by callers that have used kernel_fpu_begin() before the call (and kernel_fpu_end() after).
You can't safely use kernel_fpu_begin inside a function compiled to allow it to use SSE / x87 registers; after optimization, it might already have corrupted user-space FPU state before that call. The symptom of getting this wrong is not a fault, it's corrupting user-space state, so don't assume that "happens to work" means "correct". Also, depending on how GCC optimizes, the code-gen might be fine with your version but broken with earlier or later GCC or clang versions. I somewhat expect that kernel_fpu_begin() at the top of this function would get called before the compiler emitted any FP instructions, but that doesn't mean it would be safe and correct.
See also "Generate and optimize FP / SIMD code in the Linux Kernel on files which contains kernel_fpu_begin()?"
Apparently -msse2 overrides -mgeneral-regs-only, so the latter is probably just an alias for -mno-mmx -mno-sse and whatever option disables x87. So you might be able to use __attribute__((target("sse2"))) on a function without changing build options for it, but that would be x86-specific. Of course, so is -mgeneral-regs-only. And there isn't a -mno-general-regs-only option to override the kernel's normal CFLAGS.
I don't have a specific suggestion for the best way to set up a build option if you really do think it's worth using kernel_fpu_begin at all, here (rather than using fixed-point the whole way through).
Obviously if you do save/restore the FPU state, you might as well use it for the loop instead of using FP to convert to fixed-point and back.
When trying to get loops to auto-vectorise, I've seen code like this written:
void addFP(int N, float *in1, float *in2, float * restrict out)
{
    for (int i = 0; i < N; i++)
    {
        out[i] = in1[i] + in2[i];
    }
}
Here the restrict keyword is needed to reassure the compiler about pointer aliasing, so that it can vectorise the loop.
Would something like this do the same thing?
void addFP(int N, float *in1, float *in2, std::unique_ptr<float> out)
{
    for (int i = 0; i < N; i++)
    {
        out[i] = in1[i] + in2[i];
    }
}
If this does work, which is the more portable choice?
tl;dr Can std::unique_ptr be used to replace the restrict keyword in a loop you're trying to auto-vectorise?
restrict is not part of C++11; it is part of C99.
std::unique_ptr<T> foo; tells your compiler: this pointer is the sole owner of the memory; once this scope ends, free the memory.
restrict tells your compiler: I know that you can't know or prove this, but I pinky swear that this is the only reference to this chunk of memory that occurs in this function.
unique_ptr doesn't stop aliases, nor should the compiler assume they don't exist:
int* pointer = new int[3];
int* alias = pointer;
std::unique_ptr<int> alias2(pointer);
std::unique_ptr<int> alias3(pointer); //compiles, but crashes when deleting
So your first version is not valid C++11 (though it works on many modern compilers as an extension), and the second doesn't enable the optimization you are expecting. To still get that behavior, consider std::valarray.
I don't think so. Consider this code:
auto p = std::make_unique<float>(0.1f);
auto raw = p.get();
addFP(1, raw, raw, std::move(p));
I wrote a simple program to test memory synchronization. It uses a global queue shared between two processes, with each process bound to a different core. My code is below.
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <unistd.h>

void bindcpu(int pid) {
    int cpuid;
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (pid > 0) {
        cpuid = 1;
    } else {
        cpuid = 5;
    }
    CPU_SET(cpuid, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        printf("warning: could not set CPU affinity, continuing...\n");
    }
}
#define Q_LENGTH 512
int g_queue[Q_LENGTH];

struct point {
    int volatile w;
    int volatile r;
};
volatile struct point g_p;

void iwrite(int x) {
    while (g_p.r == g_p.w);
    sleep(0.1);
    g_queue[g_p.w] = x;
    g_p.w = (g_p.w + 1) % Q_LENGTH;
    printf("#%d!%d", g_p.w, g_p.r);
}

void iread(int *x) {
    while (((g_p.r + 1) % Q_LENGTH) == g_p.w);
    *x = g_queue[g_p.r];
    g_p.r = (g_p.r + 1) % Q_LENGTH;
    printf("-%d*%d", g_p.r, g_p.w);
}
int main(int argc, char * argv[]) {
    //int num = sysconf(_SC_NPROCESSORS_CONF);
    int pid;
    pid = fork();
    g_p.r = Q_LENGTH;
    bindcpu(pid);
    int i = 0, j = 0;
    if (pid > 0) {
        printf("call iwrite \0");
        while (1) {
            iread(&j);
        }
    } else {
        printf("call iread\0");
        while (1) {
            iwrite(i);
            i++;
        }
    }
}
The data shared between the two processes (pinned to two different cores) didn't synchronize.
CPU: Intel(R) Xeon(R) CPU E3-1230
OS: 3.8.0-35-generic #50~precise1-Ubuntu SMP
Beyond standard IPC, how can I synchronize data between different cores in user space?
If you want your application to manipulate the CPU's shared cache directly in order to accomplish IPC, I don't believe you will be able to do that.
Chapter 9 of "Linux Kernel Development, Second Edition" has information on synchronizing multi-threaded applications (including atomic operations, semaphores, barriers, etc.):
http://www.makelinux.net/books/lkd2/ch09
so you may get some ideas on what you are looking for there.
Here is a decent write-up for Intel® Smart Cache, "Software Techniques for Shared-Cache Multi-Core Systems": http://archive.is/hm0y
Here are some Stack Overflow questions/answers that may help you find the information you are looking for:
Storing C/C++ variables in processor cache instead of system memory
C++: Working with the CPU cache
Understanding how the CPU decides what gets loaded into cache memory
Sorry for bombarding you with links but this is the best I can do without a clearer understanding of what you are looking to accomplish.
I suggest reading "Volatile: Almost Useless for Multi-Threaded Programming" for why volatile should be removed from the example code. Instead, use C11 or C++11 atomic operations. See also the Fenced Data Transfer example in the TBB Design Patterns Manual.
Below I show the parts of the question example that I changed to use C++11 atomics. I compiled it with g++ 4.7.2.
#include <atomic>
...
struct point {
    std::atomic<int> w;
    std::atomic<int> r;
};

struct point g_p;

void iwrite(int x) {
    int w = g_p.w.load(std::memory_order_relaxed);
    int r;
    while ((r = g_p.r.load(std::memory_order_acquire)) == w);
    sleep(0.1);
    g_queue[w] = x;
    w = (w + 1) % Q_LENGTH;
    g_p.w.store(w, std::memory_order_release);
    printf("#%d!%d", w, r);
}

void iread(int *x) {
    int r = g_p.r.load(std::memory_order_relaxed);
    int w;
    while (((r + 1) % Q_LENGTH) == (w = g_p.w.load(std::memory_order_acquire)));
    *x = g_queue[r];
    g_p.r.store((r + 1) % Q_LENGTH, std::memory_order_release);
    printf("-%d*%d", r, w);
}
The key changes are:
I removed "volatile" everywhere.
The members of struct point are declared as std::atomic
Some loads and stores of g_p.r and g_p.w are fenced. Others are hoisted.
When loading a variable modified by another thread, the code "snapshots" it into a local variable.
The code uses "relaxed load" (no fence) where a thread loads a variable that no other thread modifies. I hoisted those loads out of the spin loops since there is no point in repeating them.
The code uses an "acquiring load" where a thread loads a "message is ready" indicator that is set by another thread, and uses a "releasing store" where it is storing a "message is ready" indicator to be read by another thread. The release is necessary to ensure that the "message" (queue data) is written before the "ready" indicator (member of g_p) is written. The acquire is likewise necessary to ensure that the "message" is read after the "ready" indicator is seen.
The snapshots are used so that the printf reports the value that the thread actually used, as opposed to some new value that appeared later. In general I like to use the snapshot style for two reasons. First, touching shared memory can be expensive because it often requires cache-line transfers. Second, the style gives me a stable value to use locally without having to worry that a reread might return a different value.
AMD OpenCL Programming Guide, Section 6.3 Constant Memory Optimization:
Globally scoped constant arrays. These arrays are initialized, globally scoped, and in the constant address space (as specified in section 6.5.3 of the OpenCL specification). If the size of an array is below 64 kB, it is placed in hardware constant buffers; otherwise, it uses global memory. An example of this is a lookup table for math functions.
I want to use this "globally scoped constant array". I have such code in pure C
#define SIZE 101
int *reciprocal_table;

int reciprocal(int number) {
    return reciprocal_table[number];
}

void kernel(int *output)
{
    for (int i = 0; i < SIZE; i++)
        output[i] = reciprocal(i);
}
I want to port it into OpenCL
__kernel void kernel(__global int *output){
    int gid = get_global_id(0);
    output[gid] = reciprocal(gid);
}

int reciprocal(int number){
    return reciprocal_table[number];
}
What should I do with global variable reciprocal_table? If I try to add __global or __constant to it I get an error:
global variable must be declared in addrSpace constant
I don't want to pass __constant int *reciprocal_table from the kernel into reciprocal. Is it possible to initialize a global variable somehow? I know that I can write the values directly into the code, but is there another way?
P.S. I'm using AMD OpenCL
UPD The above code is just an example. My real code is much more complex, with a lot of functions. So I want to declare the array at program scope and use it in all functions.
UPD2 Changed example code and added citation from Programming Guide
#define SIZE 2
int constant array[SIZE] = {0, 1};

kernel void
foo (global int* input,
     global int* output)
{
    const uint id = get_global_id (0);
    output[id] = input[id] + array[id];
}
I can get the above to compile with Intel as well as AMD. It also compiles without the initialization of the array, but then you would not know what's in the array, and since it's in the constant address space, you could not assign any values to it afterwards.
Program global variables have to be in the __constant address space, as stated by section 6.5.3 in the standard.
UPDATE Now that I fully understand the question:
One thing that worked for me is to define the array in the constant address space and then overwrite it by passing a constant int* kernel parameter of the same name.
That produced correct results only on the GPU device. The AMD CPU device and the Intel CPU device did not overwrite the array's address. It is also probably not compliant with the standard.
Here's how it looks:
#define SIZE 2
int constant foo[SIZE] = {100, 100};

int
baz (int i)
{
    return foo[i];
}

kernel void
bar (global int* input,
     global int* output,
     constant int* foo)
{
    const uint id = get_global_id (0);
    output[id] = input[id] + baz (id);
}
For input = {2, 3} and foo = {0, 1} this produces {2, 4} on my HD 7850 device (Ubuntu 12.10, Catalyst 9.0.2). But on the CPU I get {102, 103} with either OCL implementation (AMD, Intel). So I cannot stress enough how much I personally would NOT do this, because it's only a matter of time before this breaks.
Another way to achieve this would be to generate .h files on the host at runtime with the definition of the array (or predefine them) and pass them to the kernel upon compilation via a compiler option. This, of course, requires recompilation of the clProgram/clKernel for every different LUT.
I struggled to get this to work in my own program some time ago.
I did not find any way to initialize a constant or program-scope array from the host via clEnqueueWriteBuffer or the like. The only way is to write it explicitly in your .cl source file.
So my trick to initialize it from the host exploits the fact that you are actually compiling your source from the host, which also means you can alter your src.cl file before compiling it.
First my src.cl file reads:
__constant double lookup[SIZE] = { LOOKUP }; // precomputed table (in constant memory).

double func(int idx) {
    return lookup[idx];
}

__kernel void ker1(__global double *in, __global double *out)
{
    ... do something ...
    double t = func(i);
    ...
}
notice the lookup table is initialized with LOOKUP.
Then, in the host program, before compiling your OpenCL code:
compute the values of my lookup table in host_values[]
on your host, run something like:
char *buf = (char*) malloc(10000);
int count = sprintf(buf, "#define LOOKUP "); // actual source generation!
for (int i = 0; i < SIZE; i++)
    count += sprintf(buf + count, "%g, ", host_values[i]);
count += sprintf(buf + count, "\n");
then read the content of your source file src.cl and place it right at buf+count.
you now have a source file with an explicitly defined lookup table that you just computed from the host.
compile your buffer with something like clCreateProgramWithSource(context, 1, (const char **) &buf, &src_sz, err);
voilà!
It looks like "array" is a look-up table of sorts. You'll need to clCreateBuffer and clEnqueueWriteBuffer so the GPU has a copy of it to use.
As this is my first post to Stack Overflow, I want to thank you all for the valuable posts that have helped me a lot in the past.
I use MinGW (gcc 4.4.0) on Windows-7(64) - more specifically I use Nokia Qt + MinGW but Qt is not involved in my Question.
I need to find the address and -more important- the length of specific functions of my application at runtime, in order to encode/decode these functions and implement a software protection system.
I already found a solution for computing the length of a function: assuming that static functions placed one after another in a source file will also be laid out sequentially in the compiled object file, and subsequently in memory, the length is the difference between adjacent function addresses.
Unfortunately this is true only if the whole CPP file is compiled with option: "g++ -O0" (optimization level = 0).
If I compile it with "g++ -O2" (which is the default for my project) the compiler seems to relocate some of the functions and as a result the computed function length seems to be both incorrect and negative(!).
This is happening even if I put a "#pragma GCC optimize 0" line in the source file,
which is supposed to be the equivalent of a "g++ -O0" command line option.
I suppose that "g++ -O2" instructs the compiler to perform some global file-level optimization (some function relocation?) which is not avoided by using the #pragma directive.
Do you have any idea how to prevent this, without having to compile the whole file with -O0 option?
OR: Do you know of any other method to find the length of a function at runtime?
I prepared a small example for you, with the results under different compilation options, to highlight the issue.
The Source:
// ===================================================================
// test.cpp
//
// Intention: To find the addr and length of a function at runtime
// Problem: The application output is correct when compiled with: "g++ -O0"
// but it's erroneous when compiled with "g++ -O2"
// (although a directive "#pragma GCC optimize 0" is present)
// ===================================================================
#include <stdio.h>
#include <math.h>
#pragma GCC optimize 0
static int test_01(int p1)
{
    putchar('a');
    putchar('\n');
    return 1;
}

static int test_02(int p1)
{
    putchar('b');
    putchar('b');
    putchar('\n');
    return 2;
}

static int test_03(int p1)
{
    putchar('c');
    putchar('\n');
    return 3;
}

static int test_04(int p1)
{
    putchar('d');
    putchar('\n');
    return 4;
}

// Print a HexDump of a specific address and length
void HexDump(void *startAddr, long len)
{
    unsigned char *buf = (unsigned char *)startAddr;
    printf("addr:%ld, len:%ld\n", (long)startAddr, len);
    len = (long)fabs(len);
    while (len)
    {
        printf("%02x.", *buf);
        buf++;
        len--;
    }
    printf("\n");
}

int main(int argc, char *argv[])
{
    printf("======================\n");
    long fun_len = (long)test_02 - (long)test_01;
    HexDump((void *)test_01, fun_len);
    printf("======================\n");
    fun_len = (long)test_03 - (long)test_02;
    HexDump((void *)test_02, fun_len);
    printf("======================\n");
    fun_len = (long)test_04 - (long)test_03;
    HexDump((void *)test_03, fun_len);
    printf("Test End\n");
    getchar();
    // Just a trick to block the optimizer from eliminating the test_xx() functions as unused
    if (argc > 1)
    {
        test_01(1);
        test_02(2);
        test_03(3);
        test_04(4);
    }
}
The (correct) Output when compiled with "g++ -O0":
[note the 'c3' byte (= assembly 'ret') at the end of all functions]
======================
addr:4199344, len:37
55.89.e5.83.ec.18.c7.04.24.61.00.00.00.e8.4e.62.00.00.c7.04.24.0a.00.00.00.e8.42
.62.00.00.b8.01.00.00.00.c9.c3.
======================
addr:4199381, len:49
55.89.e5.83.ec.18.c7.04.24.62.00.00.00.e8.29.62.00.00.c7.04.24.62.00.00.00.e8.1d
.62.00.00.c7.04.24.0a.00.00.00.e8.11.62.00.00.b8.02.00.00.00.c9.c3.
======================
addr:4199430, len:37
55.89.e5.83.ec.18.c7.04.24.63.00.00.00.e8.f8.61.00.00.c7.04.24.0a.00.00.00.e8.ec
.61.00.00.b8.03.00.00.00.c9.c3.
Test End
The erroneous Output when compiled with "g++ -O2":
(a) function test_01's addr & len seem correct
(b) functions test_02 and test_03 have negative lengths,
and test_02's length is also incorrect.
======================
addr:4199416, len:36
83.ec.1c.c7.04.24.61.00.00.00.e8.c5.61.00.00.c7.04.24.0a.00.00.00.e8.b9.61.00.00
.b8.01.00.00.00.83.c4.1c.c3.
======================
addr:4199452, len:-72
83.ec.1c.c7.04.24.62.00.00.00.e8.a1.61.00.00.c7.04.24.62.00.00.00.e8.95.61.00.00
.c7.04.24.0a.00.00.00.e8.89.61.00.00.b8.02.00.00.00.83.c4.1c.c3.57.56.53.83.ec.2
0.8b.5c.24.34.8b.7c.24.30.89.5c.24.08.89.7c.24.04.c7.04.
======================
addr:4199380, len:-36
83.ec.1c.c7.04.24.63.00.00.00.e8.e9.61.00.00.c7.04.24.0a.00.00.00.e8.dd.61.00.00
.b8.03.00.00.00.83.c4.1c.c3.
Test End
This is happening even if I put a "#pragma GCC optimize 0" line in the source file, which is supposed to be the equivalent of a "g++ -O0" command line option.
I don't believe this is true: it is supposed to be the equivalent of attaching __attribute__((optimize(0))) to subsequently defined functions, which causes those functions to be compiled with a different optimisation level. But this does not affect what goes on at the top level, whereas the command line option does.
If you really must do horrible things that rely on top level ordering, try the -fno-toplevel-reorder option. And I suspect that it would be a good idea to add __attribute__((noinline)) to the functions in question as well.