Slow performance with injected library OS X DYLD_INSERT_LIBRARIES

Slow performance with injected library OS X DYLD_INSERT_LIBRARIES - macos

I wrote a small library that overrides the system library function gettimeofday() with a slightly different one that returns the time minus a hardcoded offset.
I inject that with DYLD_INSERT_LIBRARIES into the application (an old 3D tool that crashes on dates > 2014). However with the library injected the performance is roughly half compared to without it.
I don't really know a lot about how OSX and dynamic libraries work, I just got that working after reading a lot of tutorials and trial and error.
I can't believe the performance drop is caused by just adding that tiny code change but I read it might have something to do with namespace flattening, which I don't understand.
Is there anything I can do to get rid of the performance overhead?
#include <sys/time.h>
//compile:
//clang -c interpose.c
//clang -dynamiclib -o interpose.dylib -install_name interpose.dylib interpose.o
#define DYLD_INTERPOSE(_replacment,_replacee) \
__attribute__((used)) static struct{ const void* replacment; const void* replacee; } _interpose_##_replacee \
__attribute__ ((section ("__DATA,__interpose"))) = { (const void*)(unsigned long)&_replacment, (const void*)(unsigned long) &_replacee };
int mygettimeofday(struct timeval *restrict tp, void *restrict tzp);
DYLD_INTERPOSE(mygettimeofday, gettimeofday)
int mygettimeofday(struct timeval *restrict tp, void *restrict tzp)
{
int ret = gettimeofday(tp, tzp);
tp->tv_sec -= 123123;
return ret;
}
Edit:
I also wrote a tiny test tool that does nothing but call gettimeofday() 100000000 times, and ran it with and without the lib. Without it consistently takes 2.9 seconds, with library injected it takes 3.0 seconds, so the extra code in mygettimeofday should not be the issue.

Related

What is causing this error: SSE register return with SSE disabled?

I'm new to kernel development, and I need to write a Linux kernel module that performs several matrix multiplications (I'm working on an x64_64 platform). I'm trying to use fixed-point values for these operations, however during compilation, the compiler encounters this error:
error: SSE register return with SSE disabled
I don't know that much about SSE or this issue in particular, but from what i've found and according to most answers to questions about this problem, it is related to the usage of Floating-Point (FP) arithmetic in kernel space, which seems to be rarely a good idea (hence the utilization of Fixed-Point arithmetics). This error seems weird to me because I'm pretty sure I'm not using any FP values or operations, however it keeps popping up and in some ways that seem weird to me. For instance, I have this block of code:
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
const int scale = 16;
#define DOUBLE_TO_FIXED(x) ((x) * (1 << scale))
#define FIXED_TO_DOUBLE(x) ((x) / (1 << scale))
#define MULT(x, y) ((((x) >> 8) * ((y) >> 8)) >> 0)
#define DIV(x, y) (((x) << 8) / (y) << 8)
#define OUTPUT_ROWS 6
#define OUTPUT_COLUMNS 2
struct matrix {
int rows;
int cols;
double *data;
};
double outputlayer_weights[OUTPUT_ROWS * OUTPUT_COLUMNS] =
{
0.7977986, -0.77172316,
-0.43078753, 0.67738613,
-1.04312621, 1.0552227 ,
-0.32619684, 0.14119884,
-0.72325027, 0.64673559,
0.58467862, -0.06229197
};
...
void matmul (struct matrix *A, struct matrix *B, struct matrix *C) {
int i, j, k, a, b, sum, fixed_prod;
if (A->cols != B->rows) {
return;
}
for (i = 0; i < A->rows; i++) {
for (j = 0; j < B->cols; j++) {
sum = 0;
for (k = 0; k < A->cols; k++) {
a = DOUBLE_TO_FIXED(A->data[i * A->rows + k]);
b = DOUBLE_TO_FIXED(B->data[k * B->rows + j]);
fixed_prod = MULT(a, b);
sum += fixed_prod;
}
/* Commented the following line, causes error */
//C->data[i * C->rows + j] = sum;
}
}
}
...
static int __init insert_matmul_init (void)
{
printk(KERN_INFO "INSERTING MATMUL");
return 0;
}
static void __exit insert_matmul_exit (void)
{
printk(KERN_INFO "REMOVING MATMUL");
}
module_init (insert_matmul_init);
module_exit (insert_matmul_exit);
which compiles with no errors (I left out code that I found irrelevant to the problem). I have made sure to comment any error-prone lines to get to a point where the program can be compiled with no errors, and I am trying to solve each of them one by one. However, when uncommenting this line:
C->data[i * C->rows + j] = sum;
I get this error message in a previous (unmodified) line of code:
error: SSE register return with SSE disabled
sum += fixed_prod;
~~~~^~~~~~~~~~~~~
From what I understand, there are no FP operations taking place, at least in this section, so I need help figuring out what might be causing this error. Maybe my fixed-point implementation is flawed (I'm no expert in that matter either), or maybe I'm missing something obvious. Just in case, I have tested the same logic in a user-space program (using Floating-Point values) and it seems to work fine. In either case, any help in solving this issue would be appreciated. Thanks in advance!
Edit: I have included the definition of matrix and an example matrix. I have been using the default kbuild command for building external modules, here is what my Makefile looks like:
obj-m = matrix_mult.o
KVERSION = $(shell uname -r)
all:
make -C /lib/modules/$(KVERSION)/build M=$(PWD) modules

Linux compiles kernel code with -mgeneral-regs-only on x86, which produces this error in functions that do anything with FP or SIMD. (Except via inline asm, because then the compiler doesn't see the FP instructions, only the assembler does.)
From what I understand, there are no FP operations taking place, at least in this section, so I need help figuring out what might be causing this error.
GCC optimizes whole functions when optimization is enabled, and you are using FP inside that function. You're doing FP multiply and truncating conversion to integer with your macro and assigning the result to an int, since the MCVE you eventually provided shows struct matrix containing double *data.
If you stop the compiler from using FP instructions (like Linux does by building with -mgeneral-regs-only), it refuses to compile your file instead of doing software floating-point.
The only odd thing is that it pins down the error to an integer += instead of one of the statements that compiles to a mulsd and cvttsd2si
If you disable optimization (-O0 -mgeneral-regs-only) you get a more obvious location for the same error (https://godbolt.org/z/Tv5nG6nd4):
<source>: In function 'void matmul(matrix*, matrix*, matrix*)':
<source>:9:33: error: SSE register return with SSE disabled
9 | #define DOUBLE_TO_FIXED(x) ((x) * (1 << scale))
| ~~~~~^~~~~~~~~~~~~~~
<source>:46:21: note: in expansion of macro 'DOUBLE_TO_FIXED'
46 | a = DOUBLE_TO_FIXED(A->data[i * A->rows + k]);
| ^~~~~~~~~~~~~~~
If you really want to know what's going on with the GCC internals, you could dig into it with -fdump-tree-... options, e.g. on the Godbolt compiler explorer there's a dropdown for GCC Tree / RTL output that would let you look at the GIMPLE or RTL internal representation of your function's logic after various analyzer passes.
But if you just want to know whether there's a way to make this function work, no obviously not, unless you compile a file without -mgeneral-registers-only. All functions in a file compiled that way must only be called by callers that have used kernel_fpu_begin() before the call. (and kernel_fpu_end after).
You can't safely use kernel_fpu_begin inside a function compiled to allow it to use SSE / x87 registers; it might already have corrupted user-space FPU state before calling the function, after optimization. The symptom of getting this wrong is not a fault, it's corrupting user-space state, so don't assume that happens to work = correct. Also, depending on how GCC optimizes, the code-gen might be fine with your version, but might be broken with earlier or later GCC or clang versions. I somewhat expect that kernel_fpu_begin() at the top of this function would get called before the compiler did anything with FP instructions, but that doesn't mean it would be safe and correct.
See also Generate and optimize FP / SIMD code in the Linux Kernel on files which contains kernel_fpu_begin()?
Apparently -msse2 overrides -mgeneral-regs-only, so that's probably just an alias for -mno-mmx -mno-sse and whatever options disables x87. So you might be able to use __attribute__((target("sse2"))) on a function without changing build options for it, but that would be x86-specific. Of course, so is -mgeneral-regs-only. And there isn't a -mno-general-regs-only option to override the kernel's normal CFLAGS.
I don't have a specific suggestion for the best way to set up a build option if you really do think it's worth using kernel_fpu_begin at all, here (rather than using fixed-point the whole way through).
Obviously if you do save/restore the FPU state, you might as well use it for the loop instead of using FP to convert to fixed-point and back.

Cannot create basic_vectorstream<std::vector<char>> in Boost 1.55+

Moving from Boost 1.54 to 1.55 I now get this error during compilation (VS2010):
void GzipDecompression::Decompress(const unsigned char * src, int length)
{
if(src)
{
// Create an input-stream source for the data buffer so we can used the boost filtering buffer
std::ifstream inputstream;
typedef boost::iostreams::basic_array_source<char> Device;
boost::iostreams::stream_buffer<Device> buffer((char *)src, length);
// Inflate using the GZIP filter
filtering_streambuf<input> in;
in.push(gzip_decompressor());
in.push(buffer);
// Get the result into a vector
boost::interprocess::basic_vectorstream<std::vector<char>> vectorStream;
copy(in, vectorStream);
std::vector<char> output(vectorStream.vector());
}
}
error C2243: 'type cast' : conversion from 'boost::interprocess::basic_vectorstream<CharVector> *' to 'volatile const std::basic_streambuf<_Elem,_Traits> *' exists, but is inaccessible c:\boost\boost_1_55_0\boost\iostreams\traits.hpp 57 1
It appears this is now failing:
boost::interprocess::basic_vectorstream<std::vector<char>> vectorStream;
What has changed so this doesn't compile?
Update After Reply: I've tried changing the output to this:
std::istream instream(&in);
typedef std::istream_iterator<char> istream_iterator;
typedef std::ostream_iterator<char> ostream_iterator;
std::vector<char> output;
std::copy(istream_iterator(instream), istream_iterator(), std::back_inserter(output));
But the output is different. Do I have to flush the stream or something?
Update2: Apparently the istream_iterator strips CR LF etc. Here is my working function
void GzipDecompression::Decompress(const unsigned char * src, int length)
{
if(src)
{
// Create an input-stream source for the data buffer so we can used the boost filtering buffer
std::ifstream inputstream;
typedef boost::iostreams::basic_array_source<char> Device;
boost::iostreams::stream_buffer<Device> buffer((char *)src, length);
// Inflate using the GZIP filter
filtering_streambuf<input> in;
in.push(gzip_decompressor());
in.push(buffer);
// Get the result into a vector
std::istream instream(&in);
typedef std::istreambuf_iterator<char> istreambuf_iterator;
std::vector<char> output;
std::copy(istreambuf_iterator(instream), istreambuf_iterator(), std::back_inserter(output));
}
}
Thanks

Of course the error doesn't emerge from the line you mention. Instead, it's generated deep in the template instantiations for copy_impl. The problem seems to be that Boost Iostreams tries to be smart about detecting when people use "raw buffers" as devices/streams.
The problem with that is that the Interprocess stream implementation (privately) inherits its buffer class and as such, this confuses the detection because a conversion to base seems to be available but not accessible.
This can be reproduced in GCC as well as VS2013 update 4 and all using Boost 1_58_0 as well. As such it is an error that can be reported to the Boost developers. I'd suggest it is a weakness in the Boost Interprocess implementation, although Boost Iostreams devs might be interested in making their overload selection more robust in the presence of private base classes...
In the mean time, consider using a simple boost::iostreams::array_sink or boost::iostreams::back_insert_device (IIRC) which is pretty much guaranteed to play well with Boost Iostreams, and achieves the same goals.

Dereferencing void* warnings on Xcode

I'm aware of this SO question and this SO question. The element
of novelty in this one is in its focus on Xcode, and in its use of
square brackets to dereference a pointer to void.
The following program compiles with no warning in Xcode 4.5.2, compiles
with a warning on GCC 4.2 and, even though I don't have Visual Studio
right now, I remember that it would consider this a compiler
error, and MSDN and Internet agree.
#include <stdio.h>
int main(int argc, const char * argv[])
{
int x = 24;
void *xPtr = &x;
int *xPtr2 = (int *)&xPtr[1];
printf("%p %p\n", xPtr, xPtr2);
}
If I change the third line of the body of main to:
int *xPtr2 = (int *)(xPtr + 1);
It compiles with no warnings on both GCC and Xcode.
I would like to know how can I turn this silence into warnings or errors, on
GDB and especially Xcode/LLVM, including the fact that function main is int but
does not explicitly return any value (By the way I think -Wall does
the trick on GDB).

that isnt wrong at all...
the compiler doesnt know how big the pointer is ... a void[] ~~ void*
thats why char* used as strings need to be \0-terminated
you cannot turn on a warning for that as it isnt possible to determine a 'size of memory pointer to by a pointer' at compile time
void *v = nil;
*v[1] = 0 //invalid
void *v = malloc(sizeof(int)*2);
*v[1] = 0 //valid
*note typed inline on SO -- sorry for any non-working code

Construct a 'long long'

How do you construct a long long in gcc, similar to constructing an int via int()
The following fails in gcc (4.6.3 20120306) (but passes on MSVC for example).
myFunctionCall(someValue, long long());
with error expected primary-expression before 'long' (the column position indicates the first long is the location).
A simple change
myFunctionCall(someValue, (long long)int());
works fine - that is construct an int and cast to long long - indicating that gcc doesn't like the long long ctor.
Summary Solution
To summarize the brilliant explanation below from #birryree:
many compilers don't support long long() and it may not be standards compliant
constructing long long is equivalent to the literal 0LL, so use myFunctionCall(someValue, 0LL)
alternatively use a typedef long_long_t long long then long_long_t()
lastly, consider using uint64_t if you are after a type that is exactly 64 bits on any platform, rather than a type that is at least 64 bits, but may vary on different platforms.

I wanted a definitive answer on what the expected behavior was, so I posted a question on comp.lang.c++.moderated and got some great answers in return. So a thank you goes out to Johannes Schaub, Alf P. Steinbach (both from SO), and Francis Glassborrow for some information
This is not a bug in GCC - in fact it will break across multiple compilers - GCC 4.6, GCC 4.7, and Clang complain about similar errors like primary expression expected before '(' if you try this syntax:
long long x = long long();
Some primitives have spaces, and that is not allowed if you want to use the constructor-style initialization because of binding (long() is bound, but long long() has a free long). Types with spaces in them (like long long) can not use the type()-construction form.
MSVC is more permissive here, though technically non-standard compliant (and it's not a language extension that you can disable).
Solutions/Workarounds
There are alternatives for what you want to do:
Use 0LL as your value in place of attempting long long() - they would produce the same value.
This is how most code will be written too, so it will be most understandable to anyone else reading your code.
From your comments it seems like you really want long long, so you can typedef yourself to always guarantee you have a long long type, like this:
int main() {
typedef long long MyLongLong;
long long x = MyLongLong(); // or MyLongLong x = MyLongLong();
}
Use a template to get around needing explicit naming:
template<typename TypeT>
struct Type { typedef TypeT T(); };
// call it like this:
long long ll = Type<long long>::T();
As I mentioned in my comments, you can use an aliased type, like int64_t (from <cstdint>), which across common platforms is a typedef long long int64_t. This is a more platform dependent than the previous items in this list.
int64_t is a fixed-width type that is 64-bits, which is typically how wide long long is on platforms like linux-x86 and windows-x86. long long is at least 64-bit wide, but can be longer. If your code will only run on certain platforms, or if you really need a fixed-width type, this might be a viable choice.
C++11 Solutions
Thanks to the C++ newsgroup, I learned some additional ways of doing what you want to do, but unfortunately they're only in the realm of C++11 (and MSVC10 doesn't support either, and only very new compilers either way would):
The {} way:
long long ll{}; // does the zero initialization
Using what Johannes refers to as the 'bord tools' in C++11 with std::common_type<T>
#include <type_traits>
int main() {
long long ll = std::common_type<long long>::type();
}
So is there a real difference between () and initializing with 0 for POD types?
You say this in a comment:
I don't think default ctor returns zero always - more typical behaviour is to leave memory untouched.
Well, for primitive types, that is not true at all.
From Section 8.5 of the ISO C++ Standard/2003 (don't have 2011, sorry, but this information didn't change too much):
To default-initialize an object of type T means:
— if T is a non-POD class type (clause 9), the default constructor for T is called (and the initialization is ill-formed if T has no accessible default
constructor);
— if T is an array type, each element is
default-initialized;
— otherwise, the object is zero-initialized.
The last clause is most important here because long long, unsigned long, int, float, etc. are all scalar/POD types, and so calling things like this:
int x = int();
Is exactly the same as doing this:
int x = 0;
Generated code example
Here is a more concrete example of what actually happens in code:
#include <iostream>
template<typename T>
void create_and_print() {
T y = T();
std::cout << y << std::endl;
}
int main() {
create_and_print<unsigned long long>();
typedef long long mll;
long long y = mll();
long long z = 0LL;
int mi = int();
}
Compile this with:
g++ -fdump-tree-original construction.cxx
And I get this in the generated tree dump:
;; Function int main() (null)
;; enabled by -tree-original
{
typedef mll mll;
long long int y = 0;
long long int z = 0;
int mi = 0;
<<cleanup_point <<< Unknown tree: expr_stmt
create_and_print<long long unsigned int> () >>>>>;
<<cleanup_point long long int y = 0;>>;
<<cleanup_point long long int z = 0;>>;
<<cleanup_point int mi = 0;>>;
}
return <retval> = 0;
;; Function void create_and_print() [with T = long long unsigned int] (null)
;; enabled by -tree-original
{
long long unsigned int y = 0;
<<cleanup_point long long unsigned int y = 0;>>;
<<cleanup_point <<< Unknown tree: expr_stmt
(void) std::basic_ostream<char>::operator<< ((struct __ostream_type *) std::basic_ostream<char>::operator<< (&cout, y), endl) >>>>>;
}
Generated Code Implications
So from the code tree generated above, notice that all my variables are just being initialized with 0, even if I use constructor-style default initialization, like with int mi = int(). GCC will generate code that just does int mi = 0.
My template function that just attempts to do default construction of some passed in typename T, where T = unsigned long long, also produced just a 0-initialization code.
Conclusion
So in conclusion, if you want to default construct primitive types/PODs, it's like using 0.

OpenSSL and multi-threads

I've been reading about the requirement that if OpenSSL is used in a multi-threaded application, you have to register a thread identification function (and also a mutex creation function) with OpenSSL.
On Linux, according to the example provided by OpenSSL, a thread is normally identified by registering a function like this:
static unsigned long id_function(void){
return (unsigned long)pthread_self();
}
pthread_self() returns a pthread_t, and this works on Linux since pthread_t is just a typedef of unsigned long.
On Windows pthreads, FreeBSD, and other operating systems, pthread_t is a struct, with the following structure:
struct {
void * p; /* Pointer to actual object */
unsigned int x; /* Extra information - reuse count etc */
}
This can't be simply cast to an unsigned long, and when I try to do so, it throws a compile error. I tried taking the void *p and casting that to an unsigned long, on the theory that the memory pointer should be consistent and unique across threads, but this just causes my program to crash a lot.
What can I register with OpenSSL as the thread identification function when using Windows pthreads or FreeBSD or any of the other operating systems like this?
Also, as an additional question:
Does anyone know if this also needs to be done if OpenSSL is compiled into and used with QT, and if so how to register QThreads with OpenSSL? Surprisingly, I can't seem to find the answer in QT's documentation.

I will just put this code here. It is not panacea, as it doesn't deal with FreeBSD, but it is helpful in most cases when all you need is to support Windows and and say Debian. Of course, the clean solution assumes usage of CRYPTO_THREADID_* family introduced recently. (to give an idea, it has a CRYPTO_THREADID_cmp callback, which can be mapped to pthread_equal)
#include <pthread.h>
#include <openssl/err.h>
#if defined(WIN32)
#define MUTEX_TYPE HANDLE
#define MUTEX_SETUP(x) (x) = CreateMutex(NULL, FALSE, NULL)
#define MUTEX_CLEANUP(x) CloseHandle(x)
#define MUTEX_LOCK(x) WaitForSingleObject((x), INFINITE)
#define MUTEX_UNLOCK(x) ReleaseMutex(x)
#define THREAD_ID GetCurrentThreadId()
#else
#define MUTEX_TYPE pthread_mutex_t
#define MUTEX_SETUP(x) pthread_mutex_init(&(x), NULL)
#define MUTEX_CLEANUP(x) pthread_mutex_destroy(&(x))
#define MUTEX_LOCK(x) pthread_mutex_lock(&(x))
#define MUTEX_UNLOCK(x) pthread_mutex_unlock(&(x))
#define THREAD_ID pthread_self()
#endif
/* This array will store all of the mutexes available to OpenSSL. */
static MUTEX_TYPE *mutex_buf=NULL;
static void locking_function(int mode, int n, const char * file, int line)
{
if (mode & CRYPTO_LOCK)
MUTEX_LOCK(mutex_buf[n]);
else
MUTEX_UNLOCK(mutex_buf[n]);
}
static unsigned long id_function(void)
{
return ((unsigned long)THREAD_ID);
}
int thread_setup(void)
{
int i;
mutex_buf = malloc(CRYPTO_num_locks() * sizeof(MUTEX_TYPE));
if (!mutex_buf)
return 0;
for (i = 0; i < CRYPTO_num_locks( ); i++)
MUTEX_SETUP(mutex_buf[i]);
CRYPTO_set_id_callback(id_function);
CRYPTO_set_locking_callback(locking_function);
return 1;
}
int thread_cleanup(void)
{
int i;
if (!mutex_buf)
return 0;
CRYPTO_set_id_callback(NULL);
CRYPTO_set_locking_callback(NULL);
for (i = 0; i < CRYPTO_num_locks( ); i++)
MUTEX_CLEANUP(mutex_buf[i]);
free(mutex_buf);
mutex_buf = NULL;
return 1;
}

I only can answer the Qt part. Use QThread::currentThreadId(), or even QThread::currentThread() as the pointer value should be unique.

From the OpenSSL doc you linked:
threadid_func(CRYPTO_THREADID *id) is needed to record the currently-executing thread's identifier into id. The implementation of this callback should not fill in id directly, but should use CRYPTO_THREADID_set_numeric() if thread IDs are numeric, or CRYPTO_THREADID_set_pointer() if they are pointer-based. If the application does not register such a callback using CRYPTO_THREADID_set_callback(), then a default implementation is used - on Windows and BeOS this uses the system's default thread identifying APIs, and on all other platforms it uses the address of errno. The latter is satisfactory for thread-safety if and only if the platform has a thread-local error number facility.
As shown providing your own ID is really only useful if you can provide a better ID than OpenSSL's default implementation.
The only fail-safe way to provide IDs, when you don't know whether pthread_t is a pointer or an integer, is to maintain your own per-thread IDs stored as a thread-local value.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Slow performance with injected library OS X DYLD_INSERT_LIBRARIES - macos

Related

What is causing this error: SSE register return with SSE disabled?

Cannot create basic_vectorstream<std::vector<char>> in Boost 1.55+

Dereferencing void* warnings on Xcode

Construct a 'long long'

OpenSSL and multi-threads

Categories

Resources