What is the difference between cffi out-of-line API and ABI mode?

I'm currently learning about interfacing Python and C code (in this case with cffi). As I understand it, "out-of-line mode" means that the C code is compiled to a shared object at install time, and "in-line mode" means that this is done at import time.
I struggle to recognize ABI vs. API mode in cffi. Is the following MCVE an example of ABI or API mode?
MCVE
libfib.cpp
int fib(int n) {
    int a = 0, b = 1, tmp;
    if (n <= 1) {
        return n;
    }
    for (int i = 0; i < n - 1; i++) {
        tmp = a + b;
        a = b;
        b = tmp;
    }
    return b;
}
extern "C" {
    extern int cffi_fib(int n) {
        return fib(n);
    }
}
Compile with g++ -o ./libfib.so ./libfib.cpp -fPIC -shared
import cffi
ffi = cffi.FFI()
ffi.cdef("int cffi_fib(int n);")
C = ffi.dlopen("./libfib.so")
for i in range(10):
    print(f"{i}: {C.cffi_fib(i)}")

Related

Why can't clang and gcc optimize away this int-to-float conversion?

Consider the following code:
void foo(float* __restrict__ a)
{
    int i; float val;
    for (i = 0; i < 100; i++) {
        val = 2 * i;
        a[i] = val;
    }
}
void bar(float* __restrict__ a)
{
    int i; float val = 0.0;
    for (i = 0; i < 100; i++) {
        a[i] = val;
        val += 2.0;
    }
}
They're based on Examples 7.26a and 7.26b in Agner Fog's Optimizing software in C++ and should do the same thing; bar is more "efficient" as written in the sense that we don't do an integer-to-float conversion at every iteration, but rather a float addition which is cheaper (on x86_64).
Here are the clang and gcc results on these two functions (with no vectorization and unrolling).
Question: It seems to me that the optimization of replacing a multiplication by the loop index with an addition of a constant value - when this is beneficial - should be carried out by compilers, even if (or perhaps especially if) there's a type conversion involved. Why is this not happening for these two functions?
Note that if we use ints rather than floats:
void foo(int* __restrict__ a)
{
    int i; int val = 0;
    for (i = 0; i < 100; i++) {
        val = 2 * i;
        a[i] = val;
    }
}
void bar(int* __restrict__ a)
{
    int i; int val = 0;
    for (i = 0; i < 100; i++) {
        a[i] = val;
        val += 2;
    }
}
Both clang and gcc perform the expected optimization, albeit not quite in the same way (see this question).
What you are looking for is induction variable optimization for floating-point numbers. This optimization is generally unsafe in floating-point land because it changes program semantics. In your example it would work, because both the initial value (0.0) and the step (2.0) can be represented exactly in IEEE format, but this is a rare case in practice.
It could be enabled under -ffast-math, but it seems this wasn't considered an important case in GCC, as it rejects non-integral induction variables early on (see tree-scalar-evolution.c).
If you believe that this is an important use case, you might consider filing a request at GCC Bugzilla.
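To see why the transformation is unsafe in general, here is a minimal Python sketch. The function names are made up for illustration, and Python floats are 64-bit doubles rather than the 32-bit float used in the question, so this only illustrates the principle. It compares the multiply-per-iteration form (like foo) with the accumulate-a-step form (like bar). With a step of 2.0 the two agree, but with a step such as 0.1, which has no exact binary representation, the rounding errors of the repeated additions accumulate and the results diverge.

```python
def by_multiplication(n, step):
    """Compute i * step afresh in every iteration (the shape of foo)."""
    return [i * step for i in range(n)]


def by_accumulation(n, step):
    """Keep a running value and add step each iteration (the shape of bar)."""
    values = []
    val = 0.0
    for _ in range(n):
        values.append(val)
        val += step
    return values


# Every value 0.0, 2.0, ..., 198.0 is exactly representable,
# so both forms produce identical sequences.
print(by_multiplication(100, 2.0) == by_accumulation(100, 2.0))

# 0.1 is not exactly representable, so each addition rounds
# independently and the two sequences drift apart.
print(by_multiplication(100, 0.1) == by_accumulation(100, 0.1))
```

A compiler that rewrote the first form into the second would therefore change observable results unless it could prove that every intermediate value is exact.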

GCC 6.3.1 doesn't autovectorize without -ffinite-math-only

I would like to understand why GCC does not autovectorize the following loop unless I pass -ffinite-math-only. As I understand it, and according to the GCC manual, the optimization requires -funsafe-math-optimizations:
If the selected floating-point hardware includes the NEON extension (e.g. -mfpu=neon), note that floating-point operations are not generated by GCC's auto-vectorization pass unless -funsafe-math-optimizations is also specified. This is because NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision.
In particular, the flag allows the compiler to assume associative math, so that it can first accumulate into 4 partial sums. The code seems pretty straightforward:
template<typename SumType = double>
class UipLineResult {
public:
    SumType sqsum;
    SumType dcsum;
    float pkp;
    float pkn;
public:
    UipLineResult() {
        clear();
    }
    void clear() {
        sqsum = 0;
        dcsum = 0;
        pkp = -std::numeric_limits<float>::max();
        pkn = +std::numeric_limits<float>::max();
    }
};
Loop that is not vectorized
static void addSamplesLine(const float* ss, UipLineResult<>* line) {
    UipLineResult<float> intermediate;
    for(int idx = 0; idx < 120; idx++) {
        float s = ss[idx];
        intermediate.sqsum += s * s;
        intermediate.dcsum += s;
        intermediate.pkp = intermediate.pkp < s ? s : intermediate.pkp;
        intermediate.pkn = intermediate.pkn > s ? s : intermediate.pkn;
    }
    line->addIntermediate(&intermediate);
}
For example, the squared addition looks like this:
intermediate.sqsum += s * s;
107da: ee47 6aa7 vmla.f32 s13, s15, s15
With -ffinite-math-only this becomes
intermediate.sqsum += s * s;
1054c: ef40 6df0 vmla.f32 q11, q8, q8
Compiler flags
-funsafe-math-optimizations -ffinite-math-only -mcpu=cortex-a9 -mfpu=neon
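The remark about accumulating into 4 partial sums is the reason a vectorized reduction needs permission to reassociate: floating-point addition is not associative, so changing the order of the additions can change the result. The following is a minimal Python sketch of that effect, not of what GCC actually emits. The function names and the data are made up for illustration, and Python uses 64-bit doubles rather than 32-bit floats.

```python
def sequential_sum(values):
    """Add the values strictly left to right, as the scalar loop does."""
    total = 0.0
    for v in values:
        total += v
    return total


def four_lane_sum(values):
    """Accumulate into 4 independent partial sums, then combine them.

    This mimics how a 4-wide vectorized reduction reorders the additions.
    """
    lanes = [0.0, 0.0, 0.0, 0.0]
    for i, v in enumerate(values):
        lanes[i % 4] += v
    return lanes[0] + lanes[1] + lanes[2] + lanes[3]


# Made-up data chosen so that the order of the additions matters.
data = [1e16, 1.0, 1.0, 1.0, -1e16, 1.0, 1.0, 1.0]

# Here the first three 1.0s are absorbed by 1e16 before it is cancelled.
print(sequential_sum(data))

# Here 1e16 and -1e16 cancel inside lane 0, so no 1.0 is lost.
print(four_lane_sum(data))
```

Both orders are valid ways to add the same numbers, yet they give different answers, which is why the compiler will not pick the 4-lane order on its own under strict IEEE semantics.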

Is there a way to help auto-vectorizing compiler to emit saturation arithmetic intrinsic in LLVM?

I have a few for loops that do saturated arithmetic operations.
For instance, my implementation of saturated add is as follows:
static void addsat(Vector &R, Vector &A, Vector &B)
{
    int32_t a, b, r;
    int32_t max_add;
    int32_t min_add;
    const int32_t SAT_VALUE = (1<<(16-1))-1;
    const int32_t SAT_VALUE2 = (-SAT_VALUE - 1);
    const int32_t sat_cond = (SAT_VALUE <= 0x7fffffff);
    const uint32_t SAT = 0xffffffff >> 16;
    for (int i=0; i<R.length; i++)
    {
        a = static_cast<uint32_t>(A.data[i]);
        b = static_cast<uint32_t>(B.data[i]);
        max_add = (int32_t)0x7fffffff - a;
        min_add = (int32_t)0x80000000 - a;
        r = (a>0 && b>max_add) ? 0x7fffffff : a + b;
        r = (a<0 && b<min_add) ? 0x80000000 : a + b;
        if ( sat_cond == 1)
        {
            std_max(r,r,SAT_VALUE2);
            std_min(r,r,SAT_VALUE);
        }
        else
        {
            r = static_cast<uint16_t> (static_cast<int32_t> (r));
        }
        R.data[i] = static_cast<uint16_t>(r);
    }
}
I see that x86 has a paddsat intrinsic that could have been the perfect solution for this loop. The code does get auto-vectorized, but as a combination of multiple operations that follows my code literally. I would like to know the best way to write this loop so that the auto-vectorizer matches it to the addsat operation.
The Vector structure is:
struct V {
    static constexpr int length = 32;
    unsigned short data[32];
};
The compiler used is clang 3.8, and the code was compiled for the AVX2 Haswell x86-64 architecture.
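As a reference for what the loop is meant to compute, here is a minimal Python sketch of a saturating add written in the "widen, add, clamp" shape. It assumes the elements are meant to be signed 16-bit values, which the SAT_VALUE constants suggest even though data is declared as unsigned short. The names sat_add_i16, INT16_MIN and INT16_MAX are made up for illustration. It models the semantics only and says nothing about whether clang 3.8 would match an equivalent C++ loop to a saturating-add instruction.

```python
INT16_MIN = -(1 << 15)      # -32768, same value as SAT_VALUE2 in the question
INT16_MAX = (1 << 15) - 1   # 32767, same value as SAT_VALUE in the question


def sat_add_i16(a, b):
    """Signed 16-bit saturating add: compute the exact sum, then clamp it."""
    return max(INT16_MIN, min(INT16_MAX, a + b))


def addsat(a_values, b_values):
    """Element-wise saturating add over two equal-length sequences."""
    return [sat_add_i16(a, b) for a, b in zip(a_values, b_values)]


# One overflow, one underflow, and one ordinary sum.
print(addsat([32767, -32768, 100], [1, -1, 23]))
```

Because Python integers do not overflow, the sum is exact before clamping. In C++ the same shape needs the operands widened to a larger type (for example int32_t) before the addition.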

Compiling GSL odeiv2 with g++

I'm attempting to compile the example code for the ODE solver, gsl/gsl_odeiv2, using g++. The code below is from their website:
http://www.gnu.org/software/gsl/manual/html_node/ODE-Example-programs.html
and compiles fine under gcc, but g++ throws the error
invalid conversion from 'void*' to 'int (*)(double, const double*, double*, double*, void*)' [-fpermissive]
in the code:
#include <stdio.h>
#include <gsl/gsl_errno.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_odeiv2.h>
int func (double t, const double y[], double f[], void *params)
{
    double mu = *(double *)params;
    f[0] = y[1];
    f[1] = -y[0] - mu*y[1]*(y[0]*y[0] - 1);
    return GSL_SUCCESS;
}
int * jac;
int main ()
{
    double mu = 10;
    gsl_odeiv2_system sys = {func, jac, 2, &mu};
    gsl_odeiv2_driver * d = gsl_odeiv2_driver_alloc_y_new (&sys, gsl_odeiv2_step_rkf45, 1e-6, 1e-6, 0.0);
    int i;
    double t = 0.0, t1 = 100.0;
    double y[2] = { 1.0, 0.0 };
    for (i = 1; i <= 100; i++)
    {
        double ti = i * t1 / 100.0;
        int status = gsl_odeiv2_driver_apply (d, &t, ti, y);
        if (status != GSL_SUCCESS)
        {
            printf ("error, return value=%d\n", status);
            break;
        }
        printf ("%.5e %.5e %.5e\n", t, y[0], y[1]);
    }
    gsl_odeiv2_driver_free (d);
    return 0;
}
The error is given on the line
gsl_odeiv2_system sys = {func, jac, 2, &mu};
Any help in solving this issue would be fantastic. I'm hoping to include some stdlib elements, hence wanting to compile it as C++. Also, if I can get it to compile with g++-4.7, I could more easily multithread it using C++11's additions to the language. Thank you very much.
It looks like you have a problem with the Jacobian. In your particular case you could just use NULL instead of jac in the definition of your system, i.e.
gsl_odeiv2_system sys = {func, NULL, 2, &mu};
In general, your Jacobian must be a function with the signature named in the error message (see the GSL manual). In your code jac is a data pointer rather than such a function, and C++ does not convert it implicitly, which is why your compiler is complaining.
Also, you may want to link the GSL library manually:
-L/usr/local/lib -lgsl -lgslcblas -lm
if you are on a Linux system.

N nested for-loops

I need to create N nested loops to print all combinations of a binary sequence of length N. I'm not sure how to do this.
Any help would be greatly appreciated. Thanks.
Use recursion, e.g., in Java:
public class Foo {
    public static void main(String[] args) {
        new Foo().printCombo("", 5);
    }
    void printCombo(String soFar, int len) {
        if (len == 1) {
            System.out.println(soFar+"0");
            System.out.println(soFar+"1");
        }
        else {
            printCombo(soFar+"0", len-1);
            printCombo(soFar+"1", len-1);
        }
    }
}
will print
00000
00001
00010
...
11101
11110
11111
You have two options here:
Use backtracking instead.
Write a program that generates a dynamic program with N loops and then executes it.
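The first option can be sketched as follows in Python. This is one possible reading of "backtracking" for this problem, not necessarily what the answer's author had in mind, and the name binary_sequences is made up for illustration. A single shared prefix is extended by one bit, explored recursively, and then the bit is undone before the next choice is tried.

```python
def binary_sequences(n):
    """Yield every binary string of length n, in increasing order."""
    prefix = []  # the partial sequence built so far, shared by all calls

    def extend():
        if len(prefix) == n:
            # A complete sequence: emit it and return to try other choices.
            yield "".join(prefix)
            return
        for bit in "01":
            prefix.append(bit)   # choose
            yield from extend()  # explore
            prefix.pop()         # undo the choice (the backtracking step)

    yield from extend()


for sequence in binary_sequences(3):
    print(sequence)
```

The recursion depth equals N, so this replaces the N nested loops with one loop per recursion level.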
You don't need any nested loops for this. You need one recursive function to print a binary value of length N and a for loop to iterate over all numbers [0 .. (2^N)-1].
user949300's solution is also very good, but it might not work in all languages.
Here are my solutions; the recursive one is approximately twice as slow as the iterative one:
#include <stdio.h>
#ifdef RECURSIVE
void print_bin(int num, int len)
{
    if(len == 0)
    {
        printf("\n");
        return;
    }
    print_bin(num >> 1, len -1);
    putchar((num & 1) + '0');
}
#else
void print_bin(int num, int len)
{
    char str[len+1];
    int i;
    str[len] = '\0';
    for (i = 0; i < len; i++)
    {
        str[len-1-i] = !!(num & (1 << i)) + '0';
    }
    printf("%s\n", str);
}
#endif
int main()
{
    int len = 24;
    int i;
    int end = 1 << len;
    for (i = 0; i < end ; i++)
    {
        print_bin(i, len);
    }
    return 0;
}
(I tried this myself on a Mac printing all binary numbers of length 24 and the terminal froze. But that is probably a poor terminal implementation. :-)
$ gcc -O3 binary.c ; time ./a.out > /dev/null ; gcc -O3 -DRECURSIVE binary.c ; time ./a.out > /dev/null
real 0m1.875s
user 0m1.859s
sys 0m0.008s
real 0m3.327s
user 0m3.310s
sys 0m0.010s
I don't think we need recursion or n nested for-loops to solve this problem. It would be easy to handle this using bit manipulation.
In C++, as an example:
for(int i=0;i<(1<<n);i++)
{
    for(int j=0;j<n;j++)
        if(i&(1<<j))
            printf("1");
        else
            printf("0");
    printf("\n");
}
