To compare the speed of Cython with SIMD intrinsics (AVX) against NumPy methods (which, as far as I know, are also vectorized), I have built this simple sum function:
import time
import numpy as np
cimport numpy as np
cimport cython
cdef extern from 'immintrin.h':
    ctypedef double __m256d
    __m256d __cdecl _mm256_load_pd(const double *to_load) nogil
    void __cdecl _mm256_store_pd(double *to_store, __m256d __M) nogil
    __m256d __cdecl _mm256_add_pd(__m256d __M1, __m256d __M2) nogil
@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing
def sum(len, n_sums):
    cdef __m256d __madd1, __madd2, __msum
    cdef double sum4[4]
    len = len - (len % 4)
    a = np.random.rand(len)
    cdef double [::1] a_view = a
    cdef np.ndarray[np.float64_t, ndim=1] ca = np.copy(a)
    cdef long N = n_sums, L = len, i
    cdef double s = 0.0
    cdef Py_ssize_t j

    t1 = time.clock()
    for i in range(N):
        s = 0.0
        with nogil:
            __madd1 = _mm256_load_pd(&a_view[0])
            for j in range(4, L, 4):
                __madd2 = _mm256_load_pd(&a_view[j])
                __msum = _mm256_add_pd(__madd1, __madd2)
                __madd1 = __msum
            _mm256_store_pd(&sum4[0], __msum)
            s = sum4[0] + sum4[1] + sum4[2] + sum4[3]
    t2 = time.clock()
    print(s, sum4)
    print("Cython sum", t2 - t1)

    t1 = time.clock()
    for i in range(N):
        s = np.sum(ca)
    t2 = time.clock()
    print(s)
    print("np sum", t2 - t1)
This gives a nice speedup for the Cython sum, especially for small arrays.
Now, to go further, I would like to parallelize the nogil block with OpenMP. Unfortunately, I can't find a way to do this with cython.parallel.prange/parallel.
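For reference, here is a rough plain-C sketch (my own illustration, not the Cython code above) of the kind of OpenMP reduction I have in mind: each thread keeps a private __m256d partial sum and adds its horizontal sum into a shared scalar at the end.

/* Illustration only: compile with -mavx -fopenmp. The function name and the
   use of unaligned loads are my own choices, not taken from the code above. */
#include <immintrin.h>

double avx_omp_sum(const double *a, long length)  /* length assumed to be a multiple of 4 */
{
    double total = 0.0;
    #pragma omp parallel reduction(+:total)
    {
        __m256d acc = _mm256_setzero_pd();
        double lanes[4];
        long j;
        #pragma omp for
        for (j = 0; j < length; j += 4)
            acc = _mm256_add_pd(acc, _mm256_loadu_pd(&a[j]));  /* per-thread partial sums */
        _mm256_storeu_pd(lanes, acc);
        total += lanes[0] + lanes[1] + lanes[2] + lanes[3];     /* horizontal sum, reduced over threads */
    }
    return total;
}

What I cannot figure out is how to express this pattern through cython.parallel.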
I have the following 4x4 matrix-vector multiply code:
double const __restrict__ a[16];
double const __restrict__ x[4];
double __restrict__ y[4];

//#pragma GCC unroll 1 - does not work either
#pragma GCC nounroll
for ( int j = 0; j < 4; ++j )
{
    double const* __restrict__ aj = a + j * 4;
    double const xj = x[j];
    #pragma GCC ivdep
    for ( int i = 0; i < 4; ++i )
    {
        y[i] += aj[i] * xj;
    }
}
I compile with -O3 -mavx flags. The inner loop is vectorized (single FMAD). However, gcc (7.2) keeps unrolling the outer loop 4 times, unless I use -O2 or lower optimization.
Is there a way to override -O3 unrolling of a particular loop?
NB. A similar #pragma nounroll works if I use Intel icc.
According to the documentation, #pragma GCC unroll 1 is supposed to work, if you place it just so. If it doesn't then you should submit a bug report.
Alternatively, you can use a function attribute to set optimizations, I think:
void myfn () __attribute__((optimize("no-unroll-loops")));
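For instance, a minimal sketch (mine, with a hypothetical function name) that puts the loop from the question into a function carrying that attribute, so only this function is compiled without loop unrolling while the rest of the file keeps -O3:

/* Hypothetical wrapper: optimize("no-unroll-loops") applies -fno-unroll-loops
   to this one function only. */
__attribute__((optimize("no-unroll-loops")))
static void mat_vec_4x4(double const *a, double const *x, double *y)
{
    for (int j = 0; j < 4; ++j)
    {
        double const *aj = a + j * 4;
        double const xj = x[j];
        for (int i = 0; i < 4; ++i)
        {
            y[i] += aj[i] * xj;
        }
    }
}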
For concise functions, when full and partial loop unrolling must be suppressed, try the following function attribute:
__attribute__((optimize("Os")))
I am a newbie to Eigen and want to use OpenBLAS as the backend for Eigen 3.3.4 on Android/ARMv7. Following the site below, I tried a test that uses them in one application (the build environment is Ubuntu 16.04 + Android NDK r15c):
http://eigen.tuxfamily.org/dox-devel/TopicUsingBlasLapack.html
gemm.cpp contains the following code:
#include <iostream>
#include <Eigen/Dense>
#include "cblas.h"

using namespace Eigen;

int main()
{
    double A[6] = {1.0, 2.0, 1.0,-3.0, 4.0,-1.0};
    double B[6] = {1.0, 2.0, 1.0,-3.0, 4.0,-1.0};
    double C[9] = {.5 , .5 , .5 , .5 , .5 , .5 , .5 , .5 , .5};

    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, 3, 3, 2, 1, A, 3, B, 3, 2, C, 3);

    return 0;
}
My Android.mk looks like this:
LOCAL_PATH := $(call my-dir)
#build a test executable
include $(CLEAR_VARS)
LOCAL_MODULE := gemm
LOCAL_C_INCLUDES := /home/yangfan/workspace/study/eigen-3.3.4
LOCAL_C_INCLUDES += /home/yangfan/workspace/study/openBLAS
LOCAL_SRC_FILES := $(LOCAL_PATH)/gemm.cpp
LOCAL_CFLAGS += -DEIGEN_USE_BLAS
LOCAL_CFLAGS += -fPIC -frtti -fexceptions -lz -O3
LOCAL_LDLIBS += -lm -llog -lz
LOCAL_LDLIBS += $(LOCAL_PATH)/openblas-libs/libopenblas.a
include $(BUILD_EXECUTABLE)
When compiling the project, I encountered errors like the following (I picked one to paste here):
In file included from ././gemm.cpp:4:
In file included from /home/yangfan/workspace/study/openBLAS/cblas.h:5:
In file included from /home/yangfan/workspace/study/openBLAS/common.h:751:
/home/yangfan/workspace/study/openBLAS/common_interface.h:105:9: error: functions that differ only in their return type cannot be overloaded
void BLASFUNC(dcopy) (blasint *, double *, blasint *, double *, blasint *);
/home/yangfan/workspace/study/eigen-3.3.4/Eigen/src/Core/util/../../misc/blas.h:44:8: note: previous declaration is here
int BLASFUNC(dcopy) (int *, double *, int *, double *, int *);
There are different return types for the same BLAS functions in OpenBLAS and Eigen.
Q1. Why are there different return types for the same blas APIs in OpenBLAS and Eigen?
Q2. Is there something missing? Hope some guides to use OpenBLAS as a backend of Eigen.
Q3. Which version is higher, 3.3.4 or 3.3.90? ^-^
Thanks so much for your help.
OpenBLAS seems to have two interface styles: Fortran BLAS and CBLAS. Eigen only makes Fortran-style BLAS calls, and developers do not need to provide a header file for them. I should remove Eigen/Dense if I am only using cblas functions.
But I am still puzzled by the issue. Why do Eigen and OpenBLAS declare different return types for the Fortran-style functions?
in common_interface.h of OpenBLAS,
void BLASFUNC(dgemm)(char *, char *, blasint *, blasint *, blasint *, double *,
double *, blasint *, double *, blasint *, double *, double *, blasint *);
in misc/blas.h of Eigen,
int BLASFUNC(dgemm)(const char *, const char *, const int *, const int *, const int *, const double *,
const double *, const int *, const double *, const int *, const double *, double *, const int *);
I have a small CMake project that works perfectly well on Linux but fails on Windows 10 (I tried with two different computers) with the latest versions of CMake and CUDA 8. It finds CUDA just fine, but fails to find the libraries. My CMake file:
cmake_minimum_required(VERSION 3.0)
project(myproject)
find_package(CUDA REQUIRED)
cuda_add_library(myproject STATIC matrix_mm.cu)
target_link_libraries(myproject ${CUDA_CUBLAS_LIBRARIES})
message(STATUS "")
message(STATUS "FoundCUDA : ${CUDA_FOUND}")
message(STATUS "Cuda cublas libraries : ${CUDA_CUBLAS_LIBRARIES}")
In the same folder, I have the header matrix_mm.cuh:
#include <cstdlib>
namespace myproject {
float* cuda_mm(const float *a, const float *b, const size_t m, const size_t k, const size_t n);
} /* end namespace myproject */
and matrix_mm.cu:
#include <cublas_v2.h>
#include "matrix_mm.cuh"

namespace myproject {

// Adapted from https://solarianprogrammer.com/2012/05/31/matrix-multiplication-cuda-cublas-curand-thrust/
void gpu_blas_mmul(const float *a, const float *b, float *c, const size_t m, const size_t k, const size_t n) {
  int lda = m, ldb = k, ldc = m;
  const float alf = 1;
  const float bet = 0;
  const float *alpha = &alf;
  const float *beta = &bet;

  // Create a handle for CUBLAS
  cublasHandle_t handle;
  cublasCreate(&handle);

  // Do the actual multiplication
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);

  // Destroy the handle
  cublasDestroy(handle);
}

float* cuda_mm(const float *a, const float *b, const size_t m, const size_t k, const size_t n) {
  size_t const a_bytes = m * k * sizeof(float);
  size_t const b_bytes = k * n * sizeof(float);
  size_t const c_bytes = m * n * sizeof(float);

  float* c = (float*)std::malloc(c_bytes);

  float *d_A, *d_B, *d_C;
  cudaMalloc(&d_A, a_bytes);
  cudaMalloc(&d_B, b_bytes);
  cudaMalloc(&d_C, c_bytes);

  cudaMemcpy(d_A, a, a_bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, b, b_bytes, cudaMemcpyHostToDevice);

  gpu_blas_mmul(d_A, d_B, d_C, m, k, n);

  cudaMemcpy(c, d_C, c_bytes, cudaMemcpyDeviceToHost);

  cudaFree(d_A);
  cudaFree(d_B);
  cudaFree(d_C);

  return c;
}

} /* end namespace myproject */
On Linux I get:
-- FoundCUDA : TRUE
-- Toolkit root : /usr
-- Cuda cublas libraries : /usr/lib/x86_64-linux-gnu/libcublas.so
While on both Windows 10 machines I get
-- FoundCUDA : TRUE
-- Toolkit root : C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v8.0
-- Cuda cublas libraries : CUDA_cublas_LIBRARY-NOTFOUND;CUDA_cublas_device_LIBRARY-NOTFOUND
...and of course it fails to compile because the linker can't find cublas.
I tried quite a few things: making the lib SHARED instead of STATIC, making sure CUDA was in Windows' environment variables, etc., but nothing works.
This is the CMake snippet I use to find CUDA 8 on Windows 10 with CMake 3.7.1:
cmake_minimum_required(VERSION 3.7)
project(myproject)
# Check for CUDA ENV vars
IF(NOT DEFINED ENV{CUDA_PATH})
    MESSAGE(FATAL_ERROR "CUDA_PATH Environment variable is not set.")
ENDIF(NOT DEFINED ENV{CUDA_PATH})
# Set the toolkit path
FILE(TO_CMAKE_PATH "$ENV{CUDA_PATH}" CUDA_TOOLKIT_ROOT_DIR)
SET(CUDA_TOOLKIT_ROOT_DIR ${CUDA_TOOLKIT_ROOT_DIR} CACHE STRING "Root directory of the Cuda Library" FORCE)
# Find the package
find_package(CUDA REQUIRED)
# Create an interface library as a link target (Requires CMake 3.7.0+)
add_library(cuda INTERFACE)
set_target_properties(cuda PROPERTIES
    INTERFACE_INCLUDE_DIRECTORIES ${CUDA_INCLUDE_DIRS}
    INTERFACE_LINK_LIBRARIES "${CUDA_LIBRARIES};${CUDA_CUFFT_LIBRARIES};${CUDA_CUBLAS_LIBRARIES}"
)
SET(CUDA_HOST_COMPILATION_CPP ON)
cuda_add_library(myproject STATIC matrix_mm.cu)
target_link_libraries(myproject cuda)
I think the quotes around the libraries are important since the Windows path will contain spaces.
I would also make sure that you delete your cache and regenerate the project. It's often the cause of errors when variable values appear correct on the surface (or when you make changes to a non-FORCE cache variable).
I'm trying to create a custom version of numpy.argmin that goes over a 2D array and finds the minimum (it's a custom version because I have some domain-specific information numpy doesn't have, which will result in faster execution).
My first implementation was something like this:
cdef my_argmin(array):
    cdef double[:, ::1] mv = array  # I know the array is in C-order
    cdef int i, j, min_i, min_j
    cdef double tmp, min

    min = 1e50
    for i in range(1000):
        for j in range(1000):
            tmp = mv[i, j]
            if tmp < min:
                min, min_i, min_j = tmp, i, j
    return (min_i, min_j)
This takes about twice as long as numpy's own argmin. The problem was with accessing the memory-view. It's slow. Once I added some pointers:
cdef double *row_ptr
for i in range(1000):
    row_ptr = &mv[i, 0]       # raw pointer to the start of row i
    for j in range(1000):
        tmp = row_ptr[j]      # plain pointer indexing instead of mv[i, j]
performance came close enough to numpy's argmin.
I noticed the same thing happening with 1-dimensional typed memoryviews, requiring me to replace each one with a pointer before iterating over it.
I have the following declaration at the top of my pyx file, just to make sure.
# cython: boundscheck=False, wraparound=False, nonecheck=False
I'm trying to learn how to exploit vectorization with gcc. I followed this tutorial by Erik Holk (with source code here).
I just modified it to use double. I used this dot product to compute the multiplication of randomly generated 1200x1200 square matrices of doubles (300x300 double4). I checked that the results are the same. But what really surprised me is that the simple dot product was actually 10x faster than my manually vectorized one.
Maybe double4 is too big for SSE (it would need AVX2?). But I would expect that even when gcc cannot find a suitable instruction for dealing with a double4 at once, it would still be able to exploit the explicit information that the data come in big chunks for auto-vectorization.
Details:
The results were:
dot_simple:
time elapsed 1.90000 [s] for 1.728000e+09 evaluations => 9.094737e+08 [ops/s]
dot_SSE:
time elapsed 15.78000 [s] for 1.728000e+09 evaluations => 1.095057e+08 [ops/s]
I used gcc 4.6.3 on an Intel® Core™ i5 CPU 750 @ 2.67GHz × 4 with these options: -std=c99 -O3 -ftree-vectorize -unroll-loops --param max-unroll-times=4 -ffast-math, or with just -O2 (the result was the same).
I did it using python/scipy.weave() for convenience, but I hope that doesn't change anything.
The code:
double dot_simple( int n, double *a, double *b ){
    double dot = 0;
    for (int i=0; i<n; i++){
        dot += a[i]*b[i];
    }
    return dot;
}
and this one, which uses gcc vector extensions explicitly:
double dot_SSE( int n, double *a, double *b ){
    const int VECTOR_SIZE = 4;
    typedef double double4 __attribute__ ((vector_size (sizeof(double) * VECTOR_SIZE)));
    double4 sum4 = {0};
    double4* a4 = (double4 *)a;
    double4* b4 = (double4 *)b;
    for (int i=0; i<n; i++){
        sum4 += *a4 * *b4;
        a4++; b4++;
        //sum4 += a4[i] * b4[i];
    }
    union { double4 sum4_; double sum[VECTOR_SIZE]; };
    sum4_ = sum4;
    return sum[0]+sum[1]+sum[2]+sum[3];
}
Then I used it for the multiplication of 300x300 random matrices to measure performance:
void mmul( int n, double* A, double* B, double* C ){
    int n4 = n*4;
    for (int i=0; i<n4; i++){
        for (int j=0; j<n4; j++){
            double* Ai = A + n4*i;
            double* Bj = B + n4*j;
            C[ i*n4 + j ] = dot_SSE( n, Ai, Bj );
            //C[ i*n4 + j ] = dot_simple( n4, Ai, Bj );
            ijsum++;
        }
    }
}
scipy weave code:
def mmul_2(A, B, C, __force__=0 ):
    code = r''' mmul( NA[0]/4, A, B, C ); '''
    weave_options = {
        'extra_compile_args': ['-std=c99 -O3 -ftree-vectorize -unroll-loops --param max-unroll-times=4 -ffast-math'],
        'compiler': 'gcc', 'force': __force__ }
    return weave.inline(code, ['A','B','C'], verbose=3, headers=['"vectortest.h"'], include_dirs=['.'], **weave_options)
One of the main problems is that in your function dot_SSE you loop over n items when you should only loop over n/2 items (or n/4 with AVX).
To fix this with GCC's vector extensions you can do this:
double dot_double2(int n, double *a, double *b ) {
    typedef double double2 __attribute__ ((vector_size (16)));
    double2 sum2 = {};
    int i;
    double2* a2 = (double2*)a;
    double2* b2 = (double2*)b;
    for(i=0; i<n/2; i++) {
        sum2 += a2[i]*b2[i];
    }
    double dot = sum2[0] + sum2[1];
    for(i*=2; i<n; i++) dot += a[i]*b[i];
    return dot;
}
The other problem with your code is that it has a dependency chain. Your CPU can do a simultaneous SSE addition and multiplication but only for independent data paths. To fix this you need to unroll the loop. The following code unrolls the loop by 2 (but you probably need to unroll by three for the best results).
double dot_double2_unroll2(int n, double *a, double *b ) {
    typedef double double2 __attribute__ ((vector_size (16)));
    double2 sum2_v1 = {};
    double2 sum2_v2 = {};
    int i;
    double2* a2 = (double2*)a;
    double2* b2 = (double2*)b;
    for(i=0; i<n/4; i++) {
        sum2_v1 += a2[2*i+0]*b2[2*i+0];
        sum2_v2 += a2[2*i+1]*b2[2*i+1];
    }
    double dot = sum2_v1[0] + sum2_v1[1] + sum2_v2[0] + sum2_v2[1];
    for(i*=4; i<n; i++) dot += a[i]*b[i];
    return dot;
}
Here is a version using double4 which I think is really what you wanted with your original dot_SSE function. It's ideal for AVX (though it still needs to be unrolled) but it will still work with SSE2 as well. In fact with SSE it seems GCC breaks it into two chains which effectively unrolls the loop by 2.
double dot_double4(int n, double *a, double *b ) {
    typedef double double4 __attribute__ ((vector_size (32)));
    double4 sum4 = {};
    int i;
    double4* a4 = (double4*)a;
    double4* b4 = (double4*)b;
    for(i=0; i<n/4; i++) {
        sum4 += a4[i]*b4[i];
    }
    double dot = sum4[0] + sum4[1] + sum4[2] + sum4[3];
    for(i*=4; i<n; i++) dot += a[i]*b[i];
    return dot;
}
If you compile this with FMA it will generate FMA3 instructions. I tested all these functions here (you can edit and compile the code yourself as well) http://coliru.stacked-crooked.com/a/273268902c76b116
Note that using SSE/AVX for a single dot product in matrix multiplication is not the optimal use of SIMD. You should do two (four) dot products at once with SSE (AVX) for double floating point.
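For illustration, here is a minimal sketch (mine, not taken from the code above) of two dot products per pass with the same GCC vector extensions: the same vector a is multiplied against two vectors b0 and b1, so each load of a is reused and each result keeps its own independent SIMD accumulator.

// Sketch only: two dot products of a against b0 and b1 in one pass.
// Assumes the same GCC vector-extension support as the functions above.
typedef double double2 __attribute__ ((vector_size (16)));

void dot2(int n, double *a, double *b0, double *b1, double *out0, double *out1) {
    double2 s0 = {}, s1 = {};
    double2 *a2  = (double2*)a;
    double2 *b02 = (double2*)b0;
    double2 *b12 = (double2*)b1;
    int i;
    for (i = 0; i < n/2; i++) {
        double2 ai = a2[i];        // one load of a feeds both products
        s0 += ai * b02[i];
        s1 += ai * b12[i];
    }
    *out0 = s0[0] + s0[1];
    *out1 = s1[0] + s1[1];
    for (i *= 2; i < n; i++) {     // scalar tail for odd n
        *out0 += a[i] * b0[i];
        *out1 += a[i] * b1[i];
    }
}

In the matrix-multiplication loop above this would produce C[i*n4 + j] and C[i*n4 + j + 1] together, halving the number of horizontal sums.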