Speedup comparison between BLAS and OpenBLAS in Armadillo

I've been testing various open-source libraries for solving linear systems of equations in C++. So far the fastest I've found is Armadillo combined with the OpenBLAS package. Solving a dense linear NxN system with N=5000 takes around 8.3 seconds on my system, which is really fast (without OpenBLAS installed, it takes around 30 seconds).
One reason for this speedup is that Armadillo with OpenBLAS seems to use multiple threads: it runs on two of my cores, whereas Armadillo without OpenBLAS uses only one. I have an i7 processor, so I want to increase the number of cores and test it further. I'm using Ubuntu, so following the OpenBLAS documentation I set the thread count through an environment variable in the terminal.
However, running the code again doesn't seem to increase the number of cores being used, or the speed. Am I doing something wrong, or is 2 the maximum for Armadillo's "solve(A,b)" command? I wasn't able to find Armadillo's source code anywhere to take a look.
Incidentally, does anybody know which method Armadillo/OpenBLAS uses for solving Ax=b (standard LU decomposition with parallelism, or something else)? Thanks!
Edit: Actually, the number of cores being stuck at 2 seems to be a bug that occurs when OpenBLAS is installed with the Synaptic package manager. Reinstalling from source allows it to detect how many cores I actually have (8). Now I can use export OPENBLAS_NUM_THREADS=4 etc. to govern it.

Armadillo doesn't prevent OpenBLAS from using more cores. It's possible that the current implementation of OpenBLAS simply chooses 2 cores for certain operations.
You can see Armadillo's source code directly in the downloadable package (it's open source), in the folder "include". Specifically, have a look at the file "include/armadillo_bits/fn_solve.hpp" (which contains the user-accessible solve() function), and the file "include/armadillo_bits/auxlib_meat.hpp" (which contains the wrapper and housekeeping code for calling the torturous BLAS and LAPACK functions).
If you already have Armadillo installed on your machine, have a look at "/usr/include/armadillo_bits" or "/usr/local/include/armadillo_bits".
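On the question of which method is used: as far as I know, for a general square dense system solve() ends up in LAPACK's LU-based solver (LU factorisation with partial pivoting), with OpenBLAS supplying the threaded kernels underneath. The sketch below is only a plain, single-threaded illustration of that elimination scheme, not Armadillo's or OpenBLAS's actual code. The function name dense_solve and the row-major std::vector layout are my own choices for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Illustrative only: solve A x = b by Gaussian elimination with partial
// pivoting (the same arithmetic as an LU factorisation followed by forward
// and back substitution). A is n x n, stored row-major in a flat vector.
// A and b are taken by value because the elimination overwrites them.
std::vector<double> dense_solve(std::vector<double> A, std::vector<double> b) {
    const std::size_t n = b.size();
    assert(A.size() == n * n);

    for (std::size_t k = 0; k < n; ++k) {
        // Partial pivoting: find the row with the largest |entry| in column k.
        std::size_t p = k;
        for (std::size_t i = k + 1; i < n; ++i) {
            if (std::fabs(A[i * n + k]) > std::fabs(A[p * n + k])) p = i;
        }
        if (A[p * n + k] == 0.0) throw std::runtime_error("matrix is singular");
        if (p != k) {
            for (std::size_t j = 0; j < n; ++j) std::swap(A[k * n + j], A[p * n + j]);
            std::swap(b[k], b[p]);
        }
        // Eliminate the entries below the pivot, updating b along the way.
        for (std::size_t i = k + 1; i < n; ++i) {
            const double m = A[i * n + k] / A[k * n + k];
            for (std::size_t j = k; j < n; ++j) A[i * n + j] -= m * A[k * n + j];
            b[i] -= m * b[k];
        }
    }

    // Back substitution on the resulting upper-triangular system.
    std::vector<double> x(n);
    for (std::size_t i = n; i-- > 0;) {
        double s = b[i];
        for (std::size_t j = i + 1; j < n; ++j) s -= A[i * n + j] * x[j];
        x[i] = s / A[i * n + i];
    }
    return x;
}
```

Calling dense_solve({2, 1, 1, 3}, {3, 5}) solves the 2x2 system 2x + y = 3, x + 3y = 5. The work is O(N^3), which is why a tuned, threaded BLAS makes such a difference at N=5000.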


Benchmarking of the CPU overhead introduced by .so library or Linux executable

I'm looking for a reliable way to quantitatively measure the CPU overhead introduced by my shared library. The library is loaded in the context of a php-fpm process (as a PHP extension). Ultimately, I'd like to run a series of tests for two versions of the .so library, collect stats and compare them, to see to what extent the current version of the library is better or worse than the previous one.
I have already tried a few approaches:
- "perf stat" attached to the PHP-FPM process, measuring cpu-cycles, instructions, and cpu-time.
- "perf record" attached to the PHP-FPM process, collecting CPU cycles. I then extract data from the collected perf.data, considering only data related to my .so file and all-inclusive invocations (itself + related syscalls and kernel). This gives me the CPU overhead for the .so (inclusive).
- valgrind on a few scripts run from the CLI, measuring "instruction requests".
All three options work, but none provides a reliable way to compare overhead because of run-to-run deviations (the error can be up to 30%, which is not acceptable). I tried doing multiple runs and calculating the average, but the accuracy of that is also questionable.
Valgrind is the most accurate of the three, but it provides only the "instructions" count, which does not give the actual CPU overhead. Perf is better (considering cycles, cpu-time, and instructions), but its results vary too much from run to run.
Has anyone got experience with similar tasks? Could you recommend other approaches or Linux profilers to measure overhead accurately and quantitatively?
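For the multiple-runs approach, a plain average is easily skewed by a few slow outlier runs, so one option is to compare the median (or the minimum) of each series instead. Below is a minimal sketch of that bookkeeping, assuming the per-run numbers (cycles, cpu-time, or instructions) have already been collected into vectors. The names RunSummary, summarize, and relative_change are made up for the illustration and are not part of perf or valgrind.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Summary of one series of repeated measurements (hypothetical helper type).
struct RunSummary {
    double min;
    double median;
    double mean;
};

// Takes the samples by value so the caller's order is left untouched.
RunSummary summarize(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    const double median = (n % 2 == 1)
        ? samples[n / 2]
        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
    const double mean =
        std::accumulate(samples.begin(), samples.end(), 0.0) / static_cast<double>(n);
    return RunSummary{samples.front(), median, mean};
}

// Relative change of the new library version against the old one, based on
// medians: +0.10 means the new version costs about 10% more.
double relative_change(const std::vector<double>& old_runs,
                       const std::vector<double>& new_runs) {
    return summarize(new_runs).median / summarize(old_runs).median - 1.0;
}
```

With the samples {100, 102, 101, 150, 99}, the single outlier pulls the mean up to 110.4 while the median stays at 101.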

Numerical differences between older Mac Mini and newer Macbook

I have a project that I compile on both my Mac Mini (Core2 Duo) and a 2014 MacBook with a quad-core i7. Both are running the latest version of Yosemite. The application is single-threaded, and I am compiling the tool and libraries using the exact same version of cmake and the clang (Xcode) compiler. I am getting test failures due to slight numeric differences.
I am wondering if the inconsistency is coming from the clang compiler automatically doing processor-specific optimizations (which I did not select in cmake). Could the difference be between the processors? Do the frameworks use processor-specific optimizations? I am using the BLAS/LAPACK routines from the Accelerate framework. They are called from the SuperLU sparse matrix factorization package.
In general you should not expect results from BLAS or LAPACK to be bitwise reproducible across machines. There are a number of factors that implementors tune to get the best performance, all of which result in small differences in rounding:
- Your two machines have different numbers of processors, which will result in work being divided differently for threading purposes (even if your application is single-threaded, BLAS may use multiple threads internally).
- Your two machines handle Hyper-Threading quite differently, which may also cause BLAS to use different numbers of threads.
- The cache and TLB hierarchy is different between your two machines, which means that different block sizes are optimal for data reuse.
- The SIMD vector size on the newer machine is twice as large as that on the older machine, which again will affect how arithmetic is grouped.
- Finally, the newer machine supports FMA (and using FMA is necessary to get the best performance on it); this also contributes to small differences in rounding.
Any one of these factors would be enough to result in small differences; taken together it should be expected that the results will not be bitwise identical. And that's OK, so long as both results satisfy the error bounds of the computation.
Making the results identical would require severely limiting the performance on the newer machine, which would result in your shiny expensive hardware going to waste.
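Most of the factors above come down to floating-point addition not being associative: regrouping the same terms (different thread counts, block sizes, or vector widths) changes the rounding. Here is a small sketch of that effect, together with a tolerance-based comparison of the kind a test suite could use instead of bitwise equality. The helper names and the tolerance parameters are made up for the example; suitable tolerances depend on the error bounds of your computation.

```cpp
#include <algorithm>
#include <cmath>

// Same three terms, two groupings: (a + b) + c versus a + (b + c).
double sum_left(double a, double b, double c) { return (a + b) + c; }
double sum_right(double a, double b, double c) { return a + (b + c); }

// Hypothetical test helper: accept x and y as equal when they differ by no
// more than abs_tol, or by no more than rel_tol relative to the larger one.
bool nearly_equal(double x, double y, double rel_tol, double abs_tol) {
    const double scale = std::max(std::fabs(x), std::fabs(y));
    return std::fabs(x - y) <= std::max(abs_tol, rel_tol * scale);
}
```

With a = 1e20, b = -1e20, c = 1.0, sum_left gives 1.0 while sum_right gives 0.0, because 1.0 is lost when it is added to -1e20 first. This is an extreme case; the differences between the two machines will normally be in the last few bits.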

What makes OpenCV so large on Windows? Anything I can do about it?

The OpenCV x64 distribution (through emgucv) for Windows has almost half a gigabyte of DLLs, including a single 224 MB opencv_gpu.dll. It seems unlikely that any human could have produced that amount of code, so what gives? Large embedded resources? Code generation bloat (this doesn't seem likely, given that it's a native C/C++ project)?
I want to use it for face recognition, but it's a problem to have such a large binary dependency in git, and it's a hassle to manage outside of source control.
There are no embedded resources (at least the kind Windows DLLs usually have, but since this is a cross-platform product, I'm not sure that's significant.) Maybe lots of initialized C table structures to perform matrix operations?
The size of opencv_gpu is the result of numerous template instantiations compiled for several CUDA architecture versions.
For example, for convolution:
- 7 data types (from CV_8U to CV_64F)
- ~30 hard-coded sizes of convolution kernel
- 8 CUDA architectures (bin: 1.1 1.2 1.3 2.0 2.1(2.0) 3.0 + ptx: 2.0 3.0)
This produces about 1700 variants of convolution.
This way opencv_gpu can grow up to 1 GB for the latest OpenCV release.
If you are not going to use any CUDA acceleration, then you can safely drop opencv_gpu.dll.
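To make the multiplication concrete, here is a small host-side C++ sketch (not OpenCV's actual code) of why templates blow up this way: every distinct combination of template arguments is compiled into its own copy of the function, and each copy is then built once per target architecture. The names convolve_at and variant_count are hypothetical.

```cpp
#include <cstddef>

// Hypothetical stand-in for a templated kernel. Each distinct <T, KSize>
// pair makes the compiler emit a separate copy of this function body.
template <typename T, int KSize>
T convolve_at(const T* data, const T* kernel) {
    T acc = T(0);
    for (int i = 0; i < KSize; ++i) acc += data[i] * kernel[i];
    return acc;
}

// Number of compiled variants: one per (data type, kernel size, architecture).
constexpr std::size_t variant_count(std::size_t data_types,
                                    std::size_t kernel_sizes,
                                    std::size_t architectures) {
    return data_types * kernel_sizes * architectures;
}

// Taking ~30 kernel sizes as exactly 30: 7 x 30 x 8 = 1680, i.e. about 1700.
static_assert(variant_count(7, 30, 8) == 1680, "7 types x 30 sizes x 8 targets");
```

Using convolve_at<float, 3> and convolve_at<int, 2> in the same program already yields two independent functions in the binary; scale that to the numbers above and the DLL size follows.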

GCC optimization options for AMD Opteron 4280: benchmark

We're moving from one local computational server with 2*Xeon X5650 to another one with 2*Opteron 4280. Today I tried to launch my wonderful C programs on the new (AMD) machine and discovered a significant performance drop of more than 50%, keeping all possible parameters the same (even the seed for the random number generator). I started digging into this problem: googling "amd opteron 4200 compiler options" gave me a couple of suggestions, i.e., "flags" (options) for the GCC 4.6.3 compiler available to me. I played with these flags and summarized my findings in the plots linked below.
I'm not allowed to upload pictures, so the charts are here https://plus.google.com/117744944962358260676/posts/EY6djhKK9ab
I'm wondering if anyone (coding folks) could give me any comments on the subject. I'm especially interested in the fact that "... -march=bdver1 -fprefetch-loop-arrays" and "... -fprefetch-loop-arrays -march=bdver1" yield different runtimes.
I'm also not sure whether, say, "-funroll-all-loops" is already included in "-O3" or "-Ofast"; if it is, why does adding this flag one more time make any difference at all?
Why do any additional flags for the Intel processor make the performance even worse (except "-ffast-math", which is kind of obvious, because as I understand it, it enables less precise and therefore faster floating-point arithmetic)?
A bit more details about machines and my program:
The 2*Xeon X5650 machine is an Ubuntu Server with gcc 4.4.3. It is a 2 (CPUs on the motherboard) x 6 (real cores each) x 2 (HyperThreading) = 24-thread machine, and something else was running on it during my "experiments" or benchmarks.
The 2*Opteron 4280 machine is an Ubuntu Server with gcc 4.6.3. It is a 2 (CPUs on the motherboard) x 4 (Bulldozer modules each) x 2 (cores per Bulldozer module) = 16-thread machine, and I was using it solely for my wonderful "benchmarks".
My benchmarking program is just a Monte Carlo simulation: it does some IO in the beginning, and then ~10^5 Monte Carlo loops to give me the result. So I assume it is a mix of integer and floating-point calculations, looping every now and then and checking whether a randomly generated "result" is "good" enough for me or not. The program is single-threaded, and I launched it with the very same parameters for every benchmark (it is obvious, but I should mention it anyway), including the random generator seed (so the results were 100% identical). The program IS NOT MEMORY INTENSIVE. The reported runtime is the "user" time from the standard "/usr/bin/time" command.

What standard techniques are there for using cpu specific features in DLLs?

Short version: I'm wondering if it's possible, and how best, to utilise CPU-specific instructions within a DLL?
Slightly longer version:
When downloading (32-bit) DLLs from, say, Microsoft, it seems that one size fits all processors.
Does this mean that they are strictly built for the lowest common denominator (i.e. the minimum platform supported by the OS)?
Or is there some technique that is used to export a single interface within the DLL but utilise CPU-specific code behind the scenes to get optimal performance? And if so, how is it done?
I don't know of any standard technique but if I had to make such a thing, I would write some code in the DllMain() function to detect the CPU type and populate a jump table with function pointers to CPU-optimized versions of each function.
There would also need to be a lowest common denominator function for when the CPU type is unknown.
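A minimal sketch of that jump-table idea, in portable C++ rather than real DLL code. Everything named here (cpu_has_sse2, sum_baseline, sum_unrolled, init_dispatch, vector_sum) is hypothetical, the "optimized" routine is only a plain C++ stand-in for a CPU-specific build, and the detection uses the GCC/Clang builtin, so on other compilers it simply falls back to the baseline. In a real DLL, init_dispatch() would be the part called from DllMain() on process attach.

```cpp
#include <cstddef>

// Detection hook (assumption: GCC or Clang on x86). Anything else is treated
// as an unknown CPU and gets the lowest-common-denominator code path.
static bool cpu_has_sse2() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;
#endif
}

using SumFn = float (*)(const float*, std::size_t);

// Lowest-common-denominator version.
static float sum_baseline(const float* v, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += v[i];
    return acc;
}

// Stand-in for a CPU-specific version: four independent accumulators.
static float sum_unrolled(const float* v, std::size_t n) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    for (; i < n; ++i) a0 += v[i];
    return (a0 + a1) + (a2 + a3);
}

// The jump table: starts out pointing at the safe fallback.
struct DispatchTable {
    SumFn sum;
};
static DispatchTable g_table = {sum_baseline};

// Call once at load time to repoint the table at the best available version.
void init_dispatch() {
    g_table.sum = cpu_has_sse2() ? sum_unrolled : sum_baseline;
}

// The single exported entry point; callers never see which version runs.
float vector_sum(const float* v, std::size_t n) {
    return g_table.sum(v, n);
}
```

Because the table defaults to the baseline, the exported function is safe to call even if init_dispatch() never ran.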
You can find current CPU info in the registry here:
The DLL is expected to work on every computer WIN32 runs on, so you are stuck with the i386 instruction set in general. There is no official method of exposing functionality/code for specific instruction sets. You have to do it by hand and transparently.
The technique used is basically as follows:
- determine CPU features like MMX and SSE at runtime
- if they are present, use them; if not, have fallback code ready
Because you cannot let your compiler optimise for anything other than i386, you will have to write the code that uses the specific instruction sets in inline assembler. I don't know if there are higher-level language toolkits for this. Determining the CPU features is straightforward, but may also need to be done in assembler.
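For the detection step, here is a hedged sketch using the CPUID wrapper that GCC and Clang ship in <cpuid.h>, which saves writing the assembler by hand. That compiler choice is an assumption on my part (MSVC has its own intrinsic for this), and the struct and function names are made up. The bit positions are the standard CPUID leaf 1 feature flags. On any other compiler or architecture the function reports no features, which selects the fallback code.

```cpp
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#else
#define SKETCH_HAVE_CPUID 0
#endif

// Hypothetical result type: which instruction-set extensions were reported.
struct CpuFeatures {
    bool mmx = false;
    bool sse = false;
    bool sse2 = false;
    bool sse3 = false;
    bool sse41 = false;
    bool sse42 = false;
};

CpuFeatures detect_cpu_features() {
    CpuFeatures f;
#if SKETCH_HAVE_CPUID
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // Leaf 1 returns the basic feature flags in ECX and EDX.
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.mmx   = ((edx >> 23) & 1u) != 0;
        f.sse   = ((edx >> 25) & 1u) != 0;
        f.sse2  = ((edx >> 26) & 1u) != 0;
        f.sse3  = ((ecx >> 0) & 1u) != 0;
        f.sse41 = ((ecx >> 19) & 1u) != 0;
        f.sse42 = ((ecx >> 20) & 1u) != 0;
    }
#endif
    return f;
}
```

The result would be queried once at startup and then used to pick between the optimized and fallback routines.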
An easy way to get the SSE/SSE2 optimizations is to just use the /arch argument for MSVC. I wouldn't worry about fallback--there is no reason to support anything below that unless you have a very niche application.
I believe gcc/g++ have equivalent flags.
Intel's ICC can compile code twice, for different architectures. That way, you can have your cake and eat it. (OK, you get two cakes: your DLL will be bigger.) And even MSVC2005 can do it for very specific cases (e.g. memcpy() can use SSE4).
There are many ways to switch between different versions. A DLL is loaded because the loading process needs functions from it. Function names are converted into addresses. One solution is to let this lookup depend not just on the function name, but also on processor features. Another method uses the fact that the name-to-address lookup goes through a table of pointers as an interim step; you can switch out the entire table. Or you could even have a branch inside critical functions, so foo() calls foo__sse4 when that's faster.
DLLs you download from Microsoft are targeted at the generic x86 architecture for the simple reason that they have to work across the whole multitude of machines out there.
Until the Visual Studio 6.0 time frame (I do not know if it has changed), Microsoft used to optimize its DLLs for size rather than speed. This is because reducing the overall size of the DLL gave a higher performance boost than any other optimization that the compiler could generate: speedups from micro-optimization would be decidedly low compared to speedups from not having the CPU wait for the memory. True improvements in speed come from reducing I/O or from improving the base algorithm.
Only a few critical loops that run at the heart of the program could benefit from micro optimizations simply because of the huge number of times they are invoked. Only about 5-10% of your code might fall in this category. You could rest assured that such critical loops would already be optimized in assembler by the Microsoft software engineers to some level and not leave much behind for the compiler to find. (I know it's expecting too much but I hope they do this)
As you can see, there would be only drawbacks from the increased DLL size that comes with including additional versions of code tuned for different architectures, when most of this code is rarely used or is never part of the critical code that consumes most of your CPU cycles.
