No speed-up after using MKL for Eigen - visual-studio

I use Eigen 3.3 and Intel MKL 2017, and I write and run the program in Visual Studio 2012 on 64-bit Windows 7 with an Intel Xeon(R) E5-1620 v2 @ 3.70GHz CPU.
I believe that my configuration for MKL is correct, because I can successfully run the MKL example codes. The configuration for using Intel MKL from Eigen follows https://eigen.tuxfamily.org/dox/TopicUsingIntelMKL.html. For Visual Studio 2012, I compile the code with the Intel C++ Compiler in Release x64 mode.
However, the following code always takes about 400 seconds, whether or not I define EIGEN_USE_MKL_ALL (i.e., whether Intel MKL is used). It seems that MKL has no effect in Eigen.
Could anyone give some suggestion? Thanks.
#define EIGEN_USE_MKL_ALL // comment this out to build without MKL
#define EIGEN_VECTORIZE_SSE4_2
#include "stdafx.h"
#include <iostream>
#include <Eigen/Core>
#include <Eigen/Dense>
#include <time.h>
using namespace std;
using namespace Eigen;

int main(int argc, char *argv[])
{
    MatrixXd a = MatrixXd::Random(30000, 3000);
    MatrixXd b = MatrixXd::Random(3000, 30000);
    double start = clock();
    MatrixXd c = a * b; // the (30000 x 3000) * (3000 x 30000) product being timed
    double endd = clock();
    double thisTime = (double)(endd - start) / CLOCKS_PER_SEC;
    cout << thisTime << endl;
    system("PAUSE");
    return 0;
}
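One quick sanity check, just a sketch (it assumes mkl.h is reachable, which it should be since the standalone MKL examples build), is to print how many threads Eigen and MKL report they will use:
#define EIGEN_USE_MKL_ALL
#include <Eigen/Core>
#include <iostream>
#include <mkl.h> // assumed available on the include path

int main()
{
    // Eigen's (OpenMP-based) thread count and MKL's maximum thread count;
    // a value of 1 on a multi-core machine means the build is effectively single-threaded.
    std::cout << "Eigen threads: " << Eigen::nbThreads() << std::endl;
    std::cout << "MKL max threads: " << mkl_get_max_threads() << std::endl;
    return 0;
}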

Related

Why is C++ so much faster than C in this code?

My C code is:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void){
    char* a = (char*)malloc(200000);
    a[0] = '\0'; /* strcat needs a NUL-terminated destination; malloc'd memory is uninitialized */
    for (int i = 0; i < 100000; i++){
        strcat(a, "b");
    }
    printf("%s", a);
}
My C++ code is:
#include <iostream>
#include <string>

int main(void){
    std::string a = "";
    for (int i = 0; i < 100000; i++){
        a += "b";
    }
    std::cout << a;
}
On my machine, the C code runs in about 5 seconds, while the C++ code runs in 0.025 seconds!
Now, the C code doesn't check for overflows and has no C++ overhead or classes, yet it is several orders of magnitude slower than my C++ code.
This is with gcc/g++ 6.2.0, compiled with -O3, on a Raspberry Pi.
@erwin was correct.
When I change my code to:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Append dest[0] at position lenSrc; the caller tracks the length,
   so the string never has to be rescanned. */
void mystrcat(char* src, char* dest, int lenSrc){
    src[lenSrc] = dest[0];
}

int main(void){
    char* a = (char*)malloc(200000);
    for (int i = 0; i < 100000; i++){
        mystrcat(a, "b", i);
    }
    a[100000] = 0;
    printf("%s\n", a);
}
It takes about .012s to run (most of that is printing the long string to the screen).
Shlemiel the painter's algorithm at work: every strcat call has to rescan the whole destination string to find the terminating NUL before it can append, so building the string one character at a time costs O(n²).
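As a rough sketch (not the actual library implementation), strcat behaves like this:
#include <stddef.h>

/* Simplified sketch of strcat: find the end of dest, then copy src there. */
char *sketch_strcat(char *dest, const char *src) {
    size_t len = 0;
    while (dest[len] != '\0')   /* O(current length) scan on EVERY call */
        ++len;
    while (*src != '\0')
        dest[len++] = *src++;
    dest[len] = '\0';
    return dest;
}
std::string, by contrast, stores its length and capacity, so a += "b" appends in amortized constant time without rescanning the existing characters.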

Multiplication - Matrix by imaginary unit

I would like to ask if anybody knows why this is not working:
For example, let
SparseMatrix<int> A
and
SparseMatrix<std::complex<float> > B
I would like to do the following math:
B=i*A
As code:
std::complex<float> c;
c=1.0i;
B=A.cast<std::complex<float> >()*c;
or equivalent:
B=A.cast<std::complex<float> >()*1.0i;
I expect all real values of A to become imaginary in B, but there are only zeros, printed as (0,0).
Example:
#include <Eigen/Sparse>
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
using namespace Eigen;
using std::cout;
using std::endl;

int main(int argc, char *argv[]){
    int rows = 5, cols = 5;
    SparseMatrix<int> A(rows, cols);
    A.setIdentity();
    SparseMatrix<std::complex<float> > B;
    std::complex<float> c;
    c = 1i;
    B = A.cast<std::complex<float> >() * 1.0i;
    //B = A.cast<std::complex<float> >() * c;
    cout << B << endl;
    return 0;
}
compile with:
g++ [name].cpp -o [name]
What am I doing wrong?
Thanks a lot for any help!
You need to enable C++14 to get 1.0i working as expected. With GCC or Clang, add the -std=c++14 compiler option.
Then, you can simply do:
MatrixXd A = MatrixXd::Random(3,3);
MatrixXcd B;
B = A * 1.0i;
Same with a SparseMatrix.
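For the sparse case in the question, an equivalent that avoids the literal altogether, just a sketch assuming Eigen 3.x, is to multiply by an explicit std::complex<float>:
#include <Eigen/Sparse>
#include <complex>
#include <iostream>
using namespace Eigen;

int main(){
    SparseMatrix<int> A(5, 5);
    A.setIdentity();
    // Cast to complex<float> and scale by the imaginary unit (0 + 1i)
    SparseMatrix<std::complex<float> > B =
        A.cast<std::complex<float> >() * std::complex<float>(0.0f, 1.0f);
    std::cout << B << std::endl;  // diagonal entries should print as (0,1)
    return 0;
}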

how do i include sm_11_atomic_function.h? [duplicate]

I'm having an issue with my kernel.cu file.
Calling nvcc -v kernel.cu -o kernel.o, I'm getting this error:
kernel.cu(17): error: identifier "atomicAdd" is undefined
My code:
#include "dot.h"
#include <cuda.h>
#include "device_functions.h" //might call atomicAdd
__global__ void dot (int *a, int *b, int *c){
__shared__ int temp[THREADS_PER_BLOCK];
int index = threadIdx.x + blockIdx.x * blockDim.x;
temp[threadIdx.x] = a[index] * b[index];
__syncthreads();
if( 0 == threadIdx.x ){
int sum = 0;
for( int i = 0; i<THREADS_PER_BLOCK; i++)
sum += temp[i];
atomicAdd(c, sum);
}
}
Any suggestions?
You need to specify an architecture to nvcc which supports atomic memory operations (the default architecture is 1.0 which does not support atomics). Try:
nvcc -arch=sm_11 -v kernel.cu -o kernel.o
and see what happens.
EDIT in 2015 to note that the default architecture in CUDA 7.0 is now 2.0, which supports atomic memory operations, so this should not be a problem in newer toolkit versions.
Today, with the latest CUDA SDK and toolkit, this solution will not work.
People also say that adding:
compute_11,sm_11; (or compute_12,sm_12; or compute_13,sm_13;)
compute_20,sm_20;
compute_30,sm_30;
to the CUDA settings in the Project Properties in Visual Studio 2010 will work. It doesn't.
You have to specify this for the .cu file itself, in its own properties (under the C++/CUDA -> Device -> Code Generation tab), such as:
compute_13,sm_13;
compute_20,sm_20;
compute_30,sm_30;
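For reference, when building from the command line instead of Visual Studio, the same code-generation settings correspond roughly to nvcc's -gencode flags (a sketch; pick the compute/sm values that match your GPU):
nvcc -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 kernel.cu -o kernel.o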

Detect current CPU Clock Speed Programmatically on OS X?

I just bought a nifty MBA 13" Core i7. I'm told the CPU speed varies automatically, and pretty wildly, too. I'd really like to be able to monitor this with a simple app.
Are there any Cocoa or C calls to find the current clock speed, without actually affecting it?
Edit: I'm OK with answers using Terminal calls, as well as programmatic.
Thanks!
Try this tool called "Intel Power Gadget". It displays IA frequency and IA power in real time.
http://software.intel.com/sites/default/files/article/184535/intel-power-gadget-2.zip
You can query the CPU speed easily via sysctl, either by command line:
sysctl hw.cpufrequency
Or via C:
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int main() {
    int mib[2];
    unsigned int freq;
    size_t len;
    mib[0] = CTL_HW;
    mib[1] = HW_CPU_FREQ;       /* hw.cpufrequency, reported in Hz */
    len = sizeof(freq);
    sysctl(mib, 2, &freq, &len, NULL, 0);
    printf("%u\n", freq);
    return 0;
}
Since it's an Intel processor, you could always use RDTSC. That's an assembler instruction that returns the current cycle counter, a 64-bit counter that increments every cycle. It'd be a little approximate, but for example:
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

uint64_t rdtsc(void)
{
    uint32_t ret0[2];
    __asm__ __volatile__("rdtsc" : "=a"(ret0[0]), "=d"(ret0[1]));
    return ((uint64_t)ret0[1] << 32) | ret0[0];
}

int main(int argc, const char * argv[])
{
    uint64_t startCount = rdtsc();
    sleep(1);
    uint64_t endCount = rdtsc();
    printf("Clocks per second: %llu", endCount - startCount);
    return 0;
}
This outputs 'Clocks per second: 2002120630' on my 2GHz MacBook Pro.
There is a kernel extension written by "flAked" which logs the CPU p-state to the kernel log.
http://www.insanelymac.com/forum/index.php?showtopic=258612
Maybe you could contact him for the code.
This seems to work correctly on OSX.
However, it doesn't work on Linux, where sysctl is deprecated and KERN_CLOCKRATE is undefined.
#include <sys/sysctl.h>
#include <sys/time.h>
#include <assert.h>

int mib[2];
size_t len;
mib[0] = CTL_KERN;
mib[1] = KERN_CLOCKRATE;
struct clockinfo clockinfo;
len = sizeof(clockinfo);
int result = sysctl(mib, 2, &clockinfo, &len, NULL, 0);
assert(result != -1);
log_trace("clockinfo.hz: %d\n", clockinfo.hz);     /* log_trace: the poster's own logging helper */
log_trace("clockinfo.tick: %d\n", clockinfo.tick);

openMP is not creating threads in visual studio

My OpenMP version did not give any speed boost. I have a dual-core machine and the CPU usage is always 50%, so I tried the sample program given on the Wiki. It looks like the OpenMP compiler (Visual Studio 2008) is not creating more than one thread.
This is the program:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
    int th_id, nthreads;
    #pragma omp parallel private(th_id)
    {
        th_id = omp_get_thread_num();
        printf("Hello World from thread %d\n", th_id);
        #pragma omp barrier
        if ( th_id == 0 ) {
            nthreads = omp_get_num_threads();
            printf("There are %d threads\n", nthreads);
        }
    }
    return EXIT_SUCCESS;
}
This is the output that I get:
Hello World from thread 0
There are 1 threads
Press any key to continue . . .
There's nothing wrong with the program, so presumably there's some issue with how it's being compiled or run. Is this VS2008 Pro? A quick Google around suggests OpenMP is not enabled in the Standard edition. Is OpenMP enabled in Properties -> C/C++ -> Language -> OpenMP? (E.g., are you compiling with /openmp?) Is the environment variable OMP_NUM_THREADS being set to 1 somewhere when you run this?
If you want to test out your program with more than one thread, there are several constructs for specifying the number of threads in an OpenMP parallel region. They are, in order of precedence:
Evaluation of the if clause
Setting of the num_threads clause
Use of the omp_set_num_threads() library function
Setting of the OMP_NUM_THREADS environment variable
Implementation default
It sounds like your implementation is defaulting to one thread (assuming you don't have OMP_NUM_THREADS=1 set in your environment).
To test with 4 threads, for instance, you could add num_threads(4) to your #pragma omp parallel directive.
As the other answer noted, you won't really see any "speedup" because you aren't exploiting any parallelism. But it is reasonable to want to run a "hello world" program with several threads to test it out.
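For instance, a minimal variant of the hello-world program with the thread count forced to 4, just a sketch, would be:
#include <omp.h>
#include <stdio.h>

int main(void) {
    /* Request 4 threads for this region regardless of OMP_NUM_THREADS */
    #pragma omp parallel num_threads(4)
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}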
As mentioned here, http://docs.oracle.com/cd/E19422-01/819-3694/5_compiling.html I got it working by setting the environment variable OMP_DYNAMIC to FALSE
Why would you need more than one thread for that program? It's clearly the case that OpenMP realizes that it doesn't need to create an extra thread to run a program with no loops, no code that could run in parallel whatsoever.
Try running some parallel stuff with OpenMP. Something like this:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNKSIZE 10
#define N 100

int main (int argc, char *argv[])
{
    int nthreads, tid, i, chunk;
    float a[N], b[N], c[N];

    /* Some initializations */
    for (i = 0; i < N; i++)
        a[i] = b[i] = i * 1.0;
    chunk = CHUNKSIZE;

    #pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
    {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
        printf("Thread %d starting...\n", tid);

        #pragma omp for schedule(dynamic,chunk)
        for (i = 0; i < N; i++)
        {
            c[i] = a[i] + b[i];
            printf("Thread %d: c[%d]= %f\n", tid, i, c[i]);
        }
    } /* end of parallel section */
}
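Note that these pragmas are silently ignored unless OpenMP support is enabled at compile time, e.g. /openmp in the Visual Studio project settings (as mentioned above) or, with GCC (file name is just a placeholder):
g++ -fopenmp example.cpp -o example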
If you want some hard core stuff, try running one of these.
