The LAPACK documentation states the following about DSGESV (or ZCGESV for complex numbers):
The dsgesv and zcgesv are mixed precision iterative refinement
subroutines for exploiting fast single precision hardware. They first
attempt to factorize the matrix in single precision (dsgesv) or single
complex precision (zcgesv) and use this factorization within an
iterative refinement procedure to produce a solution with double
precision (dsgesv) / double complex precision (zcgesv) normwise
backward error quality (see below). If the approach fails, the method
switches to a double precision or double complex precision
factorization respectively and computes the solution.
The iterative refinement is not going to be a winning strategy if the
ratio single precision performance over double precision performance
is too small. A reasonable strategy should take the number of
right-hand sides and the size of the matrix into account. This might
be done with a call to ilaenv in the future. At present, iterative
refinement is always attempted.
But how can I know what the ratio of single precision performance over double precision performance is? There is the suggestion to take into account the size of the matrix, but I don't see how exactly the size of the matrix leads to an estimate of this performance ratio.
Would anyone be able to clarify these things?
My guess is that the best way to go is to test both dgesv() and dsgesv()...
Looking at the source code of LAPACK's dsgesv(), here is what it tries to perform:
Cast the matrix A to float As
Call sgetrf() : LU factorization, single precision
Solve the system As.x=b using the LU factorization by calling sgetrs()
Compute the double precision residual r=b-Ax, solve As.x'=r in single precision by calling sgetrs() again, and update x=x+x'.
The last step is repeated until double precision accuracy is achieved (30 iterations max). The criterion for success is
||r|| < sqrt(n) * eps * ||A|| * ||x||
where eps is the precision of double precision floating point numbers (about 1e-16) and n is the size of the matrix. If this fails, dsgesv() falls back to the behavior of dgesv(): it calls dgetrf() (double precision factorization) and then dgetrs(). Hence dsgesv() is a mixed precision algorithm. See this article for instance.
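To make the loop concrete, here is a minimal sketch of the idea for a single right-hand side, written against LAPACKE and CBLAS (you may need to link -lcblas as well). It is not the actual dsgesv() source: the helper name mixed_refine is made up, it handles only one right-hand side, and the stopping test uses a crude 2-norm of A instead of the infinity norm that dsgesv() uses.
#include <lapacke.h>
#include <cblas.h>
#include <float.h>
#include <math.h>
#include <string.h>
#include <stdlib.h>
/* Sketch of mixed precision iterative refinement for ONE right-hand side,
   row-major storage. Returns the number of refinement iterations, -1 if the
   single precision factorization fails, or -31 if 30 iterations were not
   enough (dsgesv() would then fall back to dgetrf()/dgetrs()). */
static int mixed_refine(int n, const double *A, const double *b, double *x)
{
    float  *As   = malloc((size_t)n * n * sizeof(float));
    float  *ws   = malloc((size_t)n * sizeof(float));
    double *r    = malloc((size_t)n * sizeof(double));
    int    *ipiv = malloc((size_t)n * sizeof(int));
    int i, k, ret = -31;                                 /* -31: not converged within 30 iterations */
    for (i = 0; i < n * n; i++) As[i] = (float)A[i];     /* cast A to single precision */
    if (LAPACKE_sgetrf(LAPACK_ROW_MAJOR, n, n, As, n, ipiv) != 0) {
        ret = -1;                                        /* single precision factorization failed */
    } else {
        for (i = 0; i < n; i++) ws[i] = (float)b[i];
        LAPACKE_sgetrs(LAPACK_ROW_MAJOR, 'N', n, 1, As, n, ipiv, ws, 1);
        for (i = 0; i < n; i++) x[i] = ws[i];            /* initial solution from the single precision solve */
        double anrm = cblas_dnrm2(n * n, A, 1);          /* crude stand-in for ||A|| */
        for (k = 1; k <= 30; k++) {                      /* dsgesv() also caps at 30 iterations */
            memcpy(r, b, n * sizeof(double));
            cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                        -1.0, A, n, x, 1, 1.0, r, 1);    /* r = b - A.x, computed in double */
            if (cblas_dnrm2(n, r, 1) <
                sqrt((double)n) * DBL_EPSILON * anrm * cblas_dnrm2(n, x, 1)) {
                ret = k;                                 /* double precision quality reached */
                break;
            }
            for (i = 0; i < n; i++) ws[i] = (float)r[i]; /* correction solve in single precision */
            LAPACKE_sgetrs(LAPACK_ROW_MAJOR, 'N', n, 1, As, n, ipiv, ws, 1);
            for (i = 0; i < n; i++) x[i] += ws[i];       /* x = x + x' */
        }
    }
    free(As); free(ws); free(r); free(ipiv);
    return ret;
}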
Lastly, dsgesv() is expected to outperform dgesv() for a small number of right-hand sides and large matrices, that is, when the cost of the factorization sgetrf()/dgetrf() is much higher than that of the substitutions sgetrs()/dgetrs(). Since the maximum number of iterations set in dsgesv() is 30, an approximate limit would be
nrhs << n / 30
that is, the up to 30 extra O(n^2) solves and residual computations per right-hand side must stay cheap compared with the O(n^3) factorization.
Moreover, sgetrf() must prove significantly faster than dgetrf(). sgetrf() can be faster because single precision halves the memory traffic when bandwidth is the limit and doubles the throughput of vector processing (look for SIMD; for example the SSE instruction ADDPS operates on four floats at once, versus two doubles for ADDPD).
The argument iter of dsgesv() can be tested to check whether the iterative refinement was useful. If it is negative, iterative refinement failed and using dsgesv() was just a waste of time!
Here is C code to compare and time dgesv(), sgesv() and dsgesv(). It can be compiled with gcc main.c -o main -llapacke -llapack -lblas. Feel free to test your own matrix!
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <time.h>
#include <lapacke.h>
int main(void){
srand (time(NULL));
//size of the matrix
int n=2000;
// number of right-hand sides
int nb=3;
int nbrun=1000*100*100/n/n;
//memory initialization
double *aaa=malloc(n*n*sizeof(double));
if(aaa==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
double *aa=malloc(n*n*sizeof(double));
if(aa==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
double *bbb=malloc(n*nb*sizeof(double));
if(bbb==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
double *x=malloc(n*nb*sizeof(double));
if(x==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
double *bb=malloc(n*nb*sizeof(double));
if(bb==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
float *aaas=malloc(n*n*sizeof(float));
if(aaas==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
float *aas=malloc(n*n*sizeof(float));
if(aas==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
float *bbbs=malloc(n*n*sizeof(float));
if(bbbs==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
float *bbs=malloc(n*nb*sizeof(float));
if(bbs==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
int *ipiv=malloc(n*nb*sizeof(int));
if(ipiv==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
int i,j;
//matrix initialization
double cond=1e3;
for(i=0;i<n;i++){
for(j=0;j<n;j++){
if(j==i){
aaa[i*n+j]=pow(cond,(i+1)/(double)n);
}else{
aaa[i*n+j]=1.9*(rand()/(double)RAND_MAX-0.5)*pow(cond,(i+1)/(double)n)/(double)n;
//aaa[i*n+j]=(rand()/(double)RAND_MAX-0.5)/(double)n;
//aaa[i*n+j]=0;
}
}
for(j=0;j<nb;j++){bbb[i*nb+j]=i;} //initialize every right-hand side, not just the first n entries
}
for(i=0;i<n;i++){
for(j=0;j<n;j++){
aaas[i*n+j]=aaa[i*n+j];
}
for(j=0;j<nb;j++){bbbs[i*nb+j]=bbb[i*nb+j];}
}
int k=0;
int ierr;
//estimating the condition number of the matrix
memcpy(aa,aaa,n*n*sizeof(double));
double anorm;
double rcond;
//anorm=LAPACKE_dlange( LAPACK_ROW_MAJOR, 'i', n,n, aa, n);
double work[n];
anorm=LAPACKE_dlange_work(LAPACK_ROW_MAJOR, 'i', n, n, aa, n, work );
ierr=LAPACKE_dgetrf( LAPACK_ROW_MAJOR, n, n,aa, n,ipiv );
if(ierr<0){LAPACKE_xerbla( "LAPACKE_dgetrf", ierr );}
ierr=LAPACKE_dgecon(LAPACK_ROW_MAJOR, 'i', n,aa, n,anorm,&rcond );
if(ierr<0){LAPACKE_xerbla( "LAPACKE_dgecon", ierr );}
printf("condition number is %g\n",anorm,1./rcond);
//testing dgesv()
clock_t t;
t = clock();
for(k=0;k<nbrun;k++){
memcpy(bb,bbb,n*nb*sizeof(double));
memcpy(aa,aaa,n*n*sizeof(double));
ierr=LAPACKE_dgesv(LAPACK_ROW_MAJOR,n,nb,aa,n,ipiv,bb,nb);
if(ierr<0){LAPACKE_xerbla( "LAPACKE_dgesv", ierr );}
}
//testing sgesv()
t = clock() - t;
printf ("dgesv()x%d took me %d clicks (%f seconds).\n",nbrun,t,((float)t)/CLOCKS_PER_SEC);
t = clock();
for(k=0;k<nbrun;k++){
memcpy(bbs,bbbs,n*nb*sizeof(float));
memcpy(aas,aaas,n*n*sizeof(float));
ierr=LAPACKE_sgesv(LAPACK_ROW_MAJOR,n,nb,aas,n,ipiv,bbs,nb);
if(ierr<0){LAPACKE_xerbla( "LAPACKE_sgesv", ierr );}
}
//testing dsgesv()
t = clock() - t;
printf ("sgesv()x%d took me %d clicks (%f seconds).\n",nbrun,t,((float)t)/CLOCKS_PER_SEC);
int iter;
t = clock();
for(k=0;k<nbrun;k++){
memcpy(bb,bbb,n*nb*sizeof(double));
memcpy(aa,aaa,n*n*sizeof(double));
ierr=LAPACKE_dsgesv(LAPACK_ROW_MAJOR,n,nb,aa,n,ipiv,bb,nb,x,nb,&iter);
if(ierr<0){LAPACKE_xerbla( "LAPACKE_dsgesv", ierr );}
}
t = clock() - t;
printf ("dsgesv()x%d took me %d clicks (%f seconds).\n",nbrun,t,((float)t)/CLOCKS_PER_SEC);
if(iter>0){
printf("iterative refinement has succeded, %d iterations\n");
}else{
printf("iterative refinement has failed due to");
if(iter==-1){
printf(" implementation- or machine-specific reasons\n");
}
if(iter==-2){
printf(" overflow in iterations\n");
}
if(iter==-3){
printf(" failure of single precision factorization sgetrf() (ill-conditionned?)\n");
}
if(iter==-31){
printf(" max number of iterations\n");
}
}
free(aaa);
free(aa);
free(bbb);
free(bb);
free(x);
free(aaas);
free(aas);
free(bbbs);
free(bbs);
free(ipiv);
return 0;
}
Output for n=2000:
condition number is 1475.26
dgesv()x2 took me 5260000 clicks (5.260000 seconds).
sgesv()x2 took me 3560000 clicks (3.560000 seconds).
dsgesv()x2 took me 3790000 clicks (3.790000 seconds).
iterative refinement has succeeded, 11 iterations
I have a problem that boils down to performing some arithmetic on each element of a set of matrices. I thought this sounded like the kind of computation that could benefit greatly from being shifted onto the GPU. However, I've only succeeded in slowing down the computation by a factor of 10!
Here are the specifics of my test system:
OS: Windows 10
CPU: Core i7-4700MQ @ 2.40 GHz
GPU: GeForce GT 750M (compute capability 3.0)
CUDA SDK: v7.5
The code below performs calculations equivalent to my production code, on the CPU and on the GPU. The latter is consistently ten times slower on my machine (CPU approx. 650 ms; GPU approx. 7 s).
I've tried changing the grid and block sizes; I've increased and decreased the size of the array passed to the GPU; I've run it through the visual profiler; I've tried integer data rather than doubles, but whatever I do, the GPU version is always significantly slower than the CPU equivalent.
So why is the GPU version so much slower and what changes, that I've not mentioned above, could I try to improve its performance?
Here's my command line: nvcc source.cu -o CPUSpeedTest.exe -arch=sm_30
And here's the contents of source.cu:
#include <iostream>
#include <windows.h>
#include <cuda_runtime_api.h>
void AdjustArrayOnCPU(double factor1, double factor2, double factor3, double denominator, double* array, int arrayLength, double* curve, int curveLength)
{
for (size_t i = 0; i < arrayLength; i++)
{
double adjustmentFactor = factor1 * factor2 * factor3 * (curve[i] / denominator);
array[i] = array[i] * adjustmentFactor;
}
}
__global__ void CudaKernel(double factor1, double factor2, double factor3, double denominator, double* array, int arrayLength, double* curve, int curveLength)
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < arrayLength)
{
double adjustmentFactor = factor1 * factor2 * factor3 * (curve[idx] / denominator);
array[idx] = array[idx] * adjustmentFactor;
}
}
void AdjustArrayOnGPU(double array[], int arrayLength, double factor1, double factor2, double factor3, double denominator, double curve[], int curveLength)
{
double *dev_row, *dev_curve;
cudaMalloc((void**)&dev_row, sizeof(double) * arrayLength);
cudaMalloc((void**)&dev_curve, sizeof(double) * curveLength);
cudaMemcpy(dev_row, array, sizeof(double) * arrayLength, cudaMemcpyHostToDevice);
cudaMemcpy(dev_curve, curve, sizeof(double) * curveLength, cudaMemcpyHostToDevice);
CudaKernel<<<100, 1000>>>(factor1, factor2, factor3, denominator, dev_row, arrayLength, dev_curve, curveLength);
cudaMemcpy(array, dev_row, sizeof(double) * arrayLength, cudaMemcpyDeviceToHost);
cudaFree(dev_curve);
cudaFree(dev_row);
}
void FillArray(int length, double row[])
{
for (size_t i = 0; i < length; i++) row[i] = 0.1 + i;
}
int main(void)
{
const int arrayLength = 10000;
double arrayForCPU[arrayLength], curve1[arrayLength], arrayForGPU[arrayLength], curve2[arrayLength];
FillArray(arrayLength, curve1);
FillArray(arrayLength, curve2);
///////////////////////////////////// CPU Version ////////////////////////////////////////
LARGE_INTEGER StartingTime, EndingTime, ElapsedMilliseconds, Frequency;
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
for (size_t iterations = 0; iterations < 10000; iterations++)
{
FillArray(arrayLength, arrayForCPU);
AdjustArrayOnCPU(1.0, 1.0, 1.0, 1.0, arrayForCPU, 10000, curve1, 10000);
}
QueryPerformanceCounter(&EndingTime);
ElapsedMilliseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMilliseconds.QuadPart *= 1000;
ElapsedMilliseconds.QuadPart /= Frequency.QuadPart;
std::cout << "Elapsed Milliseconds: " << ElapsedMilliseconds.QuadPart << std::endl;
///////////////////////////////////// GPU Version ////////////////////////////////////////
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
for (size_t iterations = 0; iterations < 10000; iterations++)
{
FillArray(arrayLength, arrayForGPU);
AdjustArrayOnGPU(arrayForGPU, 10000, 1.0, 1.0, 1.0, 1.0, curve2, 10000);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
std::cout << "CUDA Elapsed Milliseconds: " << elapsedTime << std::endl;
cudaEventDestroy(start);
cudaEventDestroy(stop);
return 0;
}
And here is an example of the output of CPUSpeedTest.exe
Elapsed Milliseconds: 565
CUDA Elapsed Milliseconds: 7156.76
What follows is likely to be embarrassingly obvious to most developers working with CUDA, but may be of value to others - like myself - who are new to the technology.
The GPU code is ten times slower than the CPU equivalent because the GPU code exhibits a perfect storm of performance-wrecking characteristics.
The GPU code spends most of its time allocating memory on the GPU, copying data to the device, performing a very simple calculation (which is extremely fast regardless of the processor it runs on) and then copying the data back from the device to the host.
As noted in the comments, if an upper bound exists on the size of the data structures being processed, then a buffer on the GPU can be allocated exactly once and reused (see the sketch below). In the code above, this takes the GPU-to-CPU runtime ratio down from 10:1 to 4:1.
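A rough sketch of that change to the code above (illustrative only: maxArrayLength is an assumed upper bound, error checking is omitted, and this is not the exact code behind the 4:1 figure):
// Allocate the device buffers once, outside the loop, and reuse them.
double *dev_row = nullptr, *dev_curve = nullptr;
cudaMalloc((void**)&dev_row, sizeof(double) * maxArrayLength);
cudaMalloc((void**)&dev_curve, sizeof(double) * maxArrayLength);
for (size_t iterations = 0; iterations < 10000; iterations++)
{
    FillArray(arrayLength, arrayForGPU);
    cudaMemcpy(dev_row, arrayForGPU, sizeof(double) * arrayLength, cudaMemcpyHostToDevice);
    // curve2 never changes, so this copy could even be hoisted out of the loop too.
    cudaMemcpy(dev_curve, curve2, sizeof(double) * arrayLength, cudaMemcpyHostToDevice);
    CudaKernel<<<100, 1000>>>(1.0, 1.0, 1.0, 1.0, dev_row, arrayLength, dev_curve, arrayLength);
    cudaMemcpy(arrayForGPU, dev_row, sizeof(double) * arrayLength, cudaMemcpyDeviceToHost);
}
cudaFree(dev_curve);
cudaFree(dev_row);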
The remaining performance disparity comes down to the fact that the CPU can perform the required calculation, in serial, millions of times in a very short span because it is so simple. In the code above, the calculation involves reading a value from an array, a few multiplications, and finally an assignment to an array element. Something this simple must be performed millions of times before the benefits of doing it in parallel outweigh the unavoidable cost of transferring the data to the GPU and back. On my test system, a million array elements is the break-even point, where GPU and CPU take (approximately) the same amount of time.
I have a problem to solve that can be translated into difference logic, and rather than implementing a decision procedure, I would like to use z3 for this purpose.
Nevertheless, I ran a few examples and I got exponential-like runtimes (even though there is a polynomial-time decision procedure for it). I am new to Z3 and I don't know if I am doing something wrong. Here is the code that I am using (C++ API), varying the "max" variable.
#include <z3++.h>
#include <ctime>
#include <iostream>
#include <string>
using namespace z3;
int main(int argc, char **argv) {
context c;
solver s(c, "QF_IDL");
int max = 10000;
int prev = 0;
for(int i = 1; i < max; ++i){
expr x = s.ctx().int_const(std::to_string(i).c_str());
expr y = s.ctx().int_const(std::to_string(++i).c_str());
expr pr = s.ctx().int_const(std::to_string(prev).c_str());
s.add(pr < x);
s.add(x < y);
prev = i;
}
s.add(s.ctx().int_const(std::to_string(max-1).c_str()) < s.ctx().int_const(std::to_string(0).c_str()));
clock_t begin = clock();
switch (s.check()) {
case unsat: std::cout << "UNSAT"; break;
case sat: std::cout << "SAT"; break;
case unknown: std::cout << "unknown\n"; break;
}
clock_t end = clock();
double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
std::cout << "elapsed time: " << elapsed_secs;
}
Many thanks in advance,
Pedro
By default, Z3 uses the simplex engine and sometimes a Floyd-Warshall engine to solve your constraints, even when the logic is QF_IDL. In this case it uses the simplex engine, and the size of the rows grows quadratically for this example.
You can force the sparse difference logic solver by inserting the following into your program:
params p(c);
p.set("auto_config", false);
p.set("smt.arith.solver", (unsigned)1);
solver s(c, "QF_IDL");
s.set(p);
This sets the arithmetic solver to the sparse difference logic solver.
It does not suffer from the space overhead. It still takes quadratic time,
or to be more precise, time proportional to |V||E|, where |V| is the number
of variables and |E| is the number of inequalities.
The main time bottleneck in this case is big-num arithmetic, which is not
necessary in your case. I added an update to the unstable branch of Z3 so that it recognizes scenarios that use only small integers and switches to a
more efficient representation. This brings the time on the larger examples down by a factor of about 5. Nevertheless, the overhead is still |V||E|.
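For reference, wiring this into the program from the question only changes the construction of the solver at the top of main (a sketch; the rest of the loop stays exactly as in the question):
context c;
params p(c);
p.set("auto_config", false);
p.set("smt.arith.solver", (unsigned)1); // 1 selects the sparse difference logic solver
solver s(c, "QF_IDL");
s.set(p);
// ... add the chain of difference constraints and call s.check() as before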
On Darwin, the POSIX standard clock_gettime(CLOCK_MONOTONIC) timer is not available. Instead, the highest resolution monotonic timer is obtained through the mach_absolute_time function from mach/mach_time.h.
The result returned may be an unadjusted tick count from the processor, in which case the time units could be a strange multiple. For example, on a CPU with a 33MHz tick count, Darwin returns 1000000000/33333335 as the exact units of the returned result (ie, multiply the mach_absolute_time by that fraction to obtain a nanosecond value).
We usually wish to convert from exact ticks to "standard" (decimal) units, but unfortunately, naively multiplying the absolute time by the fraction will overflow even in 64-bit arithmetic. This is an error that Apple's sole piece of documentation on mach_absolute_time falls into (Technical Q&A QA1398).
How should I write a function that correctly uses mach_absolute_time?
Note that this is not a theoretical problem: the sample code in QA1398 completely fails to work on PowerPC-based Macs. On Intel Macs, mach_timebase_info always returns 1/1 as the scaling factor because the CPU's raw tick count is unreliable (dynamic speed-stepping), so the API does the scaling for you. On PowerPC Macs, mach_timebase_info returns either 1000000000/33333335 or 1000000000/25000000, so Apple's provided code definitely overflows every few minutes. Oops.
Most-precise (best) answer
Perform the arithmetic at 128-bit precision to avoid the overflow!
#include <mach/mach_time.h>
#include <assert.h>
#include <stdint.h>
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
}
uint64_t scale(uint64_t i) {
return scaleHighPrecision(i - bias, tb.numer, tb.denom);
}
static uint64_t scaleHighPrecision(uint64_t i, uint32_t numer,
uint32_t denom) {
uint64_t high = (i >> 32) * numer;
uint64_t low = (i & 0xffffffffull) * numer / denom;
uint64_t highRem = ((high % denom) << 32) / denom;
high /= denom;
return (high << 32) + highRem + low;
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return data.scale(now);
}
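The call pattern is the same for all three variants; a typical use (doSomething() is just a placeholder for the work being timed) looks like:
uint64_t t0 = monotonicTimeNanos();
doSomething();                                     // placeholder for the code being timed
uint64_t elapsedNanos = monotonicTimeNanos() - t0;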
A simple low-resolution answer
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process. The clock may run up to 0.1% faster or slower
// than the "exact" tick count.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
if (tb.denom > 1024) {
double frac = (double)tb.numer/tb.denom;
tb.denom = 1024;
tb.numer = tb.denom * frac + 0.5;
assert(tb.numer > 0);
}
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return (now - data.bias) * data.tb.numer / data.tb.denom;
}
A fiddly solution that uses low-precision arithmetic but continued fractions to avoid loss of accuracy
// This function returns the rational number inside the given interval with
// the smallest denominator (and smallest numerator breaks ties; correctness
// proof neglects floating-point errors).
static mach_timebase_info_data_t bestFrac(double a, double b) {
if (floor(a) < floor(b))
{ mach_timebase_info_data_t rv = {(int)ceil(a), 1}; return rv; }
double m = floor(a);
mach_timebase_info_data_t next = bestFrac(1/(b-m), 1/(a-m));
mach_timebase_info_data_t rv = {(int)m*next.numer + next.denom, next.numer};
return rv;
}
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process. The clock may run up to 0.1% faster or slower
// than the "exact" tick count. However, although the bound on the error is
// the same as for the pragmatic answer, the error is actually minimized over
// the given accuracy bound.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
double frac = (double)tb.numer/tb.denom;
uint64_t spanTarget = 315360000000000000llu; // 10 years
if (getExpressibleSpan(tb.numer, tb.denom) >= spanTarget)
return;
for (double errorTarget = 1/1024.0; errorTarget > 0.000001;) {
mach_timebase_info_data_t newFrac =
bestFrac((1-errorTarget)*frac, (1+errorTarget)*frac);
if (getExpressibleSpan(newFrac.numer, newFrac.denom) < spanTarget)
break;
tb = newFrac;
errorTarget = fabs((double)tb.numer/tb.denom - frac) / frac / 8;
}
assert(getExpressibleSpan(tb.numer, tb.denom) >= spanTarget);
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return (now - data.bias) * data.tb.numer / data.tb.denom;
}
The derivation
We aim to reduce the fraction returned by mach_timebase_info to one that is essentially the same, but with a small denominator. The size of the timespan that we can handle is limited only by the size of the denominator, not the numerator of the fraction we shall multiply by:
uint64_t getExpressibleSpan(uint32_t numer, uint32_t denom) {
// This is just less than the smallest thing we can multiply numer by without
// overflowing. ceilLog2(numer) = 64 - number of leading zeros of numer
uint64_t maxDiffWithoutOverflow = ((uint64_t)1 << (64 - ceilLog2(numer))) - 1;
return maxDiffWithoutOverflow * numer / denom;
}
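ceilLog2 is not spelled out in the snippets above; per the comment it is just 64 minus the number of leading zero bits of its argument. A minimal sketch, assuming a GCC/Clang-style builtin, could be:
// 64 - number of leading zeros of x (x must be non-zero).
static unsigned ceilLog2(uint64_t x) {
  return 64 - (unsigned)__builtin_clzll(x);
}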
If denom=33333335 as returned by mach_timebase_info, we can handle differences of only a few minutes before the multiplication by numer overflows. As getExpressibleSpan shows by calculating a rough lower bound for this, the size of numer doesn't matter: halving numer doubles maxDiffWithoutOverflow. The only goal therefore is to produce a fraction close to numer/denom that has a smaller denominator. The simplest method to do this is using continued fractions.
The continued fractions method is rather handy. bestFrac clearly works correctly if the provided interval contains an integer: it returns the least integer in the interval over 1. Otherwise, it calls itself recursively with a strictly larger interval and returns m+1/next. The final result is a continued fraction that can be shown by induction to have the correct property: it's optimal, the fraction inside the given interval with the least denominator.
Finally, we reduce the fraction Darwin passes us to a smaller one to use when rescaling the mach_absolute_time to nanoseconds. We may introduce an error here because we can't reduce the fraction in general without losing accuracy. We set ourselves the target of 0.1% error, and check that we've reduced the fraction enough for common timespans (up to ten years) to be handled correctly.
Arguably the method is over-complicated for what it does, but it handles correctly anything the API can throw at it, and the resulting code is still short and extremely fast (bestFrac typically recurses only three or four iterations deep before returning a denominator less than 1000 for random intervals [a,a*1.002]).
You're worrying about overflow when multiplying/dividing with values from the mach_timebase_info struct, which is used for conversion to nanoseconds. So, while it may not fit your exact needs, there are easier ways to get a count in nanoseconds or seconds.
All solutions below are using mach_absolute_time internally (and NOT the wall clock).
Use double instead of uint64_t
(supported in Objective-C and Swift)
double tbInSeconds = 0;
mach_timebase_info_data_t tb;
kern_return_t kError = mach_timebase_info(&tb);
if (kError == 0) {
tbInSeconds = 1e-9 * (double)tb.numer / (double)tb.denom;
}
(remove the 1e-9 if you want nanoseconds)
Usage:
uint64_t start = mach_absolute_time();
// do something
uint64_t stop = mach_absolute_time();
double durationInSeconds = tbInSeconds * (stop - start);
Use ProcessInfo.processInfo.systemUptime
(supported in Objective-C and Swift)
It does the job in double seconds directly:
CFTimeInterval start = NSProcessInfo.processInfo.systemUptime;
// do something
CFTimeInterval stop = NSProcessInfo.processInfo.systemUptime;
NSTimeInterval durationInSeconds = stop - start;
For reference, the source code of systemUptime just does something similar to the previous solution:
struct mach_timebase_info info;
mach_timebase_info(&info);
__CFTSRRate = (1.0E9 / (double)info.numer) * (double)info.denom;
__CF1_TSRRate = 1.0 / __CFTSRRate;
uint64_t tsr = mach_absolute_time();
return (CFTimeInterval)((double)tsr * __CF1_TSRRate);
Use QuartzCore.CACurrentMediaTime()
(supported in Objective-C and Swift)
Same as systemUptime, but without being open source.
Use Dispatch.DispatchTime.now()
(supported in Swift only)
Another wrapper around mach_absolute_time(). Base precision is nanoseconds, backed with UInt64.
let start = DispatchTime.now()
// do something
let stop = DispatchTime.now()
let durationInSeconds = Double(stop.uptimeNanoseconds - start.uptimeNanoseconds) / 1_000_000_000
For reference, the source code of DispatchTime.now() shows that it basically just returns DispatchTime(rawValue: mach_absolute_time()). And the calculation for uptimeNanoseconds is:
(result, overflow) = result.multipliedReportingOverflow(by: UInt64(DispatchTime.timebaseInfo.numer))
result = overflow ? UInt64.max : result / UInt64(DispatchTime.timebaseInfo.denom)
So it just clamps the result to UInt64.max if the multiplication can't be stored in a UInt64.
If mach_absolute_time() ever wraps back to 0, reset your time calculations whenever the current value is less than the last one you checked.
That's the problem: the documentation doesn't say what happens when the uint64 reaches all ones (binary).
Read it: https://developer.apple.com/documentation/kernel/1462446-mach_absolute_time
Does anybody know why vector allocation on the device takes so long for the first run when compiled in Debug mode? In my particular case (NVIDIA Quadro 3000M, CUDA Toolkit 6.0, Windows 7, MSVC2010) the first run of the Debug build takes over 40 seconds, while subsequent runs (with no recompilation) take 10 times less (vector allocation on the device for the Release build takes over 1 second).
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>
#include <ctime>
#include <cstdio>
#include <iostream>
int main(void) {
clock_t t;
t = clock();
thrust::host_vector<int> h_vec( 100);
clock_t dt = clock() - t;
printf ("allocation on host - %f sec.\n", (float)dt/CLOCKS_PER_SEC);
t = clock();
thrust::generate(h_vec.begin(), h_vec.end(), rand);
dt = clock() - t;
printf ("initialization on host - %f sec.\n", (float)dt/CLOCKS_PER_SEC);
t = clock();
thrust::device_vector<int> d_vec( 100); // First run for Debug compiled version takes over 40 seconds...
dt = clock() - t;
printf ("allocation on device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);
t = clock();
d_vec[0] = h_vec[0];
dt = clock() - t;
printf ("copy one to device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);
t = clock();
d_vec = h_vec;
dt = clock() - t;
printf ("copy all to device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);
t = clock();
thrust::sort(d_vec.begin(), d_vec.end());
dt = clock() - t;
printf ("sort on device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);
t = clock();
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
dt = clock() - t;
printf ("copy to host - %f sec.\n", (float)dt/CLOCKS_PER_SEC);
t = clock();
for(int i=0; i<10; i++)
printf("%d\n", h_vec[i]);
dt = clock() - t;
printf ("output - %f sec.\n", (float)dt/CLOCKS_PER_SEC);
std::cin.ignore();
return 0;
}
Most of the time you are measuring for the first vector instantiation isn't the cost of the vector allocation and initialisation; it is the overhead associated with the CUDA runtime and driver. I would guess that if you changed your code to something like this:
int main(void) {
clock_t t;
....
cudaFree(0); // This forces context establishment and lazy runtime overheads
t = clock();
thrust::device_vector<int> d_vec( 100); // First run for Debug compiled version takes over 40 seconds...
dt = clock() - t;
printf ("allocation on device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);
.....
You should see that the time you measure to allocate the vector between first and second runs becomes the same even though the wall clock time to run the program shows a big difference.
I don't have a good explanation as to why there is such a large difference in startup time between the first and second runs, but if I were to hazard a guess, it is that some driver-level JIT recompilation is being performed on the first run, and the driver caches the code for subsequent runs. One thing to check is that you are compiling code for the correct architecture for your GPU (see the example below); that would eliminate driver recompilation as a source of the time difference.
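For example, for a Fermi-class card such as the Quadro 3000M, the equivalent command-line build would look something like this (illustrative; match the values to your GPU's actual compute capability):
nvcc -gencode arch=compute_20,code=sm_20 main.cu -o main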
The nvprof utility can provide you with an API trace and timings. You might want to run it and see where in the API call sequence the difference in time is arising from. It isn't beyond the realms of possibility that you are seeing the effects of some sort of driver bug, but without more information it is impossible to say.
It looks like in my case (NVIDIA Quadro 3000M, CUDA Toolkit 6.0, Windows 7, MSVC2010) the problem is solved by changing the project's CUDA C/C++ / Code Generation option from compute_10,sm_10 to compute_20,sm_20, which targets the newer GPU architecture. So I've got my happiness for today :)