Paradoxical VTune Amplifier microarchitecture exploration results

I am trying to optimize a sin/cos approximation function. At its core there is a simple Horner scheme consisting of a bunch of multiplies and adds. Compiler is MSVC from VS2017, processor is Intel Xeon E5-1650, hyperthreading is on (but observations are basically identical if it is off).
Using Intel's VTune Amplifier 2019, I have obtained profiling results (release build, of course) for > 1 min of running the function on random doubles (between -2 pi and 2 pi), with ~40% of clockticks spent in the section shown below (the rest is range reduction + test harness). However, I cannot make sense of the microarchitectural metrics that VTune presents to me.
(MSVC's source line attribution after inlining is awful.)
Here is the corresponding C++ code that got inlined:
void stableSinCosApproximation(double x, double* sinApprox, double* cosApprox)
{
double x2 = x * x;
*sinApprox = x * (sinCoeff[7] + x2 * (sinCoeff[6] + x2 * (sinCoeff[5] + x2 * (sinCoeff[4] + x2 * (sinCoeff[3] + x2 * (sinCoeff[2] + x2 * (sinCoeff[1] + x2 * sinCoeff[0])))))));
*cosApprox = (cosCoeff[7] + x2 * (cosCoeff[6] + x2 * (cosCoeff[5] + x2 * (cosCoeff[4] + x2 * (cosCoeff[3] + x2 * (cosCoeff[2] + x2 * (cosCoeff[1] + x2 * cosCoeff[0])))))));
}
Clearly, the assembly listing has only one continuous block of instructions. No jumps (and no jump targets), no branching or conditional execution at all. Yet, there are multiple metrics here whose values I cannot make sense of with the information provided by VTune's inbuilt or online help.
Concrete questions:
The second half of the code has almost no attribution, clockticks and all. Why?
The first half has an ever-increasing CPI rate. OK, maybe this and the previous point are due to something about the attribution going wrong, but I don't get it.
The metrics say that there is bad speculation. But upon expanding that column, it shows neither branch mispredicts nor machine clears.
What is this supposed to tell me? In what capacity does the CPU speculate here?
I also allegedly lose a good chunk of uops to being front-end bound. Is the correlation to the bad speculation column only coincidence? What should I do with this information?
Preemptive notes:
The point of reimplementing this is guaranteed consistency across multiple platforms (from the same binary). The built-in sin/cos functions can vary by a few ULP across machines, which can kill reproducibility of results.
Yes, I know about FMAs, but not every platform that this (single) binary has to run on provides them. I'm not going for run-time dispatch at the moment.
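Purely to illustrate the FMA point above (this is a hypothetical helper, not part of the question's code, and it assumes sinCoeff/cosCoeff are plain arrays of eight doubles): with FMA acceptable, each Horner step could be fused via std::fma, along these lines:
#include <cmath>  // std::fma

// Illustration only: the same nested Horner evaluation, one fused
// multiply-add per coefficient. Note that fused and unfused evaluation
// round differently, which matters for the bit-reproducibility goal above.
static double hornerFma(double x2, const double (&c)[8])
{
    double r = c[0];
    for (int i = 1; i < 8; ++i)
        r = std::fma(x2, r, c[i]);  // r = x2 * r + c[i], rounded once
    return r;
}
Used as *sinApprox = x * hornerFma(x2, sinCoeff) and *cosApprox = hornerFma(x2, cosCoeff), this reproduces the nesting above; without hardware FMA, std::fma typically falls back to a slow correctly-rounded library routine, which is exactly why it is off the table here.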

Can you show screenshots from VTune's Bottom-up pane instead of the assembly pane? It is interesting to see the characterization for the whole function (e.g. values of Bad Speculation and Front-End Bound, and counts for the BR_MISP_RETIRED.ALL_BRANCHES_PS and MACHINE_CLEARS.COUNT events).

Related

TensorFlow weight initialization taking 99% of total run time

Problem
I noticed that one of my networks was running slowly and nvidia-smi was reporting only ~10% GPU usage. After running the profiler, I saw that the TruncatedNormal process was taking the vast majority of the running time (see photo). What could be causing this kind of problem?
Code
Weight declaration function (from MNIST tutorial):
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)
Code in action:
# First Layer
with tf.name_scope('input'):
    x = tf.placeholder(tf.float32, [None, Nvars])
    w1 = weight_variable([Nvars, 8])
    b1 = bias_variable([8])
    y1 = tf.nn.relu(tf.matmul(x, w1) + b1)
The issue is that the tensorflow functions that you are using have multiplicative properties. When a CPU needs to invoke the product handler, performance degrades exponentially.
The problem may be avoided if the calculation is pre-supported by the kernel. What you need to do is use compiler optimization; i.e. change some compiler settings which will cause the compiler to 'ease' the multiplication.
Or... you could just not multiply, and just add a few times.
Hope I helped.

Eigen, get inverse failed, and inverse() is so slow

I want to compute the inverse of a matrix, so here is the code:
if (m.rows() == m.fullPivLu().rank())
{
    res = m.inverse();
}
The dimensions of m and res are both 5000 x 5000. When I run the code on a high-performance computing machine (Linux, Tianhe-2 supercomputer), the process was killed at res = m.inverse();, and no core file or dump information was generated. The console just returned "Killed" and the process exited.
But nothing goes wrong on my Ubuntu laptop.
Also, inverse()'s performance is poor; it takes a lot of time.
So why was inverse() killed on the high-performance machine? Thank you!
Full-pivoting LU is known to be very slow, regardless of its implementation.
Better use PartialPivLU, which benefits from high-performance matrix-matrix operations. Then, to get the best out of Eigen, use the 3.3-beta2 release and compile with both FMA (-mfma) and OpenMP (e.g., -fopenmp) support, and don't forget to enable compiler optimizations (-O3). This operation should not take more than a few seconds.
Finally, do you really need to explicitly compute the inverse? If you only apply it to some vectors or matrices (i.e., A^-1 * B or B * A^-1) then better apply the inverse in factorized form rather than explicitly computing it. With Eigen 3.3:
MatrixXd A = ...;
PartialPivLU<MatrixXd> lu(A);
x = lu.inverse() * b; // solve Ax=b, same as x = lu.solve(b);
x = b * lu.inverse(); // solve xA=b
In these expressions, the inverse is not explicitly computed!
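For reference, here is a minimal self-contained sketch of that factorized-solve route (the size, the random test data, and the residual check are made up purely for illustration):
#include <iostream>
#include <Eigen/Dense>

int main()
{
    using Eigen::MatrixXd;
    using Eigen::VectorXd;

    const int n = 1000;                   // illustrative size; the question uses 5000 x 5000
    MatrixXd A = MatrixXd::Random(n, n);  // stand-in for the real matrix
    VectorXd b = VectorXd::Random(n);

    Eigen::PartialPivLU<MatrixXd> lu(A);  // factorize once; uses fast matrix-matrix kernels
    VectorXd x = lu.solve(b);             // solves Ax = b without forming A^-1

    std::cout << "relative residual: " << (A * x - b).norm() / b.norm() << std::endl;
}
Compiled with -O3 (plus -mfma/-fopenmp where available), this should stay within the few-seconds budget mentioned above even at the question's size.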

Solving large system of coupled differential equations

I have a system of coupled ordinary differential equations
dx/dt = (A + C_d(t) * B) * x,
where A and B are constant matrices and C_d is a diagonal coefficient matrix which smoothly varies depending on the current value of the integration variable.
The square matrices A and B are built up from smaller 60*60 upper triangular or zero matrices. The dimension of the full system is around 2500*2500. A and B are sparse with ~10% non-zero elements. The diagonal elements are negative or zero. The main (physical) constraint is that elements of x(t) are not allowed to become negative during integration.
Currently, I employ a ‘naïve’ step solver
x_(i+1) = A * x_i * dt_i + B * (C_d(t_i) * x_i) * dt_i + x_i
or in the CPU/GPU versions
def solve_CPU(nsteps, dt, c_d, x):
    for step in xrange(nsteps):
        x += (A.dot(x) + B.dot(x * c_d[step])) * dt[step]

def solve_GPU(m, n, nsteps, dt, c_d, cu_curr_x, cu_delta_x, cu_A, cu_B):
    for step in xrange(nsteps):
        cubl.gemv(trans='T', m=m, n=n, alpha=1.0, A=cu_A,
                  x=cu_curr_x, beta=0.0, y=cu_delta_x)
        cubl.gemv(trans='T', m=m, n=n, alpha=c_d[step], A=cu_B,
                  x=cu_curr_x, beta=1.0, y=cu_delta_x)
        cubl.axpy(alpha=dt[step], x=cu_delta_x, y=cu_curr_x)
and make use of the fact that the step sizes dt_i can be computed a priori in such a way that the elements of x always stay >= 0 during integration. Depending on the amount of approximation and the settings, the number of integration steps varies between 25k and 10M.
I have tried several methods to optimize performance on general purpose hardware:
(unknown) When using ODEPACK’s VODE solver, I do not know how to express the x>=0 constraint
(slowest) Dense BLAS 2 dot-product using Intel MKL
(medium) Dense BLAS using single precision cuBLAS on NVIDIA GPU
(fastest) SCIPY sparse module using CSR/CSC formats
The code is written in Python and has access to the libraries listed above via Anaconda, Numba, Accelerate, Numpy, etc. SCIPY's sparse BLAS routines are not properly linked to MKL in Anaconda, and Python wrappers around cuSPARSE are, to my knowledge, not available yet. I would know how to squeeze out a little more performance by directly interfacing to the cuSPARSE/C-MKL sparse dot product, but that's it. This exercise has to be solved dozens of times, again and again as models change, so performance is always an issue. I'm not an expert in this matter, so I don't know much about preconditioners, factorization theorems, etc., which brings me to my question:
Is there a more elegant or better way to solve such a linear-algebra task?

How to speed up GLM estimation?

I am using RStudio 0.97.320 (R 2.15.3) on Amazon EC2. My data frame has 200k rows and 12 columns.
I am trying to fit a logistic regression with approximately 1500 parameters.
R is using only 7% of the CPU, the machine has 60+ GB of memory, and the fit is still taking a very long time.
Here is the code:
glm.1.2 <- glm(formula = Y ~ factor(X1) * log(X2) * (X3 + X4 * (X5 + I(X5^2)) * (X8 + I(X8^2)) + ((X6 + I(X6^2)) * factor(X7))),
               family = binomial(logit), data = df[1:150000,])
Any suggestions to speed this up by a significant amount?
There are a couple of packages to speed up glm fitting. fastglm has benchmarks showing it to be even faster than speedglm.
You could also install a more performant BLAS library on your computer (as Ben Bolker suggests in comments), which will help any method.
Although a bit late, I can only second dickoa's suggestion to generate a sparse model matrix using the Matrix package and then feed it to the speedglm.wfit function. That works great ;-) This way, I was able to run a logistic regression on a 1e6 x 3500 model matrix in less than 3 minutes.
If your design matrix is not sparse, then you can also consider my package parglm. See this vignette for a comparison of computation times and further details. I also show a comparison of computation times on a related question.
One of the methods in the parglm function works like the bam function in mgcv. The method is described in detail in
Wood, S.N., Goude, Y. & Shaw S. (2015) Generalized additive models for large datasets. Journal of the Royal Statistical Society, Series C 64(1): 139-155.
One advantage of the method is that it can be implemented with a non-concurrent QR implementation and still do the computation in parallel. Another advantage is a potentially lower memory footprint. This is used in mgcv's bam function and could also be implemented here with a setup like the one in speedglm's shglm function.

TickCount() deprecated in Mac OS X 10.8

I am using TickCount() to determine the time difference between events or time required to run a certain piece of code. But it is deprecated in OS X 10.8.
Therefore, I need an alternative.
If you want to measure absolute time, use gettimeofday(). This gives you the date, e.g., "Thu Nov 22 07:48:52 UTC 2012". This is not always suitable for measuring differences between events because the time reported by gettimeofday() can jump forwards or backwards if the user changes the clock.
If you want to measure relative time, use mach_absolute_time(). This lets you measure the difference between two events, e.g., "15.410 s". It does not give absolute times, but it is always monotonic.
If you want to measure CPU time, use clock(). This is often, but not always, the way you measure the performance of a piece of code. It doesn't count time spent on I/O or the impact on system speed, so it should only be used when you know you are measuring something CPU-bound.
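As a rough sketch of the mach_absolute_time() route (the helper name ticks_to_ns is made up here; mach_timebase_info supplies the numer/denom factors needed to convert ticks to nanoseconds):
#include <cstdint>
#include <cstdio>
#include <mach/mach_time.h>

// Convert a tick difference from mach_absolute_time() into nanoseconds.
static uint64_t ticks_to_ns(uint64_t ticks)
{
    static mach_timebase_info_data_t tb = {0, 0};
    if (tb.denom == 0)
        mach_timebase_info(&tb);          // query the numer/denom scale once
    return ticks * tb.numer / tb.denom;   // note: can overflow for very long intervals
}

int main()
{
    uint64_t t0 = mach_absolute_time();
    // ... code to be timed ...
    uint64_t t1 = mach_absolute_time();
    std::printf("elapsed: %.3f ms\n", ticks_to_ns(t1 - t0) / 1e6);
}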
I'm surprised that TickCount() wasn't deprecated earlier. It's really an OS 9 and earlier thing.
While this API may not be suitable for new development, if you find yourself in need of an identical API, it can be re-implemented as follows:
uint32_t TickCount() {
    uint64_t mat = mach_absolute_time();
    uint32_t mul = 0x80d9594e;
    return ((((0xffffffff & mat) * mul) >> 32) + (mat >> 32) * mul) >> 23;
}
The above implementation was created through analysis of /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/CarbonCore.framework/Versions/A/CarbonCore, and was briefly unit-tested against the deprecated TickCount with LLDB by altering the registers returned by mach_absolute_time.
