How to calculate Arithmetic Intensity?

I have the following code snippet, for which I have to calculate the arithmetic intensity (AI).
const int N = 8192;
float a[N], b[N], c[N], d[N];
...
#pragma omp parallel for simd
for (int i = 0; i < N; i++)
{
    const float tmp_a = a[i];
    const float tmp_b = b[i];
    c[i] += tmp_a*tmp_b;
    d[i] = tmp_a+tmp_b;
}
Case 1: What will be the AI if tmp_a and tmp_b are in registers?
Case 2: What will be the AI if tmp_a and tmp_b are in RAM or cache?
I know AI is given as the number of floating-point operations divided by the number of bytes transferred. How does the number of bytes transferred depend on whether the data is stored in RAM, cache, or registers? What additional information do we need to calculate the maximum floating-point throughput achievable by this code?
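As a rough, hedged sketch of how the counting could go for Case 1 (tmp_a and tmp_b kept in registers, 4-byte floats, counting only the array traffic and ignoring write-allocate effects):

flops per iteration = 1 mul + 1 add (for c[i]) + 1 add (for d[i]) = 3
bytes per iteration = load a[i] + load b[i] + load c[i] + store c[i] + store d[i] = 5 * 4 = 20
AI = 3 / 20 = 0.15 flops/byte

If tmp_a and tmp_b had to be re-read from memory instead of staying in registers, the byte count per iteration would grow and the AI would drop accordingly.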

Related

OpenMP reduction on SSE2 vector

I want to compute the average of an image (3 channels of interest + 1 alpha channel we ignore here) for each channel using SSE2 intrinsics. I tried this:
__m128 average = _mm_setzero_ps();
#pragma omp parallel for reduction(+:average)
for (size_t k = 0; k < roi_out->height * roi_out->width * ch; k += ch)
{
    float *in = ((float *)temp) + k;
    average += _mm_load_ps(in);
}
But I get this error with GCC: "user-defined reduction not found for average".
Is that possible with SSE2? What's wrong?
Edit
This works:
float sum[4] = { 0.0f };
#pragma omp parallel for simd reduction(+:sum[:4])
for (size_t k = 0; k < roi_out->height * roi_out->width * ch; k += ch)
{
    float *in = ((float *)temp) + k;
    for (int i = 0; i < ch; ++i) sum[i] += in[i];
}
const __m128 average = _mm_load_ps(sum) / ((float)roi_out->height * roi_out->width);
You can declare a user-defined reduction like this:
#pragma omp declare reduction \
    (addps : __m128 : omp_out += omp_in) \
    initializer(omp_priv = _mm_setzero_ps())
And then use it like:
#pragma omp parallel for reduction(addps : average)
for (size_t k = 0; k < size * ch; k += ch)
{
    average += _mm_loadu_ps(data + k);
}
I think, most importantly, OpenMP needs to know how to get a neutral element (here _mm_setzero_ps()) for your reduction.
Full working example: https://godbolt.org/z/Fpqttc
Interesting link: http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-reduction.html#User-definedreductions
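For reference, here is a minimal self-contained sketch combining the declared reduction with a final horizontal division (the data layout, size, and values are made up for illustration; compile with -fopenmp):

#include <cstddef>
#include <cstdio>
#include <xmmintrin.h>

// Combine per-thread partial sums with _mm_add_ps; the neutral element is the zero vector.
#pragma omp declare reduction \
    (addps : __m128 : omp_out = _mm_add_ps(omp_out, omp_in)) \
    initializer(omp_priv = _mm_setzero_ps())

int main() {
    const std::size_t n = 1024;              // number of 4-float pixels (illustrative)
    static float data[n * 4];
    for (std::size_t i = 0; i < n * 4; ++i) data[i] = 1.0f;

    __m128 sum = _mm_setzero_ps();

    #pragma omp parallel for reduction(addps : sum)
    for (std::size_t k = 0; k < n * 4; k += 4)
        sum = _mm_add_ps(sum, _mm_loadu_ps(data + k));

    const __m128 average = _mm_div_ps(sum, _mm_set1_ps((float)n));

    float out[4];
    _mm_storeu_ps(out, average);
    std::printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}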

Speed up a C++ code implementing numerical integration

I have this C++ code that implements rectangular numerical integration:
#include <iostream>
#include <cmath>
using namespace std;

float pdf(float u){
    return (1/(pow(1+u, 2)));
}

float cdf(float u){
    return (1 - 1/(u+1));
}

// The main function that implements the numerical integration,
// and it is a recursive function
float integ(float h, int k, float du){
    float res = 0;
    if (k == 1){
        res = cdf(h);
    } else {
        float u = 0;
        while (u < h){
            res += integ(h - u, k - 1, du)*pdf(u)*du;
            u += du;
        }
    }
    return res;
}

int main(){
    float du = 0.0001;
    int K = 3;
    float gamma[4] = {0.31622777, 0.79432823,
                      1.99526231, 5.01187234};
    int G = 50;
    int Q = 2;
    for (int i = 0; i < 4; i++){
        if ((G-Q*(K-1)) > 0){
            float gammath = (gamma[i]/Q)*(G-Q*(K-1));
            cout << 1 - integ(gammath, K, du) << endl;
        }
    }
    return 0;
}
I am facing a speed problem, even though I switched from Python and MATLAB to C++ because C++ is faster. The problem is that I need a small step size du to get an accurate evaluation of the integral.
Basically, I want to evaluate the integral at 4 different points defined by gammath, which is a function of other defined parameters.
Is there any way I can speed up this program? I already have a 25x+ speedup over the same code in Python, but the code still takes too long (I ran it all night, and it wasn't finished in the morning). And this is only for K = 3 and G = 50; in other cases I want to test K = 10 and G = 100 or 300.
Thanks in advance for any tips.
What is behind your computation is that you take the K-fold convolution power of the pdf function and then integrate that power from 0 to h. Since you use Riemann sums for the integration, you effectively treat the pdf as a step function with steps of width du. In that case, the values of the convolution power can be computed as the coefficients in the power of a (truncated) power series/generating function
p(z) = pdf(0) + pdf(du)*z + pdf(2*du)*z^2 + ... + pdf(n*du)*z^n
where n*du > h. You can now compute this power via FFT-based algorithms. A more basic variant uses the fact that if q(z) = p(z)^K mod z^(n+1), then
p(z)*q'(z) = K*q(z)*p'(z) mod z^n
so that the coefficients of q can be computed via convolution sums from the coefficients p[j] = pdf(j*du) of p. Comparing the terms for the power z^(m-1) in the above formula gives, at the coefficient level,
sum p[m-j]*j*q[j] = K * sum q[j]*(m-j)*p[m-j], j=0..m
or solved for the new coefficient q[m] when the previous coefficients q[0],...,q[m-1] are already computed:
q[m] = 1/(m*p[0]) * sum (K*(m-j)-j)*p[m-j]*q[j], j=0..m-1
In code, assuming the coefficients p[0..n] with p[j] = pdf(j*du) have been precomputed, that gives
std::vector<float> q(n + 1);
q[0] = pow(p[0], K);
for (int m = 1; m <= n; m++) {
    q[m] = 0;
    for (int j = 0; j < m; j++) { q[m] += (K*(m-j) - j)*p[m-j]*q[j]; }
    q[m] /= m*p[0];
}
and then sum up for the result:
float res = q[0];
for (int j = 1; j*du < h; j++) { res += q[j]; }
res *= pow(du, K);
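Putting the pieces together, here is a hedged, self-contained sketch of the whole approach (function and variable names are mine; du is chosen coarser than the original 1e-4 because this basic O(n^2) variant, while already far cheaper than the O(n^(K-1)) nested recursion, still gets slow for very fine grids, where the FFT-based variant would be preferable):

#include <cstdio>
#include <cmath>
#include <vector>

float pdf(float u) { return 1.0f / ((1.0f + u) * (1.0f + u)); }

// Approximates the integral from 0 to h of the K-fold convolution power of pdf,
// using the coefficient recurrence for q(z) = p(z)^K described above.
double convolution_cdf(double h, int K, double du) {
    const int n = (int)(h / du) + 1;                    // grid chosen so that n*du > h
    std::vector<double> p(n + 1), q(n + 1);
    for (int j = 0; j <= n; j++) p[j] = pdf(j * du);

    q[0] = std::pow(p[0], K);
    for (int m = 1; m <= n; m++) {
        q[m] = 0.0;
        for (int j = 0; j < m; j++)
            q[m] += (K * (m - j) - j) * p[m - j] * q[j];
        q[m] /= m * p[0];
    }

    double res = q[0];
    for (int j = 1; j * du < h; j++) res += q[j];
    return res * std::pow(du, K);
}

int main() {
    const double du = 0.01;                             // coarser than the original 0.0001
    const int K = 3, G = 50, Q = 2;
    const double gamma[4] = {0.31622777, 0.79432823, 1.99526231, 5.01187234};
    for (int i = 0; i < 4; i++) {
        if (G - Q * (K - 1) > 0) {
            double gammath = (gamma[i] / Q) * (G - Q * (K - 1));
            std::printf("%g\n", 1.0 - convolution_cdf(gammath, K, du));
        }
    }
    return 0;
}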

Minimum compression time for a range of files

This is a bit of an algorithmic problem, or maybe an optimization one, or dynamic programming.
Let's say we have N files to compress.
The average compression ratio is L.
The compression time of a file depends on two factors:
1. The size of the file currently being processed, and
2. The memory space left in the system (total = M; occupied = the sum of the sizes of the compressed and not-yet-compressed files).
So
t(i) = K * s(i) / (M - L*(s(1)+s(2)+....+s(i)) - (s(i+1) + s(i+2) + .....+ s(n)))
where s(i) is the size of the ith file and t(i) is the time taken to compress the ith file.
What I have to do is calculate the optimal order in which to compress the files so that the total time required is minimal. How do I compute that order?
It seems that the best approach is to sort the files by size and process them in that order. This greedy approach may be explained as "compress small files first to avoid compressing them after big ones".
A possible justification: if we have two files A, B such that size(A) <= size(B), we can show that compressing A first is no worse, i.e. that
t(A,B) <= t(B,A)
A/M + B/(M - L*A) <= B/M + A/(M - L*B)
A*(1/M - 1/(M - L*B)) <= B*(1/M - 1/(M - L*A))
-A*B*L/(M*(M - L*B)) <= -A*B*L/(M*(M - L*A))
1/(M - L*B) >= 1/(M - L*A)    (dividing by -A*B*L/M flips the inequality)
M - L*B <= M - L*A
B >= A
which is exactly our assumption, so the first inequality holds too (if I didn't fail somewhere :D).
Sorting gives us the guarantee of A <= B for every adjacent pair of files.
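As a quick numeric sanity check with the numbers from the test below (M = 80, L = 0.5) and two of the file sizes, A = 1 and B = 7 (using the simplified two-file expression above, which ignores the other files):

t(A,B) = 1/80 + 7/(80 - 0.5*1) = 0.01250 + 0.08805 = 0.10055
t(B,A) = 7/80 + 1/(80 - 0.5*7) = 0.08750 + 0.01307 = 0.10057

so compressing the smaller file first is indeed (slightly) cheaper.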
I wrote an O(N!) brute force for N <= 10, and it gives sorted arrays for every test I can think of.
Test input (first line: N, L, M, K; second line: the N file sizes):
8 0.5 80.0 1.0
7 1 6 3 4 5 6 5
result :
0.515769
1 3 4 5 5 6 6 7
#include <iostream>
#include <algorithm>
using namespace std;

// brute force; will be slow for cnt > 10 because 10! = 3628800
int perm[] = {0,1,2,3,4,5,6,7,8,9};
int bestPerm[10];
double sizes[10];

double calc(int cnt, double L, double M, double K, double T) {
    double res = 0.0, usedMemory = 0.0;
    for (int i = 0; i < cnt; i++) {
        int ind = perm[i];
        res += K * sizes[ind] / (M - L * usedMemory - (T - usedMemory));
        usedMemory += sizes[ind];
    }
    return res;
}

int main() {
    int cnt;
    double L, M, K, T = 0.0;
    cin >> cnt >> L >> M >> K;
    for (int i = 0; i < cnt; i++)
        cin >> sizes[i], T += sizes[i];
    double bruteRes = 1e16;
    int bruteCnt = 1;
    for (int i = 2; i <= cnt; i++)
        bruteCnt *= i;
    for (int i = 0; i < bruteCnt; i++) {
        double curRes = calc(cnt, L, M, K, T);
        if (bruteRes > curRes) {
            bruteRes = curRes;
            for (int j = 0; j < cnt; j++)
                bestPerm[j] = perm[j];
        }
        next_permutation(perm, perm + cnt);
    }
    cout << bruteRes << "\n";
    for (int i = 0; i < cnt; i++)
        cout << sizes[bestPerm[i]] << " ";
    cout << "\n";
    return 0;
}
Updated implementation for the case when L is different for every file: pastebin (it seems that the brute force prefers to sort them by descending order of compression ratio L[i], using the smaller files first when L is equal).
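To make the greedy recommendation concrete, here is a small hedged sketch that evaluates the sorted (smallest-first) order with the same cost model as the calc() function above, using the sizes from the test case:

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // Parameters from the test case above: L, M, K and the file sizes.
    const double L = 0.5, M = 80.0, K = 1.0;
    std::vector<double> s = {7, 1, 6, 3, 4, 5, 6, 5};

    std::sort(s.begin(), s.end());                      // greedy: smallest file first

    double T = 0.0;
    for (double x : s) T += x;                          // total uncompressed size

    // Free memory = M minus the L-scaled compressed prefix minus everything
    // not yet compressed (including the current file), exactly as in calc().
    double usedMemory = 0.0, time = 0.0;
    for (double x : s) {
        time += K * x / (M - L * usedMemory - (T - usedMemory));
        usedMemory += x;
    }
    std::printf("total time for the sorted order = %f\n", time);
    return 0;
}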
Suppose you have a schedule that claims to be optimal. Consider any file and the one processed just after it. If you could improve the schedule by swapping them, it couldn't be optimal. So if you can show that it is always best to process a small file before a large one when the two are side by side, then you can show that the best schedule is in sorted order with the smallest files first, because you can improve any other schedule.
Because you are just swapping two adjacent files, the times taken to process the files before and after these two are not changed: the same amount of memory is available before and after. You might as well scale the problem so that one of the files has size one unit. Supposing that you have a total of K units of memory free before the first file, and that the second file has size x units with a compression ratio of 1:L, you end up with something like 1/K + x/(K+L) - x/K - 1/(K - xL) as the difference in compression times due to this pair of files. My algebra is horribly error-prone, but I think this boils down to something like L^2*x*(1-x) over something complicated but positive, which shows that for a pair of files you always want to compress the short one first. So, by what I said earlier, the best schedule is in sorted order with the shortest file first.

select a group of pairs in order to minimize rms of group

Simplified problem
I have ~40 resistors (all the same value +-5%) and I need to select 12 of them so that they are as similar as possible.
Solution: I sort them by value and take the 12 consecutive ones whose values have the smallest RMS deviation from their mean.
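A minimal hedged sketch of that sliding-window selection (the random test values and all names are mine):

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int N = 40, PICK = 12;
    std::mt19937 rng(1);
    std::uniform_real_distribution<double> dist(0.95, 1.05);   // nominal 1 ohm +-5%
    std::vector<double> r(N);
    for (double &x : r) x = dist(rng);
    std::sort(r.begin(), r.end());

    // Slide a window of 12 consecutive values and keep the one with the
    // smallest RMS deviation from the window mean.
    int best_start = 0;
    double best_rms = 1e30;
    for (int s = 0; s + PICK <= N; ++s) {
        double mean = 0.0;
        for (int i = s; i < s + PICK; ++i) mean += r[i];
        mean /= PICK;
        double sq = 0.0;
        for (int i = s; i < s + PICK; ++i) sq += (r[i] - mean) * (r[i] - mean);
        double rms = std::sqrt(sq / PICK);
        if (rms < best_rms) { best_rms = rms; best_start = s; }
    }
    std::printf("best window starts at index %d, rms = %g\n", best_start, best_rms);
    return 0;
}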
The actual problem
I have ~40 resistors (all the same value +-5%) and I have to choose 12 pairs of them so that the resistance of the pairs is as similar as possible.
Notes
The resistance of the pair (R1,R2) is R1+R2.
I do not really care about the programming language, but let's say that I'm looking for a solution in C++ or Python, the two languages I'm most familiar with.
This gives reasonably good results (in MATLAB):
a = ones(40,1) + rand(40,1)*0.1-0.05; % The resistors
vec = zeros(40,2);                    % Initialize matrix
indices = zeros(40,2);                % Initialize matrix
a = sort(a);                          % Sort vector of resistors
for ii = 1:length(a)
    vec(ii,:) = [a(ii) a(ii)];        % Assign resistor values to row ii of vec
    indices(ii,:) = [ii,ii];          % Corresponding resistor number (index)
    for jj = 1:length(a)
        if sum(abs((a(ii)+a(jj))-2*mean(a))) < abs(sum(vec(ii,:))-2*mean(a))
            vec(ii,:) = [a(ii) a(jj)];   % Check if the new set is better than the
            indices(ii,:) = [ii, jj];    % previous, and update vec and indices if true.
        end
    end
end
[x, idx] = sort(sum(vec')');          % Sort the sum of the pairs
final_list = indices(idx);            % The indices of the sorted pairs
This is the result when I plot it: [plot not reproduced here]
This is not optimal but should give somewhat decent results. It's very fast though so if you ever need to choose 1000 pairs out of 10000 resistors...
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <time.h>

#define GROUPS 12
#define N 40

// compare two floats for qsort
int compare(const void *a, const void *b)
{
    float fa = *(const float *)a, fb = *(const float *)b;
    return (fa > fb) - (fa < fb);
}

int main()
{
    // generate random numbers
    float *values = (float *)malloc(sizeof(float) * N);
    srand(time(0));
    for (int i = 0; i < N; i++)
        values[i] = 950 + rand() % 101;
    qsort(values, N, sizeof(float), compare);

    // find "best" pairing
    float bestrms = -1;
    int beststart = -1;
    float bestmean = -1;
    for (int start = 0; start <= N - 2 * GROUPS; start++)
    {
        float sum = 0;
        for (int i = start; i < start + 2 * GROUPS; i++)
            sum += values[i];
        float mean = sum / GROUPS;   // mean pair sum: 2*GROUPS values form GROUPS pairs
        float square = 0;
        for (int i = 0; i < GROUPS; i++)
        {
            // in a sorted sequence of 24 resistors, always pair the 1st with the 24th,
            // the 2nd with the 23rd, etc.
            float first = values[start + i];
            float second = values[start + 2 * GROUPS - 1 - i];
            float err = mean - (first + second);
            square += err * err;
        }
        float rms = sqrt(square / GROUPS);
        if (bestrms == -1 || rms < bestrms)
        {
            bestrms = rms;
            beststart = start;
            bestmean = mean;
        }
    }
    for (int i = 0; i < GROUPS; i++)
    {
        float first = values[beststart + i];
        float second = values[beststart + 2 * GROUPS - 1 - i];
        float err = bestmean - (first + second);
        printf("(%f, %f) %f %f\n", first, second, first + second, err);
    }
    printf("mean %f rms %f\n", bestmean, bestrms);
    free(values);
}
Sort them and then pair 1 with 2, 3 with 4, 5 with 6 and so on. Find the difference between each pair and sort again, choosing the 12 with the least difference.
1. Sort them by resistance.
2. Pair 1 with 40, 2 with 39, etc., compute R1+R2 for each pair, and pick the best set of 12 pairs (needs another sorting step). Compute the mean of all selected (R1+R2).
3. Try to refine this initial solution successively by trying to plug in one of the remaining 16 resistors for one of the 24 chosen ones. An attempt is successful if the combined resistance of the new pair is closer to the mean than the combined resistance of the old pair. Repeat this step until you can't find any further improvement.
This solution will definitely not always compute the optimal solution, but it might be good enough; a rough sketch is given below. Another idea would be simulated annealing, but that would be a lot more work and still not guarantee finding the best solution.
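Here is a rough C++ sketch of that heuristic (sort, pair smallest with largest, keep 12 pairs, then greedily swap in leftover resistors). All names are mine, the random test data is made up, and the "best set of 12 pairs" is chosen here as the pairs whose sums are closest to the median pair sum, which is only one possible reading of step 2:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int N = 40, PAIRS = 12;
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> dist(0.95, 1.05);   // nominal 1 ohm +-5%
    std::vector<double> r(N);
    for (double &x : r) x = dist(rng);
    std::sort(r.begin(), r.end());

    // Step 1: pair the smallest with the largest, 2nd smallest with 2nd largest, ...
    struct Pair { int lo, hi; double sum; };
    std::vector<Pair> pairs;
    for (int i = 0; i < N / 2; ++i)
        pairs.push_back({i, N - 1 - i, r[i] + r[N - 1 - i]});

    // Step 2: keep the 12 pairs whose sums are closest to the median pair sum.
    std::vector<double> sums;
    for (const Pair &p : pairs) sums.push_back(p.sum);
    std::nth_element(sums.begin(), sums.begin() + sums.size() / 2, sums.end());
    const double target = sums[sums.size() / 2];
    std::sort(pairs.begin(), pairs.end(), [&](const Pair &a, const Pair &b) {
        return std::fabs(a.sum - target) < std::fabs(b.sum - target);
    });
    pairs.resize(PAIRS);

    double mean = 0.0;
    for (const Pair &p : pairs) mean += p.sum;
    mean /= PAIRS;

    // Step 3: greedy refinement -- swap an unused resistor into a pair whenever that
    // moves the pair's sum closer to the mean; repeat until nothing improves.
    std::vector<bool> used(N, false);
    for (const Pair &p : pairs) used[p.lo] = used[p.hi] = true;
    bool improved = true;
    while (improved) {
        improved = false;
        for (Pair &p : pairs) {
            for (int k = 0; k < N; ++k) {
                if (used[k]) continue;
                int *members[2] = {&p.lo, &p.hi};
                for (int *m : members) {
                    double candidate = p.sum - r[*m] + r[k];
                    if (std::fabs(candidate - mean) < std::fabs(p.sum - mean)) {
                        used[*m] = false;
                        used[k] = true;
                        *m = k;
                        p.sum = candidate;
                        improved = true;
                        break;              // resistor k is used now; try the next one
                    }
                }
            }
        }
    }

    for (const Pair &p : pairs)
        std::printf("(%f, %f) sum %f\n", r[p.lo], r[p.hi], p.sum);
    return 0;
}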

A Cache Efficient Matrix Transpose Program?

So the obvious way to transpose a matrix is to use:
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        destination[j+i*n] = source[i+j*n];
but I want something that will take advantage of locality and cache blocking. I was looking it up and can't find code that would do this, but I'm told it should be a very simple modification of the original. Any ideas?
Edit: I have a 2000x2000 matrix, and I want to know how I can change the code, using two for loops, to basically split the matrix into blocks that I transpose individually (say 2x2 blocks, or 40x40 blocks) and see which block size is most efficient.
Edit2: The matrices are stored in column major order, that is to say for a matrix
a1 a2
a3 a4
is stored as a1 a3 a2 a4.
You're probably going to want four loops - two to iterate over the blocks, and then another two to perform the transpose-copy of a single block. Assuming for simplicity a block size that divides the size of the matrix, something like this I think, although I'd want to draw some pictures on the backs of envelopes to be sure:
for (int i = 0; i < n; i += blocksize) {
    for (int j = 0; j < n; j += blocksize) {
        // transpose the block beginning at [i,j]
        for (int k = i; k < i + blocksize; ++k) {
            for (int l = j; l < j + blocksize; ++l) {
                dst[k + l*n] = src[l + k*n];
            }
        }
    }
}
An important further insight is that there's actually a cache-oblivious algorithm for this (see http://en.wikipedia.org/wiki/Cache-oblivious_algorithm, which uses this exact problem as an example). The informal definition of "cache-oblivious" is that you don't need to experiment tweaking any parameters (in this case the blocksize) in order to hit good/optimal cache performance. The solution in this case is to transpose by recursively dividing the matrix in half, and transposing the halves into their correct position in the destination.
Whatever the cache size actually is, this recursion takes advantage of it. I expect there's a bit of extra management overhead compared with your strategy, which is to use performance experiments to, in effect, jump straight to the point in the recursion at which the cache really kicks in, and go no further. On the other hand, your performance experiments might give you an answer that works on your machine but not on your customers' machines.
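For illustration, here is a minimal C++ sketch of that recursive halving (this is my own hedged sketch, not code from the answer; the Java version from another answer further below follows the same idea):

#include <cstddef>

// Transpose src (n x n, row-major) into dst by recursively splitting the larger
// dimension of the current block in half; 16 is just a reasonable base-case size.
void transpose_rec(double *dst, const double *src, std::size_t n,
                   std::size_t r0, std::size_t r1, std::size_t c0, std::size_t c1) {
    if (r1 - r0 <= 16 && c1 - c0 <= 16) {
        for (std::size_t i = r0; i < r1; ++i)
            for (std::size_t j = c0; j < c1; ++j)
                dst[j * n + i] = src[i * n + j];
    } else if (r1 - r0 >= c1 - c0) {
        std::size_t rm = r0 + (r1 - r0) / 2;
        transpose_rec(dst, src, n, r0, rm, c0, c1);
        transpose_rec(dst, src, n, rm, r1, c0, c1);
    } else {
        std::size_t cm = c0 + (c1 - c0) / 2;
        transpose_rec(dst, src, n, r0, r1, c0, cm);
        transpose_rec(dst, src, n, r0, r1, cm, c1);
    }
}

// usage: transpose_rec(dst, src, n, 0, n, 0, n);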
I had the exact same problem yesterday.
I ended up with this solution:
void transpose(double *dst, const double *src, size_t n, size_t p) noexcept {
    THROWS();   // project-specific macro from the original code; drop it if you don't have it
    size_t block = 32;
    for (size_t i = 0; i < n; i += block) {
        for (size_t j = 0; j < p; ++j) {
            for (size_t b = 0; b < block && i + b < n; ++b) {
                dst[j*n + i + b] = src[(i + b)*p + j];
            }
        }
    }
}
This is 4 times faster than the obvious solution on my machine.
This solution takes care of a rectangular matrix with dimensions which are not a multiple of the block size.
If dst and src are the same square matrix, an in-place function should really be used instead:
void transpose(double *m, size_t n) noexcept {
    size_t block = 0, size = 8;
    for (block = 0; block + size - 1 < n; block += size) {
        for (size_t i = block; i < block + size; ++i) {
            for (size_t j = i + 1; j < block + size; ++j) {
                std::swap(m[i*n + j], m[j*n + i]);
            }
        }
        for (size_t i = block + size; i < n; ++i) {
            for (size_t j = block; j < block + size; ++j) {
                std::swap(m[i*n + j], m[j*n + i]);
            }
        }
    }
    for (size_t i = block; i < n; ++i) {
        for (size_t j = i + 1; j < n; ++j) {
            std::swap(m[i*n + j], m[j*n + i]);
        }
    }
}
I used C++11, but this could easily be translated into other languages.
Instead of transposing the matrix in memory, why not collapse the transposition operation into the next operation you're going to do on the matrix?
Steve Jessop mentioned a cache-oblivious matrix transpose algorithm.
For the record, I want to share a possible implementation of a cache-oblivious matrix transpose.
public class Matrix {
    protected double data[];
    protected int rows, columns;

    public Matrix(int rows, int columns) {
        this.rows = rows;
        this.columns = columns;
        this.data = new double[rows * columns];
    }

    public Matrix transpose() {
        Matrix C = new Matrix(columns, rows);
        cachetranspose(0, rows, 0, columns, C);
        return C;
    }

    public void cachetranspose(int rb, int re, int cb, int ce, Matrix T) {
        int r = re - rb, c = ce - cb;
        if (r <= 16 && c <= 16) {
            for (int i = rb; i < re; i++) {
                for (int j = cb; j < ce; j++) {
                    T.data[j * rows + i] = data[i * columns + j];
                }
            }
        } else if (r >= c) {
            cachetranspose(rb, rb + (r / 2), cb, ce, T);
            cachetranspose(rb + (r / 2), re, cb, ce, T);
        } else {
            cachetranspose(rb, re, cb, cb + (c / 2), T);
            cachetranspose(rb, re, cb + (c / 2), ce, T);
        }
    }
}
More details on cache oblivious algorithms can be found here.
Matrix multiplication comes to mind, but the cache issue there is much more pronounced, because each element is read N times.
With matrix transpose, you are reading in a single linear pass and there's no way to optimize that. But you can simultaneously process several rows so that you write several columns and so fill complete cache lines. You will only need three loops.
Or do it the other way around and read in columns while writing linearly.
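A hedged sketch of that three-loop structure (essentially the same shape as the 32-wide block function shown earlier; ROWS_AT_ONCE is an assumed tuning parameter and the matrices are taken to be square and row-major):

#include <cstddef>

// Process ROWS_AT_ONCE source rows together so that each inner step writes
// ROWS_AT_ONCE consecutive destination elements (roughly one cache line's worth).
void transpose_rows(double *dst, const double *src, std::size_t n) {
    const std::size_t ROWS_AT_ONCE = 8;     // assumed: 8 doubles = one 64-byte line
    for (std::size_t i = 0; i < n; i += ROWS_AT_ONCE)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t b = 0; b < ROWS_AT_ONCE && i + b < n; ++b)
                dst[j * n + i + b] = src[(i + b) * n + j];
}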
With a large matrix, possibly a large sparse matrix, it might be an idea to decompose it into smaller cache-friendly chunks (say, 4x4 sub-matrices). You can also flag sub-matrices as identity, which will help you create optimized code paths.
