It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 12 years ago.
What is overhead? Are there multiple types of overhead, or only one? What are some examples?
The business meaning of overhead cost explains it best. From Wikipedia:
The term overhead is usually used to
group expenses that are necessary to
the continued functioning of the
business, but cannot be immediately
associated with the products/services
being offered1 (e.g. do not directly
generate profits).
Overhead is a "cost" you will incur to be able to perform an operation; you need to "invest" some resource to perform the operation in question.
Overhead is any usage of a particular resource that's a side-effect of what you're actually trying to achieve. e.g. Struct padding is a form of memory overhead. Pushing and popping arguments on the stack is a form of processing overhead. Packet headers are a form of bandwidth overhead. Think of a resource, it can have an overhead associated with it.
Here's an example of size overhead for structs and classes:
struct first {
char letter1;
int number;
char letter2;
};
struct second {
int number;
char letter1;
char letter2;
};
int main ()
{
cout << "Size of first: " << sizeof(first) << endl;
cout << "Size of second: " << sizeof(second) << endl;
return 0;
}
The result is:
Size of first: 12
Size of second: 8
The compiler must build a struct to be word-aligned. In the first struct, the surrounding char's (one byte each) cause the compiler to "push" the int down so that it can be accessed as a full word (four bytes). The second struct doesn't require nearly as much pushing.
Moral of the story: place similar-size data members next to each other.
Here's an example of time overhead, related to better use of locality to exploit the cache:
#include <stdio.h>
#define SIZE 1024
double A[SIZE][SIZE], B[SIZE][SIZE], C[SIZE][SIZE];
int main ()
{
int i, j, k;
for (i = 0; i < SIZE; i++) {
for (j = 0; j < SIZE; j++) {
for (k = 0; k < SIZE; k++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
return 0;
}
Running this on my machine takes this much time:
real 0m35.137s
user 0m34.996s
sys 0m0.067s
Now I'll swap the j and k loop iterations:
#include <stdio.h>
#define SIZE 1024
double A[SIZE][SIZE], B[SIZE][SIZE], C[SIZE][SIZE];
int main ()
{
int i, j, k;
for (i = 0; i < SIZE; i++) {
for (k = 0; k < SIZE; k++) { // this is the only change
for (j = 0; j < SIZE; j++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
return 0;
}
The runtime for this is:
real 0m5.489s
user 0m5.436s
sys 0m0.040s
It's much faster because the loop iterations are more in-line with the order of the array indices. Thus, the data is more likely to be accessed consecutively, and thus more likely to be available in cache.
Related
Although it is known that using nested std::vector to represent matrices is a bad idea, let's use it for now since it is flexible and many existing functions can handle std::vector.
I thought, in small cases, the speed difference can be ignored. But it turned out that vector<vector<double>> is 10+ times slower than numpy.dot().
Let A and B be matrices whose size is sizexsize. Assuming square matrices is just for simplicity. (We don't intend to limit discussion to the square matrices case.) We initialize each matrix in a deterministic way, and finally calculate C = A * B.
We define "calculation time" as the time elapsed just to calculate C = A * B. In other words, various overheads are not included.
Python3 code
import numpy as np
import time
import sys
if (len(sys.argv) != 2):
print("Pass `size` as an argument.", file = sys.stderr);
sys.exit(1);
size = int(sys.argv[1]);
A = np.ndarray((size, size));
B = np.ndarray((size, size));
for i in range(size):
for j in range(size):
A[i][j] = i * 3.14 + j
B[i][j] = i * 3.14 - j
start = time.time()
C = np.dot(A, B);
print("{:.3e}".format(time.time() - start), file = sys.stderr);
C++ code
using namespace std;
#include <iostream>
#include <vector>
#include <chrono>
int main(int argc, char **argv) {
if (argc != 2) {
cerr << "Pass `size` as an argument.\n";
return 1;
}
const unsigned size = atoi(argv[1]);
vector<vector<double>> A(size, vector<double>(size));
vector<vector<double>> B(size, vector<double>(size));
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
A[i][j] = i * 3.14 + j;
B[i][j] = i * 3.14 - j;
}
}
auto start = chrono::system_clock::now();
vector<vector<double>> C(size, vector<double>(size, /* initial_value = */ 0));
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
for (int k = 0; k < size; ++k) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
cerr << scientific;
cerr.precision(3);
cerr << chrono::duration<double>(chrono::system_clock::now() - start).count() << "\n";
}
C++ code (multithreaded)
We also wrote a multithreaded version of C++ code since numpy.dot() is automatically calculated in parallel.
You can get all the codes from GitHub.
Result
C++ version is 10+ times slower than Python 3 (with numpy) version.
matrix_size: 200x200
--------------- Time in seconds ---------------
C++ (not multithreaded): 8.45e-03
C++ (1 thread): 8.66e-03
C++ (2 threads): 4.68e-03
C++ (3 threads): 3.14e-03
C++ (4 threads): 2.43e-03
Python 3: 4.07e-04
-----------------------------------------------
matrix_size: 400x400
--------------- Time in seconds ---------------
C++ (not multithreaded): 7.011e-02
C++ (1 thread): 6.985e-02
C++ (2 threads): 3.647e-02
C++ (3 threads): 2.462e-02
C++ (4 threads): 1.915e-02
Python 3: 1.466e-03
-----------------------------------------------
Question
Is there any way to make the C++ implementation faster?
Optimizations I Tried
swap calculation order -> at most 3.5 times faster (not than numpy code but than C++ code)
optimization 1 plus partial unroll -> at most 4.5 times faster, but this can be done only when size is known in advance No. As pointed out in this comment, size is not needed to be known. We can just limit the max value of loop variables of unrolled loops and process remaining elements with normal loops. See my implementation for example.
optimization 2, plus minimizing the call of C[i][j] by introducing a simple variable sum -> at most 5.2 times faster. The implementation is here. This result implies std::vector::operator[] is un-ignorably slow.
optimization 3, plus g++ -march=native flag -> at most 6.2 times faster (By the way, we use -O3 of course.)
Optimization 3, plus reducing the call of operator [] by introducing a pointer to an element of A since A's elements are sequentially accessed in the unrolled loop. -> At most 6.2 times faster, and a little little bit faster than Optimization 4. The code is shown below.
g++ -funroll-loops flag to unroll for loops -> no change
g++ #pragma GCC unroll n -> no change
g++ -flto flag to turn on link time optimizations -> no change
Block Algorithm -> no change
transpose B to avoid cache miss -> no change
long linear std::vector instead of nested std::vector<std::vector>, swap calculation order, block algorithm, and partial unroll -> at most 2.2 times faster
Optimization 1, plus PGO(profile-guided optimization) -> 4.7 times faster
Optimization 3, plus PGO -> same as Optimization 3
Optimization 3, plus g++ specific __builtin_prefetch() -> same as Optimization 3
Current Status
(originally) 13.06 times slower -> (currently) 2.10 times slower
Again, you can get all the codes on GitHub. But let us cite some codes, all of which are functions called from the multithreaded version of C++ code.
Original Code (GitHub)
void f(const vector<vector<double>> &A, const vector<vector<double>> &B, vector<vector<double>> &C, unsigned row_start, unsigned row_end) {
const unsigned j_max = B[0].size();
const unsigned k_max = B.size();
for (int i = row_start; i < row_end; ++i) {
for (int j = 0; j < j_max; ++j) {
for (int k = 0; k < k_max; ++k) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
}
Current Best Code (GitHub)
This is the implementation of the Optimization 5 above.
void f(const vector<vector<double>> &A, const vector<vector<double>> &B, vector<vector<double>> &C, unsigned row_start, unsigned row_end) {
static const unsigned num_unroll = 5;
const unsigned j_max = B[0].size();
const unsigned k_max_for_unrolled_loop = B.size() / num_unroll * num_unroll;
const unsigned k_max = B.size();
for (int i = row_start; i < row_end; ++i) {
for (int k = 0; k < k_max_for_unrolled_loop; k += num_unroll) {
for (int j = 0; j < j_max; ++j) {
const double *p = A[i].data() + k;
double sum;
sum = *p++ * B[k][j];
sum += *p++ * B[k+1][j];
sum += *p++ * B[k+2][j];
sum += *p++ * B[k+3][j];
sum += *p++ * B[k+4][j];
C[i][j] += sum;
}
}
for (int k = k_max_for_unrolled_loop; k < k_max; ++k) {
const double a = A[i][k];
for (int j = 0; j < j_max; ++j) {
C[i][j] += a * B[k][j];
}
}
}
}
We've tried many optimizations since we first posted this question. We spent whole two days struggling with this problem, and finally reached the point where we have no more idea how to optimize the current best code. We doubt more complex algorithms like Strassen's will do it better since cases we handle are not large and each operation on std::vector is so expensive that, as we've seen, just reducing the call of [] improved the performance well.
We (want to) believe we can make it better, though.
Matrix multiplication is relativly easy to optimize. However if you want to get to decent cpu utilization it becomes tricky because you need deep knowledge of the hardware you are using. The steps to implement a fast matmul kernel are the following:
Use SIMDInstructions
Use Register Blocking and fetch multiple data at once
Optimize for your chache lines (mainly L2 and L3)
Parallelize your code to use multiple threads
Under this linke is a very good ressource, that explains all the nasty details:
https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0
If you want more indepth advise leave a comment.
In the textbook Computer Systems: a Programmer's Perspective there are some impressive benchmarks for optimizing row-major order access.
I created a small program to test for myself if a simple change from row-major access to column-major access would make a huge difference on my own machine.
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#define N 30000
int a[N][N] = { 0 };
int main() {
srand(time(NULL));
int sum = 0;
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
a[i][j] = rand() % 99;
}
}
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
sum += a[i][j];
}
}
}
On average row-major order access took 8.42s (n=5 trials) on my system whereas column-major order access took 30.12s (n=5 trials) on my system which is pretty significant.
It seems like on the surface like it should be a pretty simple thing to optimize.
Why don't modern compilers optimize these scenarios?
Most loops do not consist of simple sum statements, but have side effects and dependencies between loop iterations.
Not all operations you may do in a loop are commutative, so the optimizer would have to actually understand all the operations happening as part of a loop to ensure it doesn't change its meaning, including the contents of any system API called, code in dynamically loaded libraries, etc.
Now this is just a guess, but I expect someone tried it out, realized that the optimization didn't have enough information about the code being run to trigger most times, and then went to focus on parallel execution optimizations, which are probably the greater optimization opportunities in most codebases.
I would like to know if these 2 codes are the same for performance respecting the variable declarations:
int Value;
for(int i=0; i<1000; i++)
{ Value = i;
}
or
for(int i=0; i<1000; i++)
{ int Value = i;
}
Basically I need to know if the process time to create the variable Value and allocate it in Ram is just once in the first case and if it is, or not, repeated 1000 times in the second.
If you are Programming in c++ or c#, there will be no runtime difference since no implicite initialization will be done for simple int type.
Here is my question about openacc.
I read the APIs (v1 and v2), and the behavior of nested data environment with different subparts of the same array is unclear to me.
Code example:
#pragma acc data pcopyin(a[0:20])
{
#pragma acc data pcopyin(a[100:20])
{
#pragma acc parallel loop
for(i=0; i<20; i++)
a[i] = i;
a[i+100] = i;
}
}
My understanding is that this should work (or at leaste the two acc data parts):
The first pragma checks if a[0,20] is on the accelerator
NO -> data are allocated on the device and transferred
The second pragma checks if a[100,120] is on the accelerator
The pointer a is on the accelerator, but not the data from a[100,120]
The data are allocated on the device and transferred
I tried this kind of thing with CAPS compiler (v3.3.0 which is the only available right now on my test machine), and the second pragma acc data returns me an error (my second subarray don't have the correct shape).
So what happens with my test (I suppose) is that the pointer "a" was found on the accelerator, but the shape associated with it ([0:20]) is not the same in my second pragma ([100:20]).
Is this the normal behavior planned in the API, or should my example work?
Moreover, if this is supposed to work, is there some sort of coherence between the subparts of the same array (somehow, they will be positionned like on the host and I will be able to put a[i] += a[100+i] in my kernel)?
The present test will be looking if "a" is on the device. Hence, when the second data region is encountered, "a" is already on the device but only partially. Instead, a better method would be to add a pointer to point into "a" and reference this pointer on the device. Something like:
#include <stdio.h>
int main () {
int a[200];
int *b;
int i;
for(i=0; i<200; i++) a[i] = 0;
b=a+100;
#pragma acc data pcopy(a[0:20])
{
#pragma acc data pcopy(b[0:20])
{
#pragma acc parallel loop
for(i=0; i<20; i++) {
a[i] = i;
b[i] = i;
}
}
}
for(i=0; i<22; i++) printf("%d = %d \n", i, a[i]);
for(i=100; i<122; i++) printf("%d = %d \n", i, a[i]);
return 0;
}
If you had just copied "a[100:20]", then accessing outside this range would be considered a programmer error.
Hope this helps,
Mat
I have written the following simple C++ code.
#include <iostream>
#include <omp.h>
int main()
{
int myNumber = 0;
int numOfHits = 0;
cout << "Enter my Number Value" << endl;
cin >> myNumber;
#pragma omp parallel for reduction(+:numOfHits)
for(int i = 0; i <= 100000; ++i)
{
for(int j = 0; j <= 100000; ++j)
{
for(int k = 0; k <= 100000; ++k)
{
if(i + j + k == myNumber)
numOfHits++;
}
}
}
cout << "Number of Hits" << numOfHits << endl;
return 0;
}
As you can see I use OpenMP to parallelize the outermost loop. What I would like to do is to rewrite this small code in CUDA. Any help will be much appreciated.
Well, I can give you a quick tutorial, but I won't necessarily write it all for you.
So first of all, you will want to get MS Visual Studio set up with CUDA, which is easy following this guide: http://www.ademiller.com/blogs/tech/2011/05/visual-studio-2010-and-cuda-easier-with-rc2/
Now you will want to read The NVIDIA CUDA Programming Guide (free pdf), documentation, and CUDA by Example (A book I highly recommend for learning CUDA).
But let's say you haven't done that yet, and definitely will later.
This is an extremely arithmetic heavy and data-light computation - actually it can be computed without this brute force method fairly simply, but that isn't the answer you are looking for. I suggest something like this for the kernel:
__global__ void kernel(int* myNumber, int* numOfHits){
//a shared value will be stored on-chip, which is beneficial since this is written to multiple times
//it is shared by all threads
__shared__ int s_hits = 0;
//this identifies the current thread uniquely
int i = (threadIdx.x + blockIdx.x*blockDim.x);
int j = (threadIdx.y + blockIdx.y*blockDim.y);
int k = 0;
//we increment i and j by an amount equal to the number of threads in one dimension of the block, 16 usually, times the number of blocks in one dimension, which can be quite large (but not 100,000)
for(; i < 100000; i += blockDim.x*gridDim.x){
for(; j < 100000; j += blockDim.y*gridDim.y){
//Thanks to talonmies for this simplification
if(0 <= (*myNumber-i-j) && (*myNumber-i-j) < 100000){
//you should actually use atomics for this
//otherwise, the value may change during the 'read, modify, write' process
s_hits++;
}
}
}
//synchronize threads, so we now s_hits is completely updated
__syncthreads();
//again, atomics
//we make sure only one thread per threadblock actually adds in s_hits
if(threadIdx.x == 0 && threadIdx.y == 0)
*numOfHits += s_hits;
return;
}
To launch the kernel, you will want something like this:
dim3 blocks(some_number, some_number, 1); //some_number should be hand-optimized
dim3 threads(16, 16, 1);
kernel<<<blocks, threads>>>(/*args*/);
I know you probably want a quick way to do this, but getting into CUDA isn't really a 'quick' thing. As in, you will need to do some reading and some setup to get it working; past that, the learning curve isn't too high. I haven't told you anything about memory allocation yet, so you will need to do that (although that is simple). If you followed my code, my goal is that you had to read up a bit on shared memory and CUDA, and so you are already kick-started. Good luck!
Disclaimer: I haven't tested my code, and I am not an expert - it could be idiotic.