Measuring how slow atomic increments are compared to regular integer increments - C++11

A recent discussion made me wonder how expensive an atomic increment is in comparison to a regular integer increment.
I wrote some code to try and benchmark this:
#include <iostream>
#include <atomic>
#include <chrono>
#include <cstdlib>  // for atoi

static const int NUM_TEST_RUNS = 100000;
static const int ARRAY_SIZE = 500;

void runBenchmark(std::atomic<int>& atomic_count, int* count_array, int array_size, bool do_atomic_increment){
    for(int i = 0; i < array_size; ++i){
        ++count_array[i];
    }
    if(do_atomic_increment){
        ++atomic_count;
    }
}

int main(int argc, char* argv[]){
    int num_test_runs = NUM_TEST_RUNS;
    int array_size = ARRAY_SIZE;
    if(argc == 3){
        num_test_runs = atoi(argv[1]);
        array_size = atoi(argv[2]);
    }
    if(num_test_runs == 0 || array_size == 0){
        std::cout << "Usage: atomic_operation_overhead <num_test_runs> <num_integers_in_array>" << std::endl;
        return 1;
    }
    // Instantiate the atomic counter, initialized to 0
    std::atomic<int> atomic_count(0);
    // Allocate the integer buffer that will be updated every time,
    // zero-initialized so the final sum is well defined
    int* count_array = new int[array_size]();
    // Track the time elapsed when incrementing with the atomic counter
    auto start = std::chrono::steady_clock::now();
    for(int i = 0; i < num_test_runs; ++i){
        runBenchmark(atomic_count, count_array, array_size, true);
    }
    auto end = std::chrono::steady_clock::now();
    // Calculate time elapsed for incrementing with the atomic counter
    auto diff_with_lock = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    std::cout << "Elapsed time with atomic increment for "
              << num_test_runs << " test runs: "
              << diff_with_lock.count() << " ns" << std::endl;
    // Track the time elapsed when incrementing without the atomic counter
    start = std::chrono::steady_clock::now();
    for(int i = 0; i < num_test_runs; ++i){
        runBenchmark(atomic_count, count_array, array_size, false);
    }
    end = std::chrono::steady_clock::now();
    // Calculate time elapsed for incrementing without the atomic counter
    auto diff_without_lock = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    std::cout << "Elapsed time without atomic increment for "
              << num_test_runs << " test runs: "
              << diff_without_lock.count() << " ns" << std::endl;
    auto difference_running_times = diff_with_lock - diff_without_lock;
    auto proportion = difference_running_times.count() / (double)diff_without_lock.count();
    std::cout << "How much slower was locking: " << proportion * 100.0 << " %" << std::endl;
    // We loop over all entries in the array and print their sum.
    // We do this mainly to prevent the compiler from optimizing out
    // the loop where we increment all the values in the array.
    int array_sum = 0;
    for(int i = 0; i < array_size; ++i){
        array_sum += count_array[i];
    }
    std::cout << "Array sum (just to prevent loop getting optimized out): " << array_sum << std::endl;
    delete [] count_array;
    return 0;
}
The trouble I am having is that this program produces widely divergent results from one run to the next:
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 1000 500
Elapsed time with atomic increment for 1000 test runs: 99852 ns
Elapsed time without atomic increment for 1000 test runs: 96396 ns
How much slower was locking: 3.58521 %
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 1000 500
Elapsed time with atomic increment for 1000 test runs: 182769 ns
Elapsed time without atomic increment for 1000 test runs: 138319 ns
How much slower was locking: 32.1359 %
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 1000 500
Elapsed time with atomic increment for 1000 test runs: 98858 ns
Elapsed time without atomic increment for 1000 test runs: 96404 ns
How much slower was locking: 2.54554 %
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 1000 500
Elapsed time with atomic increment for 1000 test runs: 107848 ns
Elapsed time without atomic increment for 1000 test runs: 105174 ns
How much slower was locking: 2.54245 %
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 1000 500
Elapsed time with atomic increment for 1000 test runs: 113865 ns
Elapsed time without atomic increment for 1000 test runs: 100559 ns
How much slower was locking: 13.232 %
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 1000 500
Elapsed time with atomic increment for 1000 test runs: 98956 ns
Elapsed time without atomic increment for 1000 test runs: 106639 ns
How much slower was locking: -7.20468 %
This leads me to believe that there might be a bug in the benchmarking code itself. Is there some error I am missing? Is my usage of std::chrono for benchmarking incorrect? Or is the time difference down to overhead in the OS related to handling atomic operations?
What could I be doing wrong?
Test bed:
Intel® Core™ i7-4700MQ CPU @ 2.40GHz × 8
8GB RAM
GNU/Linux: Ubuntu 14.04 LTS (64-bit)
GCC version: 4.8.4
Compilation: g++ -std=c++11 -O3 atomic_operation_overhead.cpp -o atomic_operation_overhead
EDIT: Updated the test run output after compiling with -O3 optimization.
EDIT: After running the tests with an increased number of iterations and adding a sum over the array to prevent the increment loop from being optimized out, as suggested by Adam, I get more convergent results:
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500
Elapsed time with atomic increment for 99999999 test runs: 7111974931 ns
Elapsed time without atomic increment for 99999999 test runs: 6938317779 ns
How much slower was locking: 2.50287 %
Array sum (just to prevent loop getting optimized out): 1215751192
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500
Elapsed time with atomic increment for 99999999 test runs: 7424952991 ns
Elapsed time without atomic increment for 99999999 test runs: 7262721866 ns
How much slower was locking: 2.23375 %
Array sum (just to prevent loop getting optimized out): 1215751192
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500
Elapsed time with atomic increment for 99999999 test runs: 7172114343 ns
Elapsed time without atomic increment for 99999999 test runs: 7030985219 ns
How much slower was locking: 2.00725 %
Array sum (just to prevent loop getting optimized out): 1215751192
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500
Elapsed time with atomic increment for 99999999 test runs: 7094552104 ns
Elapsed time without atomic increment for 99999999 test runs: 6971060941 ns
How much slower was locking: 1.77148 %
Array sum (just to prevent loop getting optimized out): 1215751192
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500
Elapsed time with atomic increment for 99999999 test runs: 7099907902 ns
Elapsed time without atomic increment for 99999999 test runs: 6970289856 ns
How much slower was locking: 1.85958 %
Array sum (just to prevent loop getting optimized out): 1215751192
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500
Elapsed time with atomic increment for 99999999 test runs: 7763604675 ns
Elapsed time without atomic increment for 99999999 test runs: 7229145316 ns
How much slower was locking: 7.39312 %
Array sum (just to prevent loop getting optimized out): 1215751192
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500
Elapsed time with atomic increment for 99999999 test runs: 7164534212 ns
Elapsed time without atomic increment for 99999999 test runs: 6994993609 ns
How much slower was locking: 2.42374 %
Array sum (just to prevent loop getting optimized out): 1215751192
balajeerc@Balajee:~/Projects/Misc$ ./atomic_operation_overhead 99999999 500
Elapsed time with atomic increment for 99999999 test runs: 7154697145 ns
Elapsed time without atomic increment for 99999999 test runs: 6997030700 ns
How much slower was locking: 2.25333 %
Array sum (just to prevent loop getting optimized out): 1215751192

Some thoughts:
Run more iterations, at least enough that it takes a few seconds. Your runs take milliseconds, so an IO interrupt could be enough to skew your results.
Print out the sum at the end. The compiler might be smart enough to optimize your loops far more than you think, so your code might be doing less work than you think. If the compiler sees the value is never read, it might delete your loops entirely.
Do the iteration all in one loop, as opposed to a loop that calls a function. While the compiler will probably inline your function call, it's best to not introduce another potential source of noise.
I'm sure you'll do this next, but add a threaded test. Might as well do it for both; you'll get a wrong sum in the non-atomic variable due to races, but at least you'll see the performance penalty you pay for consistency.
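For the threaded variant, a minimal sketch along these lines could be a starting point (the thread count and per-thread iteration count are arbitrary choices for illustration, not values from the question; compile with something like g++ -std=c++11 -O3 -pthread):

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

static const int NUM_THREADS = 4;                   // arbitrary illustrative choice
static const int INCREMENTS_PER_THREAD = 10000000;  // arbitrary illustrative choice

int main(){
    std::atomic<int> atomic_count(0);
    int plain_count = 0;  // deliberately unsynchronized, for comparison
    std::vector<std::thread> workers;
    for(int t = 0; t < NUM_THREADS; ++t){
        workers.emplace_back([&](){
            for(int i = 0; i < INCREMENTS_PER_THREAD; ++i){
                ++atomic_count;  // safe under contention
                ++plain_count;   // data race: the final value will usually be wrong
            }
        });
    }
    for(auto& w : workers){
        w.join();
    }
    std::cout << "atomic:   " << atomic_count << "\n"
              << "plain:    " << plain_count << "\n"
              << "expected: " << NUM_THREADS * INCREMENTS_PER_THREAD << std::endl;
    return 0;
}

Timing the two increments in separate runs (as in the single-threaded benchmark above) would then expose the contention cost that a single-threaded test cannot show.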

Related

omp parallel for loop (reduction to find max) ran slower than serial code

I am new to using OpenMP.
I think that using the max reduction clause to find the maximum element of an array is not such a bad idea, but in fact the parallel for loop ran much slower than the serial one.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

int main() {
    double sta, end, elapse_t;
    int bsize = 46000;
    int q = bsize;
    int max_val = 0;
    double *buffer;
    buffer = (double*)malloc(bsize*sizeof(double));
    srand(time(NULL));
    for(int i = 0; i < q; i++)
        buffer[i] = rand() % 10000;

    sta = omp_get_wtime();
    //int i;
    #pragma omp parallel for reduction(max : max_val)
    for(int i = 0; i < q; i++)
    {
        max_val = max_val > buffer[i] ? max_val : buffer[i];
    }
    end = omp_get_wtime();
    printf("parallel maximum time %f\n", end - sta);

    sta = omp_get_wtime();
    for(int i = 0; i < q; i++)
    {
        max_val = max_val > buffer[i] ? max_val : buffer[i];
    }
    end = omp_get_wtime();
    printf("serial maximum time %f\n", end - sta);
    free(buffer);
    return 0;
}
Compile command
gcc-7 kp_omp.cpp -o kp_omp -fopenmp
Execution results
./kp_omp
parallel maximum time 0.000505
serial maximum time 0.000266
As for the CPU, it is an Intel Core i7-6700 with 8 cores.
Whenever you parallelise a loop, OpenMP needs to perform some operations, for example creating the threads. Those operations result in some overhead, and this in turn implies that, for each loop, there is a minimum number of iterations below which it is not worth parallelising.
If I execute your code I obtain the same results you have:
./kp_omp
parallel maximum time 0.000570
serial maximum time 0.000253
However, if I modify bsize in line 8 such that
int bsize = 100000;
I obtain
./kp_omp
parallel maximum time 0.000323
serial maximum time 0.000552
So the parallelised version got faster than the sequential one. Part of the challenge when you try to speed up code is to understand when it is worth parallelising and when the overhead of parallelisation would kill your expected gain in performance.
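One way to encode that trade-off directly is OpenMP's if clause, which makes the runtime fall back to serial execution when the condition is false. A minimal sketch against the question's variables (the 50000 cut-off is an arbitrary illustrative threshold, to be tuned by measurement):

// Parallelise only when the trip count is large enough to amortise
// the thread-management overhead; otherwise run the loop serially.
const int PARALLEL_THRESHOLD = 50000;  // illustrative value, not measured
#pragma omp parallel for reduction(max : max_val) if(q > PARALLEL_THRESHOLD)
for(int i = 0; i < q; i++)
{
    max_val = max_val > buffer[i] ? max_val : buffer[i];
}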

Cannot understand the metric returned by "perf" regarding the cache-misses

My question is about understanding the Linux perf tool metrics. I made an optimisation related to prefetching/cache-misses in my code, which is now faster. However, perf does not show me that (or, more likely, I do not understand what perf shows me).
Taking it back to where it all began: I did an investigation in order to speed up random memory accesses using prefetching.
Here is what my program does:
It uses two int buffers of the same size
It reads one-by-one all the values of the first buffer
each value is a random index in the second buffer
It reads the value at the index in the second buffer
It sums all the values taken from the second buffer
It does all the previous steps for bigger and bigger prefetch steps (see iterateOnPrefetchSteps in the code below)
At the end, I print the number of voluntary and involuntary CPU context switches
After my last tunings, my code is the following one:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/time.h>
#include <math.h>
#include <sched.h>

#define BUFFER_SIZE ((unsigned long) 4096 * 50000)
#define PADDING 256

unsigned int randomUint()
{
    int value = rand() % UINT_MAX;
    return value;
}

unsigned int * createValueBuffer()
{
    unsigned int * valueBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
    for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
    {
        valueBuffer[i] = randomUint();
    }
    return (valueBuffer);
}

unsigned int * createIndexBuffer()
{
    unsigned int * indexBuffer = (unsigned int *) malloc((BUFFER_SIZE + PADDING) * sizeof(unsigned int));
    for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
    {
        indexBuffer[i] = rand() % BUFFER_SIZE;
    }
    return (indexBuffer);
}

double computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer, unsigned short prefetchStep)
{
    double sum = 0;
    for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
    {
        __builtin_prefetch((char *) &valueBuffer[indexBuffer[i + prefetchStep]], 0, 0);
        unsigned int index = indexBuffer[i];
        unsigned int value = valueBuffer[index];
        double s = sin(value);
        sum += s;
    }
    return (sum);
}

unsigned int computeTimeInMicroSeconds(unsigned short prefetchStep)
{
    unsigned int * valueBuffer = createValueBuffer();
    unsigned int * indexBuffer = createIndexBuffer();
    struct timeval startTime, endTime;
    gettimeofday(&startTime, NULL);
    double sum = computeSum(indexBuffer, valueBuffer, prefetchStep);
    gettimeofday(&endTime, NULL);
    printf("prefetchStep = %d, Sum = %f - ", prefetchStep, sum);
    free(indexBuffer);
    free(valueBuffer);
    return ((endTime.tv_sec - startTime.tv_sec) * 1000 * 1000) + (endTime.tv_usec - startTime.tv_usec);
}

void testWithPrefetchStep(unsigned short prefetchStep)
{
    unsigned int timeInMicroSeconds = computeTimeInMicroSeconds(prefetchStep);
    printf("Time: %u micro-seconds = %.3f seconds\n", timeInMicroSeconds, (double) timeInMicroSeconds / (1000 * 1000));
}

int iterateOnPrefetchSteps()
{
    printf("sizeof buffers = %ldMb\n", BUFFER_SIZE * sizeof(unsigned int) / (1024 * 1024));
    for (unsigned short prefetchStep = 0 ; prefetchStep < 250 ; prefetchStep++)
    {
        testWithPrefetchStep(prefetchStep);
    }
}

void setCpuAffinity(int cpuId)
{
    int pid = 0;
    cpu_set_t mask;
    unsigned int len = sizeof(mask);
    CPU_ZERO(&mask);
    CPU_SET(cpuId, &mask);
    sched_setaffinity(pid, len, &mask);
}

int main(int argc, char ** argv)
{
    setCpuAffinity(7);
    if (argc == 2)
    {
        testWithPrefetchStep(atoi(argv[1]));
    }
    else
    {
        iterateOnPrefetchSteps();
    }
}
At the end of my previous Stack Overflow question I thought I had all the elements: in order to avoid cache misses I made my code prefetch data (using __builtin_prefetch) and my program was faster. Everything looked as expected.
However, I wanted to study it using the Linux perf tool. So I launched a comparison between two executions of my program:
./TestPrefetch 0: doing so, the prefetch is inefficient because it is issued on the data that is read just afterwards (when the data is accessed, it cannot have been loaded into the CPU cache yet). Run duration: 21.346 seconds
./TestPrefetch 1: here the prefetch is far more efficient because the data is fetched one loop iteration before it is read. Run duration: 12.624 seconds
The perf outputs are the following ones:
$ gcc -O3 TestPrefetch.c -o TestPrefetch -lm && for cpt in 0 1; do echo ; echo "### Step=$cpt" ; sudo perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./TestPrefetch $cpt; done
### Step=0
prefetchStep = 0, Sum = -1107.523504 - Time: 21346278 micro-seconds = 21.346 seconds
Performance counter stats for './TestPrefetch 0':
24387,010283 task-clock (msec) # 1,000 CPUs utilized
97 274 163 155 cycles # 3,989 GHz
59 183 107 508 instructions # 0,61 insn per cycle
425 300 823 cache-references # 17,440 M/sec
249 261 530 cache-misses # 58,608 % of all cache refs
24,387790203 seconds time elapsed
### Step=1
prefetchStep = 1, Sum = -1107.523504 - Time: 12623665 micro-seconds = 12.624 seconds
Performance counter stats for './TestPrefetch 1':
15662,864719 task-clock (msec) # 1,000 CPUs utilized
62 159 134 934 cycles # 3,969 GHz
59 167 595 107 instructions # 0,95 insn per cycle
484 882 084 cache-references # 30,957 M/sec
321 873 952 cache-misses # 66,382 % of all cache refs
15,663437848 seconds time elapsed
Here, I have difficulty understanding why I am doing better:
The number of cache-misses is almost the same (I even have a little bit more): I can't understand why, and (above all) if so, why am I faster?
What are the cache-references?
What are task-clock and cycles? Do they include the time spent waiting for data access in the case of a cache miss?
I wouldn't trust the perf summary, as it's not very clear what each name represents and which perf counters they are programmed to follow. The default settings have also been known to count the wrong things (see https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/557604).
What could happen here is that your cache-miss counter may also count the prefetch instructions (which may look like loads to the machine, especially as you descend the cache hierarchy). In that case, having more cache references (lookups) makes sense, and you would expect these requests to be misses (the whole point of a prefetch is to miss...).
Instead of relying on some ambiguous counter, find out the counter IDs and masks for your specific machine that represent demand-read lookups and misses, and see if they improved.
Edit: looking at your numbers again, I see an increase of ~50M accesses, but ~70M misses. It's possible that there are more misses due to cache thrashing done by the prefetches.
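For instance, perf exposes symbolic cache events that are narrower than the generic cache-misses counter. Assuming they are available on your CPU and kernel (check perf list), something along these lines separates L1 and last-level data loads from their misses:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./TestPrefetch 1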
I can't understand why, and (above all) if so, why am I faster?
Because you run more instructions per unit of time. The old run:
0,61 insn per cycle
and the new one:
0,95 insn per cycle
What are the cache-references?
They count how many times the cache was asked whether it contains the data you were loading/storing.
What are task-clock and cycles? Do they include the time spent waiting for data access in the case of a cache miss?
Yes. But note that in today's processors there is no idle waiting for any of this. Instructions are executed out of order and usually prefetched, and if the next instruction needs data that is not ready yet, other instructions will get executed.
I recently made progress on my perf issues. I discovered a lot of new events, among which some are really interesting.
Regarding the current problem, the following event has to be considered: L1-icache-load-misses
When I monitor my test application with perf under the same conditions as previously, I get the following values for this event:
1 202 210 L1-icache-load-misses
against
530 127 L1-icache-load-misses
For the moment, I do not yet understand why the cache-misses events are not impacted by prefetching while the L1-icache-load-misses are...

Clock_t returns big negative numbers

I am trying to measure the running time in nanoseconds of a function by using the clock() function, like this:
clock_t start_t = clock();
random_function();
clock_t end_t = clock();
printf("Elapsed time: %f\n", (double)(end_t - start_t) / CLOCKS_PER_SEC);
When I run the program I sometimes get very big negative numbers as a result.
Why does this happen?
If your platform is using an unsigned type for clock_t, you can compute the difference modulo std::numeric_limits<clock_t>::max() instead of using the plain subtraction.
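A minimal sketch of that idea, assuming clock_t really is an unsigned type on the platform (the helper name is illustrative):

#include <ctime>
#include <limits>

// Wrap-around-aware difference between two clock() samples, in seconds.
double elapsed_seconds(clock_t start_t, clock_t end_t)
{
    clock_t ticks;
    if(end_t >= start_t){
        ticks = end_t - start_t;
    } else {
        // The counter wrapped around between the two samples.
        ticks = std::numeric_limits<clock_t>::max() - start_t + end_t;
    }
    return static_cast<double>(ticks) / CLOCKS_PER_SEC;
}

Note also that clock() measures processor time at a fairly coarse resolution, so nanosecond-level timing of a single short function call is unlikely to be meaningful either way.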

What caused my elapsed time much longer than user time?

I am benchmarking some R statements (see details here) and found that my elapsed time is way longer than my user time.
user system elapsed
7.910 7.750 53.916
Could someone help me understand what factors (R or hardware) determine the difference between user time and elapsed time, and how I can improve it? In case it helps: I am running data.table data manipulation on a MacBook Air 1.7GHz i5 with 4GB RAM.
Update: My crude understanding is that user time is what it takes my CPU to process my job. Elapsed time is the length of time from when I submit a job until I get the data back. What else did my computer need to do after processing for 8 seconds?
Update: as suggested in the comments, I ran the code a couple of times on two data.tables: Y, with 104 columns (sorry, I add more columns as time goes by), and X as a subset of Y with only 3 keys. Below are the updates. Please note that I ran these two procedures consecutively, so the memory state should be similar.
X <- Y[, list(Year, MemberID, Month)]
system.time({
    X[, Month := -Month]
    setkey(X, Year, MemberID, Month)
    X[, Month := -Month]
})
   user  system elapsed
  3.490   0.031   3.519
system.time({
    Y[, Month := -Month]
    setkey(Y, Year, MemberID, Month)
    Y[, Month := -Month]
})
   user  system elapsed
  8.444   5.564  36.284
Here are the sizes of the only two objects in my workspace (commas added):
object.size(X)
83,237,624 bytes
object.size(Y)
2,449,521,080 bytes
Thank you
User time is how many seconds the computer spent doing your calculations. System time is how much time the operating system spent responding to your program's requests. Elapsed time is the sum of those two, plus whatever "waiting around" your program and/or the OS had to do. It's important to note that these numbers are the aggregate of time spent. Your program might compute for 1 second, then wait on the OS for one second, then wait on disk for 3 seconds and repeat this cycle many times while it's running.
Based on the fact that your program took about as much system time as user time, it was a very IO intensive thing: reading from disk a lot or writing to disk a lot. RAM is pretty fast, a few hundred nanoseconds usually. So if everything fits in RAM, elapsed time is usually just a little bit longer than user time. But disk might take a few milliseconds to seek and even longer to reply with the data. That's slower by a factor of a million.
We've determined that your processor was "doing stuff" for ~8 + ~8 = ~ 16 seconds. What was it doing for the other ~54 - ~16 = ~38 seconds? Waiting for the hard drive to send it the data it asked for.
UPDATE1:
Matthew has made some excellent points about assumptions I'm making that I probably shouldn't be making. Adam, if you'd care to publish a list of all the rows in your table (datatypes are all we need) we can get a better idea of what's going on.
I just cooked up a little do-nothing program to validate my assumption that time not spent in userspace and kernel space is likely spent waiting for IO.
#include <stdio.h>

int main()
{
    int i;
    for(i = 0; i < 1000000000; i++)
    {
        int j, k, l, m;
        j = 10;
        k = i;
        l = j + k;
        m = j + k - i + l;
    }
    return 0;
}
When I run the resulting program and time it I see something like this:
mike#computer:~$ time ./waste_user
real 0m4.670s
user 0m4.660s
sys 0m0.000s
mike#computer:~$
As you can see by inspection, the program does no real work, and as such it doesn't ask the kernel to do anything beyond loading it into RAM and starting it running. So nearly ALL the "real" time is spent as "user" time.
Now a kernel-heavy do-nothing program (with fewer iterations to keep the running time reasonable):
#include <stdio.h>

int main()
{
    FILE * random;
    random = fopen("/dev/urandom", "r");
    int i;
    for(i = 0; i < 10000000; i++)
    {
        fgetc(random);
    }
    return 0;
}
When I run that one, I see something more like this:
mike@computer:~$ time ./waste_sys
real    0m1.138s
user    0m0.090s
sys     0m1.040s
mike@computer:~$
Again it's easy to see by inspection that the program does little more than ask the kernel to give it random bytes. /dev/urandom is a non-blocking source of entropy. What does that mean? The kernel uses a pseudo-random number generator to quickly generate "random" values for our little test program. That means the kernel has to do some computation but it can return very quickly. So this program mostly waits for the kernel to compute for it, and we can see that reflected in the fact that almost all the time is spent on sys.
Now we're going to make one little change. Instead of reading from /dev/urandom which is non-blocking we'll read from /dev/random which is blocking. What does that mean? It doesn't do much computing but rather it waits around for stuff to happen on your computer that the kernel developers have empirically determined is random. (We'll also do far fewer iterations since this stuff takes much longer)
#include <stdio.h>

int main()
{
    FILE * random;
    random = fopen("/dev/random", "r");
    int i;
    for(i = 0; i < 100; i++)
    {
        fgetc(random);
    }
    return 0;
}
And when I run and time this version of the program, here's what I see:
mike@computer:~$ time ./waste_io
real    0m41.451s
user    0m0.000s
sys     0m0.000s
mike@computer:~$
It took 41 seconds to run, but immeasurably small amounts of time on user and sys. Why is that? All the time was spent in the kernel, but not doing active computation. The kernel was just waiting for stuff to happen. Once enough entropy was collected, the kernel would wake back up and send the data back to the program. (Note it might take much less or much more time to run on your computer depending on what all is going on). I argue that the difference in time between user+sys and real is IO.
So what does all this mean? It doesn't prove that my answer is right because there could be other explanations for why you're seeing the behavior that you are. But it does demonstrate the differences between user compute time, kernel compute time and what I'm claiming is time spent doing IO.
Here's my source for the difference between /dev/urandom and /dev/random:
http://en.wikipedia.org/wiki//dev/random
UPDATE2:
I thought I would try and address Matthew's suggestion that perhaps L2 cache misses are at the root of the problem. The Core i7 has a 64 byte cache line. I don't know how much you know about caches, so I'll provide some details. When you ask for a value from memory the CPU doesn't get just that one value, it gets all 64 bytes around it. That means if you're accessing memory in a very predictable pattern -- like say array[0], array[1], array[2], etc -- it takes a while to get value 0, but then 1, 2, 3, 4... are much faster. Until you get to the next cache line, that is. If this were an array of ints, 0 would be slow, 1..15 would be fast, 16 would be slow, 17..31 would be fast, etc.
http://software.intel.com/en-us/forums/topic/296674
In order to test this out I've made two programs. They both have an array of structs in them with 1024*1024 elements. In one case the struct has a single double in it, in the other it's got 8 doubles in it. A double is 8 bytes long so in the second program we're accessing memory in the worst possible fashion for a cache. The first will get to use the cache nicely.
#include <stdio.h>
#include <stdlib.h>

#define MANY_MEGS 1048576

typedef struct {
    double a;
} PartialLine;

int main()
{
    int i, j;
    PartialLine* many_lines;
    int total_bytes = MANY_MEGS * sizeof(PartialLine);
    printf("Striding through %d total bytes, %d bytes at a time\n", total_bytes, sizeof(PartialLine));
    many_lines = (PartialLine*) malloc(total_bytes);
    PartialLine line;
    double x;
    for(i = 0; i < 300; i++)
    {
        for(j = 0; j < MANY_MEGS; j++)
        {
            line = many_lines[j];
            x = line.a;
        }
    }
    return 0;
}
When I run this program I see this output:
mike@computer:~$ time ./cache_hits
Striding through 8388608 total bytes, 8 bytes at a time
real    0m3.194s
user    0m3.140s
sys     0m0.016s
mike@computer:~$
Here's the program with the big structs; each one takes up 64 bytes of memory, not 8.
#include <stdio.h>
#include <stdlib.h>

#define MANY_MEGS 1048576

typedef struct {
    double a, b, c, d, e, f, g, h;
} WholeLine;

int main()
{
    int i, j;
    WholeLine* many_lines;
    int total_bytes = MANY_MEGS * sizeof(WholeLine);
    printf("Striding through %d total bytes, %d bytes at a time\n", total_bytes, sizeof(WholeLine));
    many_lines = (WholeLine*) malloc(total_bytes);
    WholeLine line;
    double x;
    for(i = 0; i < 300; i++)
    {
        for(j = 0; j < MANY_MEGS; j++)
        {
            line = many_lines[j];
            x = line.a;
        }
    }
    return 0;
}
And when I run it, I see this:
mike@computer:~$ time ./cache_misses
Striding through 67108864 total bytes, 64 bytes at a time
real    0m14.367s
user    0m14.245s
sys     0m0.088s
mike@computer:~$
The second program, the one designed to have cache misses, took five times as long to run for the exact same number of memory accesses.
Also worth noting is that in both cases, all the time spent was spent in user, not sys. That means that the OS is counting the time your program has to wait for data against your program, not against the operating system. Given these two examples I think it's unlikely that cache misses are causing your elapsed time to be substantially longer than your user time.
UPDATE3:
I just saw your update that the really slimmed down table ran about 10x faster than the regular-sized one. That too would indicate to me that (as another Matthew also said) you're running out of RAM.
Once your program tries to use more memory than your computer actually has installed, it starts swapping to disk. This is better than your program crashing, but it's much slower than RAM and can cause substantial slowdowns.
I'll try and put together an example that shows swap problems tomorrow.
UPDATE4:
Okay, here's an example program which is very similar to the previous one. But now the struct is 4096 bytes, not 8 bytes. In total this program will use 2GB of memory rather than 64MB. I also change things up a bit and make sure that I access things randomly instead of element-by-element, so that the kernel can't get smart and start anticipating my program's needs. The caches are driven by hardware (driven solely by simple heuristics), but it's entirely possible that kswapd (the kernel swap daemon) could be substantially smarter than the cache.
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    double numbers[512];
} WholePage;

int main()
{
    int memory_ops = 1024*1024;
    int total_memory = memory_ops / 2;
    int num_chunks = 8;
    int chunk_bytes = total_memory / num_chunks * sizeof(WholePage);
    int i, j, k, l;
    printf("Bouncing through %u MB, %d bytes at a time\n", chunk_bytes/1024*num_chunks/1024, sizeof(WholePage));
    WholePage* many_pages[num_chunks];
    for(i = 0; i < num_chunks; i++)
    {
        many_pages[i] = (WholePage*) malloc(chunk_bytes);
        if(many_pages[i] == 0){ exit(1); }
    }
    WholePage* page_list;
    WholePage* page;
    double x;
    for(i = 0; i < 300*memory_ops; i++)
    {
        j = rand() % num_chunks;
        k = rand() % (total_memory / num_chunks);
        l = rand() % 512;
        page_list = many_pages[j];
        page = page_list + k;
        x = page->numbers[l];
    }
    return 0;
}
Going from the program I called cache_hits to cache_misses, we saw the size of memory increase 8x and execution time increase 5x. What do you expect to see when we run this program? It uses 32x as much memory as cache_misses but has the same number of memory accesses.
mike@computer:~$ time ./page_misses
Bouncing through 2048 MB, 4096 bytes at a time
real    2m1.327s
user    1m56.483s
sys     0m0.588s
mike@computer:~$
It took 8x as long as cache_misses and 40x as long as cache_hits. And this is on a computer with 4GB of RAM. I used 50% of my RAM in this program versus 1.5% for cache_misses and 0.2% for cache_hits. It got substantially slower even though it wasn't using up ALL the RAM my computer has. It was enough to be significant.
I hope this is a decent primer on how to diagnose problems with programs running slow.

Performance hit in CUDA program that calls kernel repeatedly within a for loop

I have a CUDA program that calls the kernel repeatedly within a for loop. The code computes all rows of a matrix by using the values computed in the previous one until the entire matrix is done. This is basically a dynamic programming algorithm. The code below fills the (i,j) entry of many separate matrices in parallel with the kernel.
for(i = 1; i <= xdim; i++){
    for(j = 1; j <= ydim; j++){
        start3time = clock();
        assign5<<<BLOCKS, THREADS>>>(Z, i, j, x, y, z);
        end3time = clock();
        diff = static_cast<double>(end3time - start3time) / (CLOCKS_PER_SEC / 1000);
        printf("Time for i=%d j=%d is %f\n", i, j, diff);
    }
}
The kernel assign5 is straightforward:
__global__ void assign5(float* Z, int i, int j, int x, int y, int z) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    char ch = database[j + id];
    Z[i+id] = (Z[x+id] + Z[y+id] + Z[z+id]) * dev_matrix[i][index[ch - 'A']];
}
My problem is that when I run this program the time for each i and j is 0 most of the time, but sometimes it is 10 milliseconds. So the output looks like
Time for i=0 j=0 is 0
Time for i=0 j=1 is 0
.
.
Time for i=15 j=21 is 10
Time for i=15 j=22 is 0
.
I don't understand why this is happening. I don't see a thread race condition. If I add
if(i % 20 == 0) cudaThreadSynchronize();
right after the first loop, then the time for i and j is mostly 0. But then the time for the synchronization is sometimes 10 or even 20 milliseconds. It seems like CUDA is performing many operations at low cost and then charges a lot for later ones. Any help would be appreciated.
I think you have a misconception about what a kernel call in CUDA actually does on the host. A kernel call is non-blocking and is only added to the device's queue. If you're measuring time before and after your kernel call, then the difference has nothing to do with how long your kernel takes to execute (it measures the time it takes to add the kernel call to the queue).
You should add a cudaThreadSynchronize() after every kernel call and before you measure end3time. cudaThreadSynchronize() blocks and returns only once all kernels in the queue have finished their work.
This is why
if(i % 20 == 0) cudaThreadSynchronize();
made spikes in your measurements.
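A sketch of the timing loop with that synchronisation added (cudaThreadSynchronize() is deprecated in recent toolkits in favour of cudaDeviceSynchronize(), which serves the same purpose here):

for(i = 1; i <= xdim; i++){
    for(j = 1; j <= ydim; j++){
        start3time = clock();
        assign5<<<BLOCKS, THREADS>>>(Z, i, j, x, y, z);
        cudaDeviceSynchronize();  // wait until the kernel has actually finished
        end3time = clock();
        diff = static_cast<double>(end3time - start3time) / (CLOCKS_PER_SEC / 1000);
        printf("Time for i=%d j=%d is %f\n", i, j, diff);
    }
}

With the synchronisation inside the measured region, each reported time reflects the kernel's execution rather than the cost of queueing it.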
