I'm trying to calculate the execution time of an application, assuming the only stall penalty occurs on memory access instructions that miss in the cache (100 cycles per miss).
How am I supposed to find out execution time in seconds with this info?
CPI (CPUCycles?) = 1.0
ClockRate = 1 GHz
TotalInstructions = 59880
MemoryAccessInstructions = 8467
CacheMissRate = 62% (0.62) (5290/8467)
CacheHits = 3117
CacheMisses = 5290
CacheMissPenalty = 100 (cycles)
Assuming no other penalties.
totalCycles = TotalInstructions + CacheMisses * CacheMissPenalty ?
I assume that cache hits cost the same as other opcodes, so those are included in TotalInstructions.
That's then 588880 cycles; 1 GHz is 1,000,000,000 cycles per second.
So that code will take 0.58888 ms to execute (5.8888e-4 seconds).
This value is of course a purely theoretical estimate, as modern CPUs don't work like that (1 instruction = 1 cycle). If you are interested in real-world values, just profile it.
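As a rough sanity check, here is the same stall-cycle arithmetic as a small Python sketch (numbers taken from the question):

# Stall model: every instruction costs 1 base cycle (CPI = 1),
# and every cache miss adds a fixed 100-cycle penalty.
total_instructions = 59880
cache_misses = 5290
miss_penalty = 100            # cycles per miss
clock_rate = 1e9              # 1 GHz = 10^9 cycles per second

total_cycles = total_instructions + cache_misses * miss_penalty
execution_time = total_cycles / clock_rate

print(total_cycles)           # 588880 cycles
print(execution_time)         # 0.00058888 s = 0.58888 ms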
Related
Given:
Operation time required by:
memory units: 200 ps
ALU and adders: 100 ps
Register file: 50 ps
other units and wires: no delay
Instruction mix and operation time in ps:
25% loads (600 ps)
10% stores (550 ps)
45% ALU instructions (400 ps)
15% branches (350 ps)
5% jumps (200 ps)
Every instruction executes in 1 clock cycle
Two implementations: fixed length and variable length
Which implementation would be faster and by how much?
Solution
Reference Table
Rule: CPU execution time = IC * CPI * CCT
Since CPI = 1...
CPU execution time = IC * CCT
My questions are:
What does it mean when an implementation has variable / fixed length?
How were the values for CPU execution time(single clock) calculated?
What does it mean when an implementation has variable / fixed length?
A fixed-length clock means that each clock cycle has the same period, irrespective of the instruction being executed. A variable-length clock means that different clock cycles may have different periods, depending on the instruction being executed.
So in a fixed clock design, the clock cycle has to be at least 600 ps, which is the longest time any instruction would take to execute (the load instruction). In a variable clock design, we can calculate the average clock cycle as follows:
Average CPU clock cycle = 600*25% + 550*10% + 400*45% + 350*15% + 200*5% = 447.5 ps
How were the values for CPU execution time(single clock) calculated?
To determine which implementation is faster, you need to measure speedup, which is defined as:
Speedup = CPU execution time(single) / CPU execution time(variable)
Using the definition of CPU execution time we get (note that the number of instructions is the same):
Speedup = CPU execution time(single) / CPU execution time(variable)
= (Instruction count * Clock cycle time(single)) / (Instruction count * Clock cycle time(variable))
= Clock cycle time(single) / Clock cycle time(variable)
= 600 / 447.5 = 1.34
So the variable clock design is 1.34 times faster.
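For reference, here is the same arithmetic as a small Python sketch (instruction mix and per-class times taken from the problem statement):

# Fraction of instructions and time per instruction in picoseconds.
mix = {
    "load":   (0.25, 600),
    "store":  (0.10, 550),
    "alu":    (0.45, 400),
    "branch": (0.15, 350),
    "jump":   (0.05, 200),
}

# Fixed-length clock: the cycle must cover the slowest instruction.
fixed_cycle = max(t for _, t in mix.values())            # 600 ps

# Variable-length clock: average cycle weighted by the instruction mix.
variable_cycle = sum(f * t for f, t in mix.values())     # 447.5 ps

speedup = fixed_cycle / variable_cycle
print(fixed_cycle, variable_cycle, round(speedup, 2))    # 600 447.5 1.34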
Regarding CPU execution time(variable)
CPU execution time(variable) is technically equal to the sum of the individual clock cycle times of each executed instruction. But we used the average clock cycle time instead to calculate the speedup. Will we get the same result either way? Let's find out!
Assume there are N executed instructions and let C1, C2, ..., CN denote the cycle times of each of them, respectively. Hence:
CPU execution time(variable) = C1 + C2 + ... + CN
= 600*25%*N + 550*10%*N + 400*45%*N + 350*15%*N + 200*5%*N
= N * average CPU clock cycle
So they are the same.
I've been noticing super-slow execution times for essentially all of my CUDA kernels on one machine (Fedora 24, GeForce Titan X Maxwell), but not on others. Edit: I previously gave the CUDA vectorAdd sample as an MCVE, but there were doubts about whether that should really be memory-bottlenecked given the low workload per thread, so here's a hand-unrolled version of that kernel:
enum { serialization_factor = 8 };

__global__ void vectorAdd(
    const float* __restrict__ lhs,
    const float* __restrict__ rhs,
    float*       __restrict__ result,
    int length)
{
    int pos = threadIdx.x + blockIdx.x * blockDim.x * serialization_factor;
    if (length - pos >= blockDim.x * serialization_factor) {
        #pragma unroll
        for (int i = 0; i < serialization_factor; i++) {
            result[pos] = lhs[pos] + rhs[pos];
            pos += blockDim.x;
        }
    }
    else {
        for (; pos < length; pos += blockDim.x) {
            result[pos] = lhs[pos] + rhs[pos];
        }
    }
}
... and suppose we run this for 5,000,000 elements; and launch the kernel twice, ignoring the first run.
Well, with my home GPU, a GeForce GTX 650 Ti Boost, I get 527 usec. This is a bit strange - I was expecting something like 555 usec from a bandwidth calculation: 3004 MHz clock * 192-bit bus = 72096 MB/sec = 72 GB/sec, and 2 * 4 bytes per float * 5M elements of data. But it's pretty close, so let's ignore the difference. The profiler tells me the "Global Load Throughput" is 72.355 GB/sec.
Now, on the Maxwell Titan X at work, I get 232 usec. That's about twice as fast - but the GPU's bandwidth is 5 times as high as my home GPU: ~336 GB/sec. I should be seeing something like 120 usec. And - the profiler tells me the "Global Load Throughput" is 343.271 GB/sec (!)
How could this be happening?
Notes:
If you think I've gotten something wrong with the kernel, please comment about that rather than writing an answer.
The Titan doesn't have ECC on.
Your bandwidth calculations are not fully accurate. The specified theoretical peak memory bandwidth of the GTX 650 Ti BOOST is twice as high (144.2 GB/s) as you calculated because of double data rate transfer (transfer of separate words on both the rising and the falling edge of the clock signal). The achieved bandwidth in the vector add example is 50% higher than you calculated, because writing the results back to memory also needs to be taken into account. This means your GTX 650 Ti BOOST measurements achieved ~79% of its theoretical peak bandwidth.
The Titan X's specified peak memory bandwidth is 336.5 GB/s, so your test achieved ~77% of theoretical peak memory bandwidth.
This is about as good as it gets. The remaining discrepancy is due to overhead like memory refresh, the time needed to switch the transfer direction etc.
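If it helps, here is a small Python sketch of that effective-bandwidth arithmetic (12 bytes moved per element: two 4-byte loads plus one 4-byte store), using the kernel times quoted above:

# Effective bandwidth = bytes moved / kernel run time.
elements = 5_000_000
bytes_per_element = 3 * 4                       # lhs load + rhs load + result store

def effective_bw_gb_s(kernel_time_s):
    return elements * bytes_per_element / kernel_time_s / 1e9

gtx_650_ti_boost = effective_bw_gb_s(527e-6)    # ~113.9 GB/s
titan_x          = effective_bw_gb_s(232e-6)    # ~258.6 GB/s

print(gtx_650_ti_boost / 144.2)                 # ~0.79 of the specified peak
print(titan_x / 336.5)                          # ~0.77 of the specified peak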
Adding to tera's answer: your algorithm has a warm-up and a cool-down phase. When many requests are in flight, latency does get hidden, but at the cost of a warm-up at the start and a cool-down for the last iterations.
If your scheduling is good, you will have a work chunk of 2048 (max resident threads per SM) x 24 (number of SMs on the GTX Titan X) threads, each of which operates on 8 values. Hence, your work chunk is 393,216 entries.
For your 5,000,000 size sample, it results in 12.7 iterations (13 with the last being incomplete). The warm-up/cool-down cost is 1 iteration.
Depending on the scheduling of threads (and this is not necessarily predictable), you may run 14 iterations total; for which you could have had 5,111,808 entries at approximately the same cost (still one warm-up/cool-down). That size would give you the best performance, I believe.
As a result, the incomplete iteration plus the warm-up/cool-down could cost about 10% of performance, with the achieved bandwidth being closer to 85% of peak, if not more.
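For illustration, a small Python sketch of that wave arithmetic (2048 resident threads per SM, 24 SMs, 8 elements per thread, as above):

# One "wave" of work = all resident threads each handling 8 elements.
threads_per_sm = 2048              # max resident threads per SM
num_sms = 24                       # Maxwell Titan X
elements_per_thread = 8            # serialization_factor in the kernel

wave_size = threads_per_sm * num_sms * elements_per_thread   # 393,216 elements
elements = 5_000_000

print(wave_size, elements / wave_size)   # 393216, ~12.7 -> 13 waves, the last one partial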
The minimal run time of a kernel should also be looked at, as it might account for a few microseconds as well. Running on various data sizes should mitigate this point.
Last but not least, the memory frequency might be modifiable with nvidia-smi, as explained here.
The first level (L1) has a hit time of 600 psec, a miss rate of 10%, and a miss penalty of 80 nsec. I add a second-level cache (L2) with a hit time of 5 nsec. I am trying to find the maximum miss rate for the second level, given that the combination of the two caches (L1 + L2) is twice as efficient as the one-level cache (L1 alone).
I am using these formulas:
Average memory access time = Hit time (L1) + Miss rate (L1) x Miss penalty (L1)
Miss penalty (L1) = Hit time (L2) + Miss rate (L2) x Miss penalty (L2)
The solution I get is 40%, but the correct answer is supposed to be 9.25%.
Can anyone help? Thanks in advance.
avg = 8.6 = 0.6 + 0.1*80
1/2*avg = 4.3 = 0.6 + 0.1*(5 + x*80)
=> 3.2 = x*8
=> x = 0.4
So, it seems that your answer is correct under the assumptions that:
- "Average memory access time" does not include any other time for various secondary effects;
- "Double efficiency" means that it takes half the time on average.
The following two code snippets perform the same task (generating M samples uniformly from an N-dimensional sphere). I was wondering why the latter takes much more time than the former.
%% MATLAB R2014a
M = 30;
N = 10000;
#1
tic
S = zeros(M, N);
for k = 1:M
    P = ones(1, N);
    for i = 1:N - 1
        t = rand*2*pi;
        P(1:i) = P(1:i)*sin(t);
        P(i+1) = P(i+1)*cos(t);
    end
    S(k,:) = P;
end
toc
#2
tic
S = ones(M, N);
for k = 1:M
    for i = 1:N - 1
        t = rand*2*pi;
        S(k, 1:i) = S(k, 1:i)*sin(t);
        S(k, i+1) = S(k, i+1)*cos(t);
    end
end
toc
The output is:
Elapsed time is 15.007667 seconds.
Elapsed time is 59.745311 seconds.
And I also tried M = 1,
Elapsed time is 0.463370 seconds.
Elapsed time is 1.566913 seconds.
#2 is nearly 4 times slower than #1. Is the frequent 2-D element access in #2 what makes it time-consuming?
The time difference is due to memory access patterns, and how well they map onto the cache. And also possibly to MATLAB's exploitation of your hardware vector unit (SSE/AVX). MATLAB stores matrices "column-major", meaning S(2,1) is next to S(1,1).
In #1, you process each sample using the vector P, which lives in contiguous memory. These 80,000 bytes fit easily in L2 cache for the fast repeated access you need to perform. They're also neighbors, and trivially vectorized (I'm not certain if MATLAB performs this optimization, but I'd hope so...)
In #2, you access a row of S at a time, which is not contiguous, but rather is interleaved by M values. So each row is spread across 30*80,000 bytes, which does not fit in L2 cache. It'll have to be read back in for each repeated access, even though you're ignoring 29/30 values in that data.
Here's the test. All I'm doing is transposing S so that you can process a column at a time instead, then transposing it back at the end just to get the same result:
#3
tic
S = ones(N, M);
for k = 1:M
    for i = 1:N - 1
        t = rand*2*pi;
        S(1:i, k) = S(1:i, k)*sin(t);
        S(i+1, k) = S(i+1, k)*cos(t);
    end
end
S = S.';
toc
Results:
Elapsed time is 11.254212 seconds.
Elapsed time is 45.847750 seconds.
Elapsed time is 11.501580 seconds.
Yep, transposing S gets us the same contiguous access and performance as the separate vector approach. By the way, L3 vs. L2 is about 4x more clock cycles...
Let's see if we can find any breakpoints related to cache size. Here's N = 1000, where everything should fit in L2:
Elapsed time is 0.240184 seconds.
Elapsed time is 0.373448 seconds.
Elapsed time is 0.258566 seconds.
Much lower difference, though now we're probably into L1 effects.
Finally, here's a completely different way to solve your problem. It relies on the fact that multivariate normal RV's have the correct symmetry.
#4
tic
S = randn(M, N);
S = bsxfun(@rdivide, S, sqrt(sum(S.*S, 2)));
toc
Elapsed time is 10.714104 seconds.
Elapsed time is 45.351277 seconds.
Elapsed time is 11.031061 seconds.
Elapsed time is 0.015068 seconds.
I suspect the advantage comes from using a hard-coded 1 in the array access. If you try M = 1 you will still see a significant speed-up for the sin(t) line. My guess is that the assembly under the hood can use immediate instructions, as opposed to reloading the variable k into a register.
There don't seem to be any preexisting questions on this, at least from a title search. I am seeking to find the optimal number of passes for an external merge. So, if we have 1000 chunks of data, one pass would be a 1000-way merge. Two passes could be 5 groups of 200 chunks, then a final merge of the 1 group of 5 chunks. And so on. I've done some math, which must have a flaw, because it looks like two passes never beat one pass. It could very well be a misunderstanding in how data is read, though.
First, a numerical example:
Data: 100 GB
Ram: 1 GB
Since we have 1 GB of memory, we can load 1 GB at a time and sort it using quicksort or mergesort, giving 100 sorted chunks. We can then do a 100-way merge. This is done by making RAM/(chunks+1) sized buckets = 1024 MB/101 = 10.14 MB: 100 input buckets of 10.14 MB, one for each of the 100 chunks, and one output bucket, also of size 10.14 MB. As we merge, if any input bucket empties, we do a disk seek to refill it. Likewise, when the output bucket gets full, we write it to disk and empty it.
I claim that the number of "times the disk needs to read" is (data/ram)*(chunks+1). I get this from the fact that we have ram/(chunks+1) sized input buckets, and we must read in the entire data for a given pass, so we read (data/bucket_size) times. In other words, every time an input bucket empties we must refill it. We do this over 100 chunks here, so numChunks*(chunk_size/bucket_size) = datasize/bucket_size, or 100*(1024 MB/10.14 MB). Since bucket_size = ram/(chunks+1), this is 100*(1024/10.14) = (data/ram)*(chunks+1) = (102400 MB/1024 MB)*101 = 10100 reads.
For a two pass system, we do A groups of B #chunks, then a final merge of 1 group of A #chunks. Using previous logic, we have numReads = A*( (data/ram)*(B+1)) + 1*( (data/ram)*(A+1)). We also have A*B = Data/Ram. For instance, 10 groups of 10 chunks, where each chunk is a GB. Here, A = 10 B = 10. 10*10 = 100/1 = 100, which is Data/Ram. This is because Data/Ram was the original number of chunks. For 2 pass, we want to break Data/Ram into A groups of B #chunks.
I'll try to break down the formula here, let D = data, A = #groups, B = #chunks/group, R = RAM
A*(D/R)*(B+1) + 1*(D/R)*(A+1) - This is A times the number of reads of an external merge on B #chunks plus the final merge on A #chunks.
A = D/(R*B) => D^2/(B*R^2) * (B+1) + D/R * (D/(R*B)+1)
(D^2/R^2)*[1 + 2/B] + D/R is number of reads for a 2 pass external merge. For 1 pass, we have (data/ram)*(chunks+1) where chunks = data/ram for 1 pass. Thus, for one pass we have D^2/R^2 + D/R. We see that a 2 pass only reaches that as the chunk size B goes to infinity, and even then the additional final merge gives us D^2/R^2 + D/R. So there must be something about the reads I'm missing, or my math is flawed. Thanks to anyone who takes the time to help me!
You ignore the fact that the total time it takes to read a block of data from disk is the sum of
The access time which is roughly constant and on the order of several milliseconds for rotating hard disk drives.
The transfer time which depends on the size of the data block and the transfer rate.
As the number of chunks increases, the size of the input buffers (you call them buckets) decreases. The smaller the input buffers get, the more pronounced the effect of the constant access time becomes on the total time it takes to fill a buffer. At a certain point, the time to fill a buffer will be almost completely dominated by the access time. So the total time for a merge pass begins to scale with the number of buffer reads rather than the amount of data read.
That's where additional merge passes can speed up the process. They allow fewer, larger input buffers to be used, which mitigates the effect of access time.
Edit: Here's a quick back-of-the-envelope calculation to give an idea about where the break-even point is.
The total transfer time can be calculated easily. All the data has to be read and written once per pass:
total_transfer_time = num_passes * 2 * data / transfer_rate
The total access time for buffer reads is:
total_access_time = num_passes * num_buffer_reads * access_time
Since there's only a single output buffer, it can be made larger than the input buffers without wasting too much memory, so I'll ignore the access time for writes. The number of buffer reads is data / buffer_size, buffer size is about ram / num_chunks for the one-pass approach, and the number of chunks is data / ram. So we have:
total_access_time1 = num_chunks^2 * access_time
For the two-pass solution, it makes sense to use sqrt(num_chunks) buffers to minimize access time. So buffer size is ram / sqrt(num_chunks) and we have:
total_access_time2 = 2 * (data / (ram / sqrt(num_chunks))) * access_time
= 2 * num_chunks^1.5 * access_time
So if we use transfer_rate = 100 MB/s, access_time = 10 ms, data = 100 GB, ram = 1 GB, the total time is:
total_time1 = (2 * 100 GB / 100 MB/s) + 100^2 * 10 ms
= 2000 s + 100 s = 2100 s
total_time2 = (2 * 2 * 100 GB / 100 MB/s) + 2 * 100^1.5 * 10 ms
= 4000 s + 20 s = 4020 s
The effect of access time is still very small. So let's change data to 1000 GB:
total_time1 = (2 * 1000 GB / 100 MB/s) + 1000^2 * 10 ms
= 20000 s + 10000 s = 30000 s
total_time2 = (2 * 2 * 1000 GB / 100 MB/s) + 2 * 1000^1.5 * 10 ms
= 40000 s + 632 s = 40632 s
Now half the time in the one-pass version is spent with disk seeks. Let's try with 5000 GB:
total_time1 = (2 * 5000 GB / 100 MB/s) + 5000^2 * 10 ms
= 100000 s + 250000 s = 350000 s
total_time2 = (2 * 2 * 5000 GB / 100 MB/s) + 2 * 5000^1.5 * 10 ms
= 200000 s + 7071 s = 207071 s
Now the two-pass version is faster.
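Here is a small Python sketch of that back-of-the-envelope model, in case you want to try other parameter values (1 GB is treated as 1000 MB to match the numbers above):

# One pass reads and writes all data once; each buffer refill costs one access.
TRANSFER_RATE = 0.1      # GB per second (100 MB/s)
ACCESS_TIME = 0.01       # seconds per buffer read (10 ms)
RAM = 1.0                # GB

def total_time(data_gb, passes):
    chunks = data_gb / RAM
    transfer = passes * 2 * data_gb / TRANSFER_RATE     # read + write, every pass
    if passes == 1:
        buffer_reads = chunks ** 2                      # buffer size ~ ram / chunks
    else:
        buffer_reads = 2 * chunks ** 1.5                # buffer size ~ ram / sqrt(chunks)
    return transfer + buffer_reads * ACCESS_TIME

for data in (100, 1000, 5000):                          # GB
    print(data, total_time(data, 1), total_time(data, 2))
# 100  -> 2100 s vs 4020 s       (one pass faster)
# 1000 -> 30000 s vs ~40632 s    (one pass still faster)
# 5000 -> 350000 s vs ~207071 s  (two passes faster)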
To get an optimum you need a more sophisticated model of the disk. Let the time to fill a block of size S be r*S + k, where k is the seek time and r is the read time per unit of data.
If you divide RAM of size M into C+1 buffers of size M/(C+1), then the time to load RAM once is (C+1) (r M/(C+1) + k) = rM + k(C+1). So as you'd expect, making C smaller speeds up read time by eliminating seeks. It's fastest to read all of memory in one sequential block, but merging doesn't allow it. We must make a tradeoff. That's where we need to look for the optimum.
With total data size of c times RAM size, there are c chunks to be merged.
In the one pass scheme, C=c, and the total read time must be just the time to fill RAM c times over: c (rM + k(c+1)) = c(rM + kc + k).
In the two pass scheme with an N-way division of data for the first pass, that pass has C=c/N and in the second pass, C=N. So total cost is
c ( rM + k(c/N+1) ) + c ( rM + k(N+1) ) = c ( 2rM + k(c/N + N) + 2k )
Note this model omits write time. You should fill that in eventually unless you're assuming it's overlapped I/O on a different device and thus can be ignored.
It's not hard to see here that if c and k are suitably large, then the c/N+N term in the 2-pass model can be so small compared to the c in the one-pass that the 2-pass model will be faster.
I'm going to stop now, but you can carry this logic on to (probably) get a closed approximation formula for an arbitrary number of passes. This will require solving an infinite series. Then you can set the derivative to zero and solve for an estimate of the optimal number of passes. If life is good, you'll also learn the optimal value of N by setting the gradient of a 2d function in pass count and N to zero. My intuition says N ~ sqrt(c).
If the math gets intractable, you could still simulate a reasonable range of numbers of passes with the kind of simple algebra above at the start and pick an optimum that way.
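In that spirit, here is a small Python sketch that evaluates the one-pass and two-pass read-time expressions above and scans N for the two-pass case (it reuses the earlier example figures of 1 GB RAM, 100 MB/s and 10 ms seeks, and still omits write time, as the model does):

# Reading a block of size S costs r*S + k.
# One pass:                     c * (r*M + k*(c + 1))
# Two passes, N-way first pass: c * (2*r*M + k*(c/N + N) + 2*k)
def one_pass(c, M, r, k):
    return c * (r * M + k * (c + 1))

def two_pass(c, M, r, k, N):
    return c * (2 * r * M + k * (c / N + N) + 2 * k)

M = 1.0       # RAM size in GB
r = 10.0      # read time in seconds per GB (100 MB/s)
k = 0.01      # seek time in seconds

for c in (100, 1000, 5000):                    # data size = c * RAM
    best_N = min(range(2, c + 1), key=lambda N: two_pass(c, M, r, k, N))
    print(c, round(one_pass(c, M, r, k)), best_N, round(two_pass(c, M, r, k, best_N)))
# best_N lands near sqrt(c), matching the intuition above; two passes win once c is large.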
This is an interesting problem and I'm sorry I don't have more time to work on it at the moment. I hope the analysis framework is enough to let you punch through to a nice result.