Multilevel cache and maximum miss rate - caching

the first level(L1) has hit rate 600 psec, miss rate 10% and miss penalnty 80 nsec. I add a second level cache(L2) with hit rate 5 nsec. I am trying to find th e maximum miss rate for the second level,considering that the combination of the caches (L1 + L2) has double efficiency than the one-level cache L1.
I am using these forms
Average memory access time = Hit time (L1) + Miss rate (L1) x Miss penalty (L1)
Miss penalty (L1) = Hit time (L2) + Miss rate (L2) x Miss penalty (L2)
the solution i get is 40 %,but the correct answer is 9,25 %.
can anyone help?
THANKS IN ADVANCE

avg = 8.6 = 0.6 + 0.1*80
1/2*avg = 4.3 = 0.6 + 0.1*(5 + x*80)
=> 3.2 = x*8
=> x = 0.4
So, it seems that you answer is correct under assumptions that
- "Average memory access time" does not include any other time value for various secondary effects;
- Double efficiency means that it takes a half time in average.

Related

performance of single-cycle processor

Given:
Operation time required by:
memory units: 200 ps
ALU and adders: 100 ps
Register file: 50 ps
other units and wires: no delay
Instruction mix and operation time in ps:
25% loads (600 ps)
10% stores ( 550 ps)
45% ALU instructions (400 ps)
15% branches (350 ps)
5% jumps (200 ps)
Every instruction executes in 1 clock cycle
Two implementations: fixed length and variable length
Which implementation would be faster and by how much?
Solution
Reference Table
Rule: CPU execution time: IC * CPI * CCT
Since CPI = 1...
CPU execution time: IC * CCT
My questions are:
What does it mean when an implementation has variable / fixed length?
How were the values for CPU execution timesingleclock calculated?
What does it mean when an implementation has variable / fixed length?
Fixed length clock means that each clock cycle has the same period, irrespective of the instruction being execution. Variable length clock means that different clock cycles may have different periods, depending on the instruction being executed.
So in a fixed clock design, the clock cycle has to be at least 600 ps, which is the longest time any instruction would take to execute (the load instruction). In a variable clock design, we can calculate the average clock cycle as follows:
Average CPU clock cycle = 600*25% + 550*10% + 400*45% + 350*15% + 200*5% = 447.5 ps
How were the values for CPU execution timesingleclock
calculated?
To determine which implementation is faster, you need to measure speedup, which is defined as:
Speedup = CPU execution time(single) / CPU execution time(variable)
Using the definition of CPU execution time we get (note that the number of instructions is the same):
Speedup = CPU execution time(single) / CPU execution time(variable)
= (Instruction count * Clock cycle time(single)) / (Instruction count * Clock cycle time(variable))
= Clock cycle time(single) / Clock cycle time(variable)
= 600 / 447.5 = 1.34
Se the variable clock design is 1.34 faster.
Regarding CPU execution timevariable
CPU execution timevariable is technically equal to the sum of the individual clock cycle times of each executed instruction. But we used the average clock cycle time instead to calculate speedup. Will we get the same result either way? Let's find out!
Assume there are N executed instructions and let C1, C2, ..., CN denote the cycle times of each of them, respectively. Hence:
CPU execution time(variable) = C1 + C2 + ... + CN
= 600*25%*N + 550*10%*N + 400*45%*N + 350*15%*N + 200*5%*N
= N * average CPU clock cycle
So they are the same.

Calculate execution time of a program based on CPI, instructions, etc

I'm trying to calculate the execution time of an application. Assuming the only stall penalty occurs on memory access instructions (100 cycles being the penalty).
How am I supposed to find out execution time in seconds with this info?
CPI (CPUCycles?) = 1.0
ClockRate = 1GHZ
TotalInstructions = 59880
MemoryAccessInstructions = 8467
CacheMissRate = 62% (0.62) (5290/8467)
CacheHits = 3117
CacheMisses = 5290
CacheMissPenalty = 100 (cycles)
Assuming no other penalties.
totalCycles = TotalInstructions + CacheMisses * CacheMissPenalty ?
I assume that cache hits cost same as other opcodes, so those are included in TotalInstructions.
That's then 588880 cycles, 1GHz is 1000000000 cycles per second.
So that code will take 0.58888ms to execute (5.8888e-7 second).
This value is of course purely theoretical estimate, as modern CPU doesn't work like that (1 instruction = 1 cycle). If you are interested in real world values, just profile it.

Analytical way of speeding up exp(A*x) in MATLAB

I need to calculate f(x)=exp(A*x) repeatedly for a tiny, variable column vector x and a huge, constant matrix A (many rows, few columns). In other words, the x are few, but the A*x are many. My problem dimensions are such that A*x takes about as much runtime as the exp() part.
Apart from Taylor expansion and pre-calculating a range of values exp(y) (assuming known the range y of values of A*x), which I haven't managed to speed up considerably (while maintaining accuracy) with respect to what MATLAB is doing on its own, I am thinking about analytically restating the problem in order to be able to precalculate some values.
For example, I find that exp(A*x)_i = exp(\sum_j A_ij x_j) = \prod_j exp(A_ij x_j) = \prod_j exp(A_ij)^x_j
This would allow me to precalculate exp(A) once, but the required exponentiation in the loop is as costly as the original exp() function call, and the multiplications (\prod) have to be carried out in addition.
Is there any other idea that I could follow, or solutions within MATLAB that I may have missed?
Edit: some more details
A is 26873856 by 81 in size (yes, it's that huge), so x is 81 by 1.
nnz(A) / numel(A) is 0.0012, nnz(A*x) / numel(A*x) is 0.0075. I already use a sparse matrix to represent A, however, exp() of a sparse matrix is not sparse any longer. So in fact, I store x non-sparse and I calculate exp(full(A*x)) which turned out to be as fast/slow as full(exp(A*x)) (I think A*x is non-sparse anyway, since x is non-sparse.) exp(full(A*sparse(x))) is a way to have a sparse A*x, but is slower. Even slower variants are exp(A*sparse(x)) (with doubled memory impact for a non-sparse matrix of type sparse) and full(exp(A*sparse(x)) (which again yields a non-sparse result).
sx = sparse(x);
tic, for i = 1 : 10, exp(full(A*x)); end, toc
tic, for i = 1 : 10, full(exp(A*x)); end, toc
tic, for i = 1 : 10, exp(full(A*sx)); end, toc
tic, for i = 1 : 10, exp(A*sx); end, toc
tic, for i = 1 : 10, full(exp(A*sx)); end, toc
Elapsed time is 1.485935 seconds.
Elapsed time is 1.511304 seconds.
Elapsed time is 2.060104 seconds.
Elapsed time is 3.194711 seconds.
Elapsed time is 4.534749 seconds.
Yes, I do calculate element-wise exp, I update the above equation to reflect that.
One more edit: I tried to be smart, with little success:
tic, for i = 1 : 10, B = exp(A*x); end, toc
tic, for i = 1 : 10, C = 1 + full(spfun(#(x) exp(x) - 1, A * sx)); end, toc
tic, for i = 1 : 10, D = 1 + full(spfun(#(x) exp(x) - 1, A * x)); end, toc
tic, for i = 1 : 10, E = 1 + full(spfun(#(x) exp(x) - 1, sparse(A * x))); end, toc
tic, for i = 1 : 10, F = 1 + spfun(#(x) exp(x) - 1, A * sx); end, toc
tic, for i = 1 : 10, G = 1 + spfun(#(x) exp(x) - 1, A * x); end, toc
tic, for i = 1 : 10, H = 1 + spfun(#(x) exp(x) - 1, sparse(A * x)); end, toc
Elapsed time is 1.490776 seconds.
Elapsed time is 2.031305 seconds.
Elapsed time is 2.743365 seconds.
Elapsed time is 2.818630 seconds.
Elapsed time is 2.176082 seconds.
Elapsed time is 2.779800 seconds.
Elapsed time is 2.900107 seconds.
Computers don't really do exponents. You would think they do, but what they do is high-accuracy polynomial approximations.
References:
http://www.math.vanderbilt.edu/~esaff/texts/13.pdf
http://deepblue.lib.umich.edu/bitstream/handle/2027.42/33109/0000495.pdf
http://www.cs.yale.edu/homes/sachdeva/pubs/fast-algos-via-approx-theory.pdf
The last reference looked quite nice. Perhaps it should have been first.
Since you are working on images, you likely have discrete number of intensity levels (255 typically). This can allow reduced sampling, or lookups, depending on the nature of "A". One way to check this is to do something like the following for a sufficiently representative group of values of "x":
y=Ax
cdfplot(y(:))
If you were able to pre-segment your images into "more interesting" and "not as interesting" - like if you were looking at an x-ray being able to trim out all the "outside the human body" locations and clamp them to zero to pre-sparsify your data, that could reduce your number of unique values. You might consider the previous for each unique "mode" inside the data.
My approaches would include:
look at alternate formulations of exp(x) that are lower accuracy but higher speed
consider table lookups if you have few enough levels of "x"
consider a combination of interpolation and table lookups if you have "slightly too many" levels to do a table lookup
consider a single lookup (or alternate formulation) based on segmented mode. If you know it is a bone and are looking for a vein, then maybe it should get less high-cost data processing applied.
Now I have to ask myself why would you be living in so many iterations of exp(A*x)*x and I think you might be switching back and forth between frequency/wavenumber domain and time/space domain. You also might be dealing with probabilities using exp(x) as a basis, and doing some Bayesian fun. I don't know that exp(x) is a good conjugate prior, so I'm going to go with the fourier material.
Other options:
- consider use of fft, fft2, or fftn given your matrices - they are fast and might do part of what you are looking for.
I am sure there is a forier domain variation on the following:
https://mathoverflow.net/questions/34173/fast-matrix-multiplication
http://www-cc.cs.uni-saarland.de/media/oldmaterial/bc.pdf
http://arxiv.org/PS_cache/math/pdf/0511/0511460v1.pdf
You might be able to mix the lookup with a compute using the woodbury matrix. I would have to think about that some to be sure though. (link) At one point I knew that everything that mattered (CFD, FEA, FFT) were all about the matrix inversion, but I have since forgotten the particular details.
Now, if you are living in MatLab then you might consider using "coder" which converts MatLab code to c-code. No matter how much fun an interpreter may be, a good c-compiler can be a lot faster. The mnemonic (hopefully not too ambitious) that I use is shown here: link starting around 13:49. It is really simple, but it shows the difference between a canonical interpreted language (python) and compiled version of the same (cython/c).
I'm sure that if I had some more specifics, and was requested to, then I could engage more aggressively in a more specifically relevant answer.
You might not have a good way to do it on conventional hardware, buy you might consider something like a GPGPU. CUDA and its peers have massively parallel operations that allow substantial speedup for the cost of a few video cards. You can have thousands of "cores" (overglorified pipelines) doing the work of a few ALU's and if the job is properly parallelizable (as this looks like) then it can get done a LOT faster.
EDIT:
I was thinking about Eureqa. One option that I would consider if I had some "big iron" for development but not production would be to use their Eureqa product to come up with a fast enough, accurate enough approximation.
If you performed a 'quick' singular value decomposition of your "A" matrix, you would find that the dominant performance is governed by 81 eigenvectors. I would look at the eigenvalues and see if there were only a few of those 81 eigenvectors providing the majority of the information. If that was the case, then you can clamp the others to zero, and construct a simple transformation.
Now, if it were me, I would want to get "A" out of the exponent. I'm wondering if you can look at the 81x81 eigenvector matrix and "x" and think a little about linear algebra, and what space you are projecting your vectors into. Is there any way that you can make a function that looks like the following:
f(x) = B2 * exp( B1 * x )
such that the
B1 * x
is much smaller rank than your current
Ax.

Number of passes in external merge

There doesn't seem to be any preexisting questions on this, at least from a title search. I am seeking to find the optimal amount of passes for an external merge. So, if we have 1000 chunks of data, one pass would be a 1000 way merge. Two pass could be 5 groups of 200 chunks, then a final merge of 1 group of 5 chunks. And so on. I've done some math, which must have a flaw, because it looks like two passes never beats one pass. It could very well be a misunderstanding in how data is read, though.
First, a numerical example:
Data: 100 GB
Ram: 1 GB
Since we have 1GB memory, we can load in 1GB at a time to sort using quicksort or mergesort. Now we have 100 chunks to sort. We can do a 100 way merge. This is done by making RAM/(chunks+1) size buckets = 1024MB/101 = 10.14MB. There are 100 10.14MB buckets for each of the 100 chunks, and one output bucket also of size 10.14MB. As we merge, if any input buckets empty, we do a disk seek to refill that bucket. Likewise, when the output bucket gets full, we write to the disk and empty it. I claim that the number of "times the disk needs to read" is (data/ram)*(chunks+1). I get this from the fact that we have ram/(chunks+1) sized input buckets, and we must read in the entire data for a given pass, so we read (data/bucket_size) times. In other words, every time an input bucket empties we must refill it. We do this over 100 chunks here, so numChunks*(chunk_size/bucket_size) = datasize/bucket_size or 100*(1024MB/10.14MB). BucketSize = ram/(chunks+1) so 100*(1024/10.14) = (data/ram) * (chunks+1) = 1024*100MB/1024MB * 101 = 10100 reads.
For a two pass system, we do A groups of B #chunks, then a final merge of 1 group of A #chunks. Using previous logic, we have numReads = A*( (data/ram)*(B+1)) + 1*( (data/ram)*(A+1)). We also have A*B = Data/Ram. For instance, 10 groups of 10 chunks, where each chunk is a GB. Here, A = 10 B = 10. 10*10 = 100/1 = 100, which is Data/Ram. This is because Data/Ram was the original number of chunks. For 2 pass, we want to break Data/Ram into A groups of B #chunks.
I'll try to break down the formula here, let D = data, A = #groups, B = #chunks/group, R = RAM
A*(D/R)*(B+1) + 1*(D/R)*(A+1) - This is A times the number of reads of an external merge on B #chunks plus the final merge on A #chunks.
A = D/(R*B) => D^2/(B*R^2) * (B+1) + D/R * (D/(R*B)+1)
(D^2/R^2)*[1 + 2/B] + D/R is number of reads for a 2 pass external merge. For 1 pass, we have (data/ram)*(chunks+1) where chunks = data/ram for 1 pass. Thus, for one pass we have D^2/R^2 + D/R. We see that a 2 pass only reaches that as the chunk size B goes to infinity, and even then the additional final merge gives us D^2/R^2 + D/R. So there must be something about the reads I'm missing, or my math is flawed. Thanks to anyone who takes the time to help me!
You ignore the fact that the total time it takes to read a block of data from disk is the sum of
The access time which is roughly constant and on the order of several milliseconds for rotating hard disk drives.
The transfer time which depends on the size of the data block and the transfer rate.
As the number of chunks increases, the size of the input buffers (you call them buckets) decreases. The smaller the input buffers get, the more pronounced is the effect of the constant access time on the total time is takes to fill a buffer. At a certain point, the time to fill a buffer will be almost completely dominated by the access time. So the total time for a merge pass begins to scale with the number of buffers and not the amount of data read.
That's where additional merge passes can speed up the process. It allows to use fewer and larger input buffers and mitigates the effect of access time.
Edit: Here's a quick back-of-the-envelope calculation to give an idea about where the break-even point is.
The total transfer time can be calculated easily. All the data has to read and written once per pass:
total_transfer_time = num_passes * 2 * data / transfer_rate
The total access time for buffer reads is:
total_access_time = num_passes * num_buffer_reads * access_time
Since there's only a single output buffer, it can be made larger than the input buffers without wasting too much memory, so I'll ignore the access time for writes. The number of buffer reads is data / buffer_size, buffer size is about ram / num_chunks for the one-pass approach, and the number of chunks is data / ram. So we have:
total_access_time1 = num_chunks^2 * access_time
For the two-pass solution, it makes sense to use sqrt(num_chunks) buffers to minimize access time. So buffer size is ram / sqrt(num_chunks) and we have:
total_access_time2 = 2 * (data / (ram / sqrt(num_chunks))) * acccess_time
= 2 * num_chunks^1.5 * access_time
So if we use transfer_rate = 100 MB/s, access_time = 10 ms, data = 100 GB, ram = 1 GB, the total time is:
total_time1 = (2 * 100 GB / 100 MB/s) + 100^2 * 10 ms
= 2000 s + 100 s = 2100 s
total_time2 = (2 * 2 * 100 GB / 100 MB/s) + 2 * 100^1.5 * 10 ms
= 4000 s + 20 s = 4020 s
The effect of access time is still very small. So let's change data to 1000 GB:
total_time1 = (2 * 1000 GB / 100 MB/s) + 1000^2 * 10 ms
= 20000 s + 10000 s = 30000 s
total_time2 = (2 * 2 * 1000 GB / 100 MB/s) + 2 * 1000^1.5 * 10 ms
= 40000 s + 632 s = 40632 s
Now half the time in the one-pass version is spent with disk seeks. Let's try with 5000 GB:
total_time1 = (2 * 5000 GB / 100 MB/s) + 5000^2 * 10 ms
= 100000 s + 250000 s = 350000 s
total_time2 = (2 * 2 * 5000 GB / 100 MB/s) + 2 * 5000^1.5 * 10 ms
= 200000 s + 7071 s = 207071 s
Now the two-pass version is faster.
To get an optimum you need a more sophisticated model of the disk. Let time to fill a block of size S be rS + k where k is seek time and r is read rate.
If you divide RAM of size M into C+1 buffers of size M/(C+1), then the time to load RAM once is (C+1) (r M/(C+1) + k) = rM + k(C+1). So as you'd expect, making C smaller speeds up read time by eliminating seeks. It's fastest to read all of memory in one sequential block, but merging doesn't allow it. We must make a tradeoff. That's where we need to look for the optimum.
With total data size of c times RAM size, there are c chunks to be merged.
In the one pass scheme, C=c, and the total read time must be just the time to fill RAM c times over: c (rM + k(c+1)) = c(rM + kc + k).
In the two pass scheme with an N-way division of data for the first pass, that pass has C=c/N and in the second pass, C=N. So total cost is
c ( rM + k(c/N+1) ) + c ( rM + k(N+1) ) = c ( 2rM + k(c/N + N) + 2k )
Note this model omits write time. You should fill that in eventually unless you're assuming it's overlapped I/O on a different device and thus can be ignored.
It's not hard to see here that if c and k are suitably large, then the c/N+N term in the 2-pass model can be so small compared to the c in the one-pass that the 2-pass model will be faster.
I'm going to stop now, but you can carry this logic on to (probably) get a closed approximation formula for an arbitrary number of passes. THis will require solving an infinite series. Then you can set the derivative to zero and solve for an estimate of the optimal pass number. If life is good you'll also learn the optimal value of N by setting the gradient of a 2d function in pass number and N to zero. My intuition says N ~ sqrt(c).
If the math gets intractable, you could still simulate a reasonable range of numbers of passes with the kind of simple algebra above at the start and pick an optimum that way.
This is an interesting problem and I'm sorry I don't have more time to work on it at the moment. I hope the analysis framework is enough to let you punch through to a nice result.

What's the best way to calculate remaining download time?

Let's say you want to calculate the remaining download time, and you have all the information needed, that is: File size, dl'ed size, size left, time elapsed, momentary dl speed, etc'.
How would you calculate the remaining dl time?
Ofcourse, the straightforward way would be either: size left/momentary dl speed, or: (time elapsed/dl'ed size)*size left.
Only that the first would be subject to deviations in the momentary speed, and the latter wouldn't adapt well to altering speeds.
Must be some smarter way to do that, right? Take a look at the pirated software and music you currently download with uTorrent. It's easy to notice that it does more than the simple calculation mentioned before. Actually, I notices that sometimes when the dl speed drops, the time remaining also drops for a couple of moments until it readjusts.
Well, as you said, using the absolutely current download speed isn't a great method, because it tends to fluctuate. However, something like an overall average isn't a great idea either, because there may be large fluctuations there as well.
Consider if I start downloading a file at the same time as 9 others. I'm only getting 10% of my normal speed, but halfway through the file, the other 9 finish. Now I'm downloading at 10x the speed I started at. My original 10% speed shouldn't be a factor in the "how much time is left?" calculation any more.
Personally, I'd probably take an average over the last 30 seconds or so, and use that. That should do calculations based on recent speed, without fluctuating wildly. 30 seconds may not be the right amount, it would take some experimentation to figure out a good amount.
Another option would be to set a sort of "fluctuation threshold", where you don't do any recalculation until the speed changes by more than that threshold. For example (random number, again, would require experimentation), you could set the threshold at 10%. Then, if you're downloading at 100kb/s, you don't recalculate the remaining time until the download speed changes to either below 90kb/s or 110kb/s. If one of those changes happens, the time is recalculated and a new threshold is set.
You could use an averaging algorithm where the old values decay linearly. If S_n is the speed at time n and A_{n-1} is the average at time n-1, then define your average speed as follows.
A_1 = S_1
A_2 = (S_1 + S_2)/2
A_n = S_n/(n-1) + A_{n-1}(1-1/(n-1))
In English, this means that the longer in the past a measurement occurred, the less it matters because its importance has decayed.
Compare this to the normal averaging algorithm:
A_n = S_n/n + A_{n-1}(1-1/n)
You could also have it geometrically decay, which would weight the most recent speeds very heavily:
A_n = S_n/2 + A_{n-1}/2
If the speeds are 4,3,5,6 then
A_4 = 4.5 (simple average)
A_4 = 4.75 (linear decay)
A_4 = 5.125 (geometric decay)
Example in PHP
Beware that $n+1 (not $n) is the number of current data points due to PHP's arrays being zero-indexed. To match the above example set n == $n+1 or n-1 == $n
<?php
$s = [4,3,5,6];
// average
$a = [];
for ($n = 0; $n < count($s); ++$n)
{
if ($n == 0)
$a[$n] = $s[$n];
else
{
// $n+1 = number of data points so far
$weight = 1/($n+1);
$a[$n] = $s[$n] * $weight + $a[$n-1] * (1 - $weight);
}
}
var_dump($a);
// linear decay
$a = [];
for ($n = 0; $n < count($s); ++$n)
{
if ($n == 0)
$a[$n] = $s[$n];
elseif ($n == 1)
$a[$n] = ($s[$n] + $s[$n-1]) / 2;
else
{
// $n = number of data points so far - 1
$weight = 1/($n);
$a[$n] = $s[$n] * $weight + $a[$n-1] * (1 - $weight);
}
}
var_dump($a);
// geometric decay
$a = [];
for ($n = 0; $n < count($s); ++$n)
{
if ($n == 0)
$a[$n] = $s[$n];
else
{
$weight = 1/2;
$a[$n] = $s[$n] * $weight + $a[$n-1] * (1 - $weight);
}
}
var_dump($a);
Output
array (size=4)
0 => int 4
1 => float 3.5
2 => float 4
3 => float 4.5
array (size=4)
0 => int 4
1 => float 3.5
2 => float 4.25
3 => float 4.8333333333333
array (size=4)
0 => int 4
1 => float 3.5
2 => float 4.25
3 => float 5.125
The obvious way would be something in between, you need a 'moving average' of the download speed.
I think it's just an averaging algorithm. It averages the rate over a few seconds.
What you could do also is keep track of your average speed and show a calculation of that as well.
For anyone interested, I wrote an open-source library in C# called Progression that has a "moving-average" implementation: ETACalculator.cs.
The Progression library defines an easy-to-use structure for reporting several types of progress. It also easily handles nested progress for very smooth progress reporting.
EDIT: Here's what I finally suggest, I tried it and it provides quite satisfying results:
I have a zero initialized array for each download speed between 0 - 500 kB/s (could be higher if you expect such speeds) in 1 kB/s steps.
I sample the download speed momentarily (every second is a good interval), and increment the coresponding array item by one.
Now I know how many seconds I have spent downloading the file at each speed. The sum of all these values is the elapsed time (in seconds). The sum of these values multiplied by the corresponding speed is the size downloaded so far.
If I take the ratio between each value in the array and the elapsed time, assuming the speed variation pattern stabalizes, I can form a formula to predict the time each size will take. This size in this case, is the size remaining. That's what I do:
I take the sum of each array item value multiplied by the corresponding speed (the index) and divided by the elapsed time. Then I divide the size left by this value, and that's the time left.
Takes a few seconds to stabalize, and then works preety damn well.
Note that this is a "complicated" average, so the method of discarding older values (moving average) might improve it even further.

Resources