Calculating CPU Performance in MIPS

I was taking an exam earlier and memorized the questions I didn't know how to answer, but somehow got them correct (the online exam, taken through our electronic classroom (eClass), was multiple choice, and it was coded so each of us got random questions in a random order with randomized answer choices, so yeah).
Anyway, back to my questions:
1.)
There is a CPU with a clock frequency of 1 GHz. When the instructions consist of two
types as shown in the table below, what is the performance in MIPS of the CPU?
              Execution time (clocks)   Frequency of appearance (%)
Instruction 1           10                          60
Instruction 2           15                          40
Answer: 125
2.)
There is a hard disk drive with specifications shown below. When a record of 15
Kbytes is processed, which of the following is the average access time in milliseconds?
Here, the record is stored in one track.
[Specifications]
Capacity: 25 Kbytes/track
Rotation speed: 2,400 revolutions/minute
Average seek time: 10 milliseconds
Answer: 37.5
3.)
Assume a magnetic disk has a rotational speed of 5,000 rpm, and an average seek time of 20 ms. The recording capacity of one track on this disk is 15,000 bytes. What is the average access time (in milliseconds) required in order to transfer one 4,000-byte block of data?
Answer: 29.2
4.)
When a color image is stored in video memory at a tonal resolution of 24 bits per pixel,
approximately how many megabytes (MB) are required to display the image on the
screen with a resolution of 1024 x 768 pixels? Here, 1 MB is 10^6 bytes.
Answer: 18.9
5.)
When a microprocessor works at a clock speed of 200 MHz and the average CPI
(“cycles per instruction” or “clocks per instruction”) is 4, how long does it take to
execute one instruction on average?
Answer: 20 nanoseconds
I don't expect someone to answer everything, and they are indeed already answered, but I want to know how those answers were arrived at. It's not enough for me to know the answer; I've tried solving them myself trial-and-error style, but it takes minutes to hours per question, so I need some professional help.

1.)
n = 1/f = 1 / 1 GHz = 1 ns.
n*10 * 0.6 + n*15 * 0.4 = 12 ns (= average instruction time), so 1 / 12 ns = 83.3 MIPS.
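As a quick sanity check, here is the same arithmetic as a throwaway Python sketch (mine, not part of the original answer); with the table as quoted it comes out to 83.3 MIPS rather than the listed 125, so the remembered numbers may be slightly off:

    # Q1: average instruction time and MIPS for a 1 GHz clock with a
    # 60/40 mix of 10-clock and 15-clock instructions.
    clock_hz = 1e9                              # 1 GHz
    avg_clocks = 10 * 0.6 + 15 * 0.4            # 12 clocks per instruction
    avg_time_ns = avg_clocks * 1e9 / clock_hz   # 12 ns per instruction
    mips = clock_hz / avg_clocks / 1e6          # millions of instructions per second
    print(avg_time_ns, mips)                    # 12.0 83.33...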
2.) and 3.)
I don't get these, honestly.
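For what it's worth, both disk answers drop out of the usual textbook formula: average access time = average seek time + average rotational latency (half a revolution) + transfer time for the record. A small Python sketch of my own, assuming that is the formula the exam expects:

    # Average access time = average seek + half a revolution + transfer time.
    def avg_access_ms(seek_ms, rpm, track_bytes, record_bytes):
        rev_ms = 60_000 / rpm                    # one full revolution, in ms
        latency_ms = rev_ms / 2                  # average rotational latency
        transfer_ms = record_bytes / track_bytes * rev_ms   # fraction of one track
        return seek_ms + latency_ms + transfer_ms

    print(avg_access_ms(10, 2400, 25_000, 15_000))  # Q2 -> 37.5
    print(avg_access_ms(20, 5000, 15_000, 4_000))   # Q3 -> 29.2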
4.)
Here, 1 MB is 10^6 bytes.
3 Bytes * 1024 * 768 = 2359296 Bytes = 2.36 MB
But these 24 bits are often packed into 32 bits because of the memory layout (word width), so it will often be 4 Bytes * 1024 * 768 = 3145728 Bytes = 3.15 MB.
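The same two figures in a quick Python check (again mine, just to make the arithmetic explicit):

    # Q4: video memory for a 1024 x 768 image at 24 bits per pixel, 1 MB = 10^6 bytes.
    pixels = 1024 * 768
    packed_24bit_mb = pixels * 3 / 10**6   # 3 bytes per pixel -> 2.359296
    padded_32bit_mb = pixels * 4 / 10**6   # padded to 4 bytes per pixel -> 3.145728
    print(packed_24bit_mb, padded_32bit_mb)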
5.)
CPI / f = 4 / 200 MHz = 20 ns.
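And the same check for 5, trivial but included for completeness (my sketch):

    # Q5: average time per instruction = CPI / clock frequency.
    cpi = 4
    clock_hz = 200e6                     # 200 MHz
    print(cpi / clock_hz * 1e9, "ns")    # 20.0 ns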

Related

Times of the Fortran MATMUL function with different multiplication sizes

I have calculated the time spent by Fortran's MATMUL function with different multiplication sizes (32 × 32, 64 × 64, ...) and I have questions about the results.
These are the results:
SIZE ----- TIME IN SECONDS
32   ----- 0.000071
64   ----- 0.000032
128  ----- 0.001889
256  ----- 0.010866
512  ----- 0.043
1024 ----- 0.336
2048 ----- 2.878
4096 ----- 51.932
8192 ----- 405.921856
I would guess the times should increase by a factor of 8 each time the size doubles (m*2 * n*2 * k*2). I do not know if it should be like that; if so, can someone tell me why it is not?
In addition, we see an increase by a factor of about 18 going from 2048 to 4096. Could someone tell me why?
I have measured the times with CALL CPU_TIME() and CALL DATE_AND_TIME() in Fortran, and both give very similar results.
My processor is an AMD Phenom (tm) II X4 945 Processor with 4 cores
@Steve is correct: there are many factors that affect performance, especially when data sizes are small. That's why all of your results at and below 2048 are pretty much semi-random and essentially irrelevant. All or most of the data is likely sitting in several layers of CPU cache, so cache flushes, thread scheduling, and other hardware-related events skew those results badly. If you run these tests again you will find different results at these small sizes.
So, when you go from 2048 to 4096 you get a major jump. All the data no longer fits into the CPU caches. The computer needs to load blocks of data from RAM into the CPU caches. This explains the large jump in time.
It is at these sizes and larger that the computer has to do more typical operations (load data, perform operations, save data to RAM) and this is the performance you will get as data gets even larger. This is also where performance becomes very consistent as data grows larger. Notice that going from 4096 to 8192 is very close to exactly 8 times longer. At this point, going to 16384 will take almost exactly 8 times 406 seconds.
Any size smaller than 4096 is not giving your computer enough work to accurately measure the performance.
There should be a factor of 8 between successive timings, and deviations are generally due to memory effects such as cache alignment and cache size versus array size. For small arrays there may be a calling overhead to MATMUL(). A triple do-loop can be faster, at least with some optimization (try -O3 -march=native), and should work equally well for small sizes.
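One way to see how far the posted timings are from that ideal factor of 8 is to print the ratio between successive sizes; a quick Python sketch using the numbers from the question:

    # Ratios between successive MATMUL timings from the table above.
    # Ideal O(n^3) scaling is a factor of 8 each time n doubles; small sizes
    # are dominated by cache effects and call overhead.
    times = {32: 0.000071, 64: 0.000032, 128: 0.001889, 256: 0.010866,
             512: 0.043, 1024: 0.336, 2048: 2.878, 4096: 51.932, 8192: 405.921856}
    sizes = sorted(times)
    for a, b in zip(sizes, sizes[1:]):
        print(f"{a:5d} -> {b:5d}: x{times[b] / times[a]:.1f}")
    # 2048 -> 4096 is ~x18 (data no longer fits in cache); 4096 -> 8192 is ~x7.8.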

FFT performance figures

I am wondering what time performance one can achieve nowadays to compute 2D FFTs. Just an order of magnitude, for 1K x 1K or 2K x 2K images.
Links or personal experience are welcome.
I reran a simple test for reference:
FFTW library 3.3.5 (2016). I used the precompiled DLLs; they exploit SSE, but I am not sure about AVX.
Windows 7 32-bit, Intel i5-4670 (Haswell, 4 cores).
Single precision, real-to-complex, out-of-place 2D transform (using fftwf_plan_dft_r2c_2d).
1024 x 1024:
Single thread: 5 ms per iteration
Two threads: 3.8 ms per iteration
Four threads: 2.4 ms per iteration
2048 x 2048:
Single thread: 28 ms per iteration
Two threads: 16 ms per iteration
Four threads: 12 ms per iteration
Double precision, real-to-complex, out-of-place 2D transform (using fftw_plan_dft_r2c_2d).
1024 x 1024:
Single thread: 7 ms per iteration
Four threads: 3 ms per iteration

File read time in C increases unexpectedly

I'm currently facing an annoying problem: I have to read a big data file (500 GB) that is stored on an SSD (RevoDrive 350).
I read the file using fread in big memory chunks (roughly 17 MB per chunk).
At the beginning of my program everything goes smoothly: it takes 10 ms to read 3 chunks. Then, after about 10 seconds, read times collapse and vary between 60 and 90 ms.
I don't know why this is happening, or whether it is possible to keep the read time stable.
Thank you in advance
Rob
17 MB per chunk, 10 ms for 3 chunks -> 51 MB / 10 ms.
10 sec = 1000 x 10 ms -> 51 GB read after 10 seconds!
How much memory do you have? Is your pagefile on the same disk?
The system may swap memory!
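Putting numbers on that back-of-the-envelope estimate (my own Python sketch of the same arithmetic):

    # Figures from above: 3 chunks of ~17 MB read in 10 ms.
    mb_per_10ms = 3 * 17                                    # ~51 MB every 10 ms
    throughput_gb_s = mb_per_10ms / 0.010 / 1000            # ~5.1 GB/s sustained
    total_after_10s_gb = mb_per_10ms * (10 / 0.010) / 1000  # ~51 GB in the first 10 s
    print(throughput_gb_s, total_after_10s_gb)

If that ~51 GB exceeds what your RAM and page cache can hold, the slowdown after about 10 seconds fits the swapping/caching explanation above.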

Measuring effective bandwidth on CUDA

So I want to know how to calculate the total memory effective bandwidth for:
cublasSdot(handle, M, devPtrA, 1, devPtrB, 1, &curesult);
where that function belongs to cublas_v2.h.
That function runs in 0.46 ms, and the vectors are 10000 * sizeof(float).
Do I just compute ((10000 * 4) / 10^9) / 0.00046 = 0.086 GB/s?
I'm wondering because I don't know what is inside the cublasSdot function, and I don't know whether that matters here.
In your case, the size of the input data is 10000 * 4 * 2 since you have 2 input vectors, and the size of the output data is 4. The effective bandwidth should be about 0.172 GB/s.
Basically, cublasSdot() does nothing much more than the computation itself.
Profiling shows cublasSdot() invokes 2 kernels to compute the result. An extra 4-byte device-to-host memory transfer is also issued if the pointer mode is CUBLAS_POINTER_MODE_HOST, which is the default mode for the cuBLAS library.
If kernel time is in ms then a multiplication factor of 1000 is necessary.
That results in 86 GB/s.
As an example, refer to the example provided by NVIDIA for matrix transpose at
http://docs.nvidia.com/cuda/samples/6_Advanced/transpose/doc/MatrixTranspose.pdf
The entire code is on the last page; there, the effective bandwidth is computed as 2.*1000*mem_size/(1024*1024*1024)/(time in ms).
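For the dot-product case in the question, the general formula is effective bandwidth = (bytes read + bytes written) / elapsed time. A small Python sketch of my own, assuming cublasSdot reads both input vectors once and writes a single float:

    # Effective bandwidth for a dot product of two vectors of 10000 floats.
    n = 10000
    bytes_read = 2 * n * 4          # two input vectors of 32-bit floats
    bytes_written = 4               # one float result
    time_s = 0.46e-3                # measured time: 0.46 ms
    bw_gb_s = (bytes_read + bytes_written) / 1e9 / time_s
    print(bw_gb_s)                  # ~0.17 GB/s, in line with the answer above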

ISR - Maximum Data Rate

The interrupt service routine (ISR) for a device transfers 4 bytes of data from the
device on each device interrupt. On each interrupt, the ISR executes 90 instructions
with each instruction taking 2 clock cycles to execute. The CPU takes 20 clock cycles
to respond to an interrupt request before the ISR starts to execute instructions.
Calculate the maximum data rate, in bits per second, that can be input from this
device, if the CPU clock frequency is 100 MHz.
Any help on how to solve will be appreciated.
What I'm thinking: 90 instructions x 2 cycles = 180 cycles,
plus the 20-cycle response delay = 200 cycles per interrupt.
At 100 MHz that is 100 million cycles per second, so 100 million / 200 = 500,000 interrupts per second, each transferring 4 bytes,
so 2 million bytes, or 16 million bits, per second.
I think it's right, but I'm not 100% sure; can anyone confirm?
Cheers
Your calculation looks good to me. If you want an "engineering answer" then I'd add a 10% margin, something like: "Theoretical max data rate is 16M bits per second. Using a 10% margin, no more than 14.4M bits per second."
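The same calculation in a few lines of Python (a sketch of the reasoning confirmed above):

    # Maximum input data rate for the ISR.
    clock_hz = 100e6                        # 100 MHz CPU clock
    cycles_per_interrupt = 20 + 90 * 2      # response delay + 90 instructions x 2 cycles
    interrupts_per_s = clock_hz / cycles_per_interrupt   # 500,000 per second
    bits_per_s = interrupts_per_s * 4 * 8   # 4 bytes transferred per interrupt
    print(bits_per_s)                       # 16,000,000 bits per second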
