ISR - Maximum Data Rate - algorithm

The interrupt service routine (ISR) for a device transfers 4 bytes of data from the
device on each device interrupt. On each interrupt, the ISR executes 90 instructions
with each instruction taking 2 clock cycles to execute. The CPU takes 20 clock cycles
to respond to an interrupt request before the ISR starts to execute instructions.
Calculate the maximum data rate, in bits per second, that can be input from this
device, if the CPU clock frequency is 100MHz.
Any help on how to solve will be appreciated.
What I'm thinking - 90 instructions x 2 cycles = 180
20 cycles delay = 200 cycles per one interrupt
so in 100mhz = 100million cycles = 100million/200 = 500,000 cycles each with 4 bytes
so 2million bytes or 16million bits
I think its right but im not 100% sure can anyone confirm?
cheers/

Your calculation looks good to me. If you want an "Engineering answer" then I'd add a 10% margin. Something like: "Theoretical max data rate is 16m bits per sec. Using a 10% margin, no more that 14.4m bits per sec"

Related

Intructions vs cycles per second - what is actually measured in Hertz?

So I am going through some tutorials, and it seems they keep using "instructions" and "cycles" interchangeably, so now I am confused what is actually measured in Hertz (on the most basic level, without going into what the modern processors can do in parallel etc, trying to learn the basics here).
Say, the program is as follows: load two numbers, add them, store result.
So there will be 4 cycles:
load number A [fetch-decode-execute]
load number B [fetch-decode-execute]
add A and B [fetch-decode-execute]
store result [fetch-decode-execute]
What is a cycle here, and what is an instruction?
There are 4 cycles, or 12 instructions, correct?
Say, it takes CPU 1 sec to run this program.
What will be the CPU clock speed? 12 instructions/1 sec or 4 cycles/1 sec?
If the former one, then is the clock speed of the CPU 12 Hertz?
If the latter one, then is the clock speed of the CPU 4 Hertz?
From helpful comments by #Nate Eldredge:
"A fetch-decode-execute cycle is one instruction cycle, but three clock cycles.
The clock speed measures the number of clock cycles per second."
Thus, if the program is executed within 1 second, and it takes 12 clock cycles, the clock speed of that particular CPU is 12 Hz.

Ram real time latency

I read somewhere that to find the real latency of the ram you can use the following rule :
1/((RAMspeed/2)/1000) x CL = True Latency in nanoseconds
ie for DDR1 with 400Mhz clock speed, is it logical to me to divide by 2 to get the FSB speed or the real bus speed which is 200Mhz in this case. So the rule above seems to be correct for the DDR1.
From the other side, DDR2 also doubles the freq of the bus in relation to the previous DDR1 generation (ie 4 bits per clock cycle) according the article "What Every Programmer Should Know About Memory".
So, in the case of a DDR2 with a 800Mhz clock speed, to find the "True Latency" the above rule should be accordingly changed to
1/((RAMspeed/4)/1000) x CL = True Latency in nanoseconds
Is that correct? Because in all the cases I read that the correct way is to take RAMspeed/2 no matter if it's DDR, DDR2, DDR3 or DDR4.
Which is the correct way to get the true latency?
The CAS latency is in memory-bus clock cycles. This is always one half the transfers-per-second number. e.g. DDR3-1600 has a memory clock of 800MHz, doing 1600M transfers per second (during a burst transfer).
DDR2, DDR3, and DDR4 still use a double-pumped 64-bit memory bus (transferring data on the rising and falling edges of the clock signal), not quad-pumped. This is why they're still called Double Data-Rate (DDR) SDRAM.
The FSB speed has nothing to do with it.
On old CPUs without integrated memory controllers, i.e. systems that actually have an FSB, its frequency is often configurable (in the BIOS) separately from the memory speed. See Front Side Bus and RAM speed; on even older systems, the FSB and memory clocks were synchronous.
Normally systems were designed with a fast enough FSB to keep up with the memory controller. Running the FSB at the same clock speed as the memory can reduce latency by avoiding buffering between clock domains.
So yes, the CAS latency in seconds is cycle_count / frequency, or more like your formula
1000ns/us * CL / RAMspeed * 2 transfers/clock, where RAMspeed is in mega-transfers per second.
Higher CL numbers at a higher memory frequency often work out to a similar absolute latency (in seconds). In other words, modern RAM has higher CAS latency timing numbers because more clock cycles happen in the same amount of time.
Bandwidth has vastly improved, while latency has stayed nearly constant, according to these graphs from Crucial which explain CL vs. frequency.
Of course this is not "the memory latency", or the "true" memory latency.
It's the CAS latency of the DRAM itself, and is the most important factor in latency between the memory controller and the DRAM, but is only a part of the latency between a CPU core and memory. There is non-negligible latency inside the CPU between the core and uncore (L3 and memory controller). Uncore is Intel terminology; IDK what AMD calls the parts of the memory hierarchy in their various microarchitectures.
Especially many-core Xeon CPUs have significant latency to L3 / memory controller, because of the large ring bus(es) connecting all the cores. A many-core Xeon has worse L3 and memory latency than a similar dual or quad-core with the same memory and CPU clock frequencies.
This extra latency actually limits single-thread / single-core bandwidth on a big Xeon to worse than a laptop CPU, because a single core can't keep enough requests in flight to fill the memory pipeline with that much latency. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?.
Ok I found the answer.
Everytime the manufacturers increased the memory clock speed they did it in a constant rate which always was the double (2x) of the FSB clock speed. ie
MEM CLK FSB
-------------------
DDR200 100 MHz
DDR266 133 MHz
DDR333 166 MHz
DDR400 200 MHz
DDR2-400 200 MHz
DDR2-533 266 MHz
DDR2-667 333 MHz
DDR2-800 400 MHz
DDR2-1066 533 MHz
DDR3-800 400 MHz
DDR3-1066 533 MHz
DDR3-1333 666 MHz
DDR3-1600 800 MHz
So, the memory module has always the double speed of the FSB.

Calculating Effective CPI when using write-through/write-back architecture

So I'm trying to understand a homework problem given by an instructor and I'm honestly lost - I understand the concept of write-through/write-back, etc. but I can't figure out the actual calculations needed for the effective CPI, could anyone give me a hand? (The problem follows:
The following table provides the statistics of a cache for a
particular program. It is known that the base CPI (without cache
misses) is 1. It is also known that the memory bus bandwidth (the
bandwidth to transfer data between cache and memory) is 4 bytes per
cycle, and it takes one cycle to send the address before data
transfer. The memory spends 10 cycles to store data from bus or fetch
data to bus. The clock rate used by memory and the bus is a quarter of
the CPU clock rate.
Data reads per 1000 instructions: 100
Data writes per 1000 instructions: 150
Instruction cache miss rate: 0.4%
Data cache miss rate: 3%
Block size in bytes: 32
The effective CPI is the base CPU plus the CPI contribution from cache misses.
The cache miss CPI is the sum of the of instruction cache CPI and data cache CPI.
The cache miss cost is the cost of reading or writing to memory, so we will need that.
The cost in bus cycles is 1 (for the address) plus 10 (memory busy time) + 8 (32 byte blocks size divided by 4 bytes/cycle) = 19 cycles. Multiply this by 4 to get CPU cycles. Total is 76 CPU cycles.
So the cost for I cache misses is .004 * 76 = .304 cycles.
The cost for D caches misses is (.10 + .15) * .03 * 76 = .57 cycles
So the effective CPI is 1 + .304 + .57 = 1.874 cycles.

GT200 Single Precision Peak Performance

I was trying to verify the single precision peak performance of a reference GT200 card.
From http://www.realworldtech.com/gt200/9/, we have two facts about GT200 –
The latency of the fastest operation for an SP core is 4 cycles.
The SFU takes 4 cycles too to finish an operation.
Now, each SM has a total of 8 SPs and 2 SFUs, with each SFU having 4 FP multiply units and these SPs and SFUs can work at the same time as they are on two different ports as explained in their SM level diagrams. Each SP can perform MAD operation.
So, we are looking at 8 MAD operations and 8 MUL operations per 4 SP cycles. This gives us 16 + 8 = 24 operations per 4 SP clock cycles as MAD counts as 2 operations. Since 2 SP clock cycle counts as one shader clock, we have 24/2 = 12 operations per shader clock.
For a reference GT200 card, shader clock = 1296 MHz/s.
Thus, the single precision peak performance must be = 1296 MHz/s * 30 SM * 12 operations per shader clock = 466.560 GFLOPS
This is exactly half of the GFLOPS as reported in the specs. So where am I going wrong?
Edit: After Robert’s pointer to the CUDA Programming Guide that says 8MADs/shader clock can be performed in a GT200 SM, I would have to question how latency and throughput relate to each other in this particular SM.
There is a latency of one OP / 4 SP cycles (as pointed out earlier), thus one MAD every 4 SP cycles, right? We have 8 SPs, so it becomes 8 MADs for every 4 SP cycles in an SM.
Since 2 SP cycles form one shader cycle, so we are left with => 8MADs per 2 shader clock cycles
=> 4 MADs per shader clock.
This doesn’t match with the 8MADs/shader clock from the Programming Guide.
So, what am I doing wrong again?
Latency and throughput are not the same thing.
A cc 1.x SM can retire 8 single precision floating point MAD operations on every clock cycle.
This is the correct formula:
1296 MHz(cycle/s) * 30 SM * (8 SP/SM * 2 flop/cycle per SP + 2 SFU/SM * 4 FPU/SFU * 1 flop/cycle per FPU)
= 622080 Mflop/s + 311040 Mflop/s = 933 GFlop/s single precision
From here
EDIT: The 4-cycle latency you're referring to is the latency of a warp (i.e. 32 threads) MAD instruction, as issued to the SM, not the latency of a single MAD operation on a single SP. The FPU in each SP can generate one MAD result per clock, and there are 8 SP's in one SM, so each SM can generate 8 MAD results per clock. Since a warp (32 threads) MAD instruction requires 32 MAD results, it requires 4 total clocks to complete the warp instruction, as issued to the SPs in the SM.
The FPU in the SP can generate one new MAD result per clock. From the standpoint of instruction issue, the fundamental unit is the warp. Therefore a warp MAD instruction requires 4 clocks to complete.
EDIT2: Responding to question below.
Preface: The FPUs in the SFU are not independently schedulable. They only come into play when an instruction is scheduled to the SFUs. There are 4 FPU per SFU, and an SFU instruction requires 16 cycles (since there are 2 SFU/SM) to complete for a warp. If all 4 FPU in both SFUs were fully utilized, that would be 128 (16x4x2) flops produced during the computation of the SFU instruction, in 16 cycles. This is added to the 256 (16x2x8) total flops that could be generated by the "regular" MAD FPUs in the SM during the same time (16 cycles).
Your question seems to be interpreting the observed benchmark result and this statement in the text:
Table III also shows that the throughput for single-precision
floating point multiplication is 11.2 ops/clock, which means
that multiplication can be issued to both the SP and SFU
units. This suggests that each SFU unit is capable of doing
2 multiplications per cycle, twice the throughput of other
(more complex) instructions that map to this unit.
as an indication of either the throughput of the FPUs in the SFU or else the number of FPUs in the SFU. However you are conflating benchmark data with a theoretical number. The SFU has 4 FPU, but this does not mean that all 4 are independently schedulable for arbitrary arithmetic or instruction streams. Seeing all 4 FPU take on a new floating point instruction in a given cycle may require a specific instruction sequence that the authors haven't used.

Calculating CPU Performance in MIPS

i was taking an exam earlier and i memorized the questions that i didnt know how to answer but somehow got it correct(since the online exam using electronic classrom(eclass) was done through the use of multiple choice.. The exam was coded so each of us was given random questions at random numbers and random answers on random choices, so yea)
anyways, back to my questions..
1.)
There is a CPU with a clock frequency of 1 GHz. When the instructions consist of two
types as shown in the table below, what is the performance in MIPS of the CPU?
-Execution time(clocks)- Frequency of Appearance(%)
Instruction 1 10 60
Instruction 2 15 40
Answer: 125
2.)
There is a hard disk drive with specifications shown below. When a record of 15
Kbytes is processed, which of the following is the average access time in milliseconds?
Here, the record is stored in one track.
[Specifications]
Capacity: 25 Kbytes/track
Rotation speed: 2,400 revolutions/minute
Average seek time: 10 milliseconds
Answer: 37.5
3.)
Assume a magnetic disk has a rotational speed of 5,000 rpm, and an average seek time of 20 ms. The recording capacity of one track on this disk is 15,000 bytes. What is the average access time (in milliseconds) required in order to transfer one 4,000-byte block of data?
Answer: 29.2
4.)
When a color image is stored in video memory at a tonal resolution of 24 bits per pixel,
approximately how many megabytes (MB) are required to display the image on the
screen with a resolution of 1024 x768 pixels? Here, 1 MB is 106 bytes.
Answer:18.9
5.)
When a microprocessor works at a clock speed of 200 MHz and the average CPI
(“cycles per instruction” or “clocks per instruction”) is 4, how long does it take to
execute one instruction on average?
Answer: 20 nanoseconds
I dont expect someone to answer everything, although they are indeed already answered but i am just wondering and wanting to know how it arrived at those answers. Its not enough for me knowing the answer, ive tried solving it myself trial and error style to arrive at those numbers but it seems taking mins to hours so i need some professional help....
1.)
n = 1/f = 1 / 1 GHz = 1 ns.
n*10 * 0.6 + n*15 * 0.4 = 12 ns (=average instruction time) = 83.3 MIPS.
2.)3.)
I don't get these, honestly.
4.)
Here, 1 MB is 10^6 bytes.
3 Bytes * 1024 * 768 = 2359296 Bytes = 2.36 MB
But often these 24 bits are packed into 32 bits b/c of the memory layout (word width), so often it will be 4 Bytes*1024*768 = 3145728 Bytes = 3.15 MB.
5)
CPI / f = 4 / 200 MHz = 20 ns.

Resources