In this slide, things look a little off to me. Clock cycle time, or clock period, is already the time required per clock cycle. The question is: does the term Clock Rate make sense?
It also says a hardware designer must often trade off clock rate against cycle count. But they are inversely related: if one increases the clock speed, the clock period (the time per clock cycle) shrinks automatically. Why would there be a choice?
Or am I missing something?
First things first, slides aren't always the best way to discuss technical issues. Don't take any slide as gospel. There's a huge amount of handwaving going on to support gigantic claims with so little evidence.
That said, there are tradeoffs:
faster clocks are usually better: you get more integer or floating point operations done per second
but if the faster clock doesn't line up well with external memory clocks, some of those cycles might be wasted
slower clocks might draw less power
faster clocks allow an operating system kernel to get more work done with every wakeup and return to sleep faster, thus they might draw less power
faster clocks might mean some operations take more clock cycles to actually execute (think of the supremely deep pipelines of the Pentium 4: branch mis-predictions were very costly, so despite its faster clock than the Pentium III or Pentium M, real-world speeds were very similar across those processors.)
Clock Rate simply means frequency, which is the reciprocal of the time of a single clock cycle, so the equations make perfect sense.
Regarding the second question, cycle count is the same as "CPU Clock Cycles"; it is not the same as clock period or time per clock cycle.
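To make the relationship concrete (numbers invented for illustration): CPU time = CPU clock cycles × clock cycle time = CPU clock cycles / clock rate. A program needing 10^9 clock cycles at a 2 GHz clock rate (0.5 ns per cycle) takes 10^9 × 0.5 ns = 0.5 s; the same program at 4 GHz takes 0.25 s, unless the design change that raised the clock rate also raised the cycle count, which is exactly the tradeoff the slide is talking about.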
Let's suppose I change one single bit in a word and add two other words.
Does changing one bit in a word consume fewer CPU cycles than changing an entire word?
If it consumes fewer CPU cycles, how much faster would it be?
Performance (in clock cycles) is not data-dependent for integer ALU instructions other than division on most CPUs. ADD and XOR have the same 1-cycle latency on the majority of modern pipelined CPUs. (And the same cycle cost as each other on most older / simpler CPUs, whether or not it's 1 cycle.)
See https://agner.org/optimize/ and https://uops.info/ for numbers on modern x86 CPUs.
Lower power can indirectly affect performance by allowing higher boost clocks without having to slow down for thermal limits. But the difference in this case is so small that I don't expect it would be measurable on a mainstream CPU, or even on something more optimized for low power like the efficiency cores of an Alder Lake or a mobile phone CPU.
Power in a typical CPU (using CMOS logic) scales with how many gates have their outputs change value per cycle. When a transistor switches on, it conducts current from Vcc or to ground, charging or discharging the tiny parasitic capacitance of the things the logic gate's output is connected to. Since the majority of the (low) resistance in the path of that current is in the transistor itself, that's where the electrical energy turns into heat.
For more details, see:
Why does switching cause power dissipation? on electronics.SE for the details for one CMOS gate
For a mathematical operation in CPU, could power consumption depend on the operands?
Modern Microprocessors: A 90-Minute Guide! has a section about power. (And read the whole article if you have any general interest in CPU architecture; it's good stuff.)
ADD does require carry propagation potentially across the whole width of the word, e.g. for 0xFFFFFFFF + 1, so ALUs use tricks like carry-lookahead or carry-select to keep the worst case gate-delay latency within one cycle.
So ADD involves more gates than a simple bitwise operation like XOR, but still not many compared to the amount of gates involved in controlling all the decode and other control logic to get the operands to the ALU and the result written back (and potentially bypass-forwarded to later instructions that use the result right away.)
Also, a typical ALU probably doesn't have fully separate adder vs. bitwise units, so a lot of those adder gates are probably seeing their inputs change, but control signals block carry propagation. (i.e. a typical ALU implements XOR using a lot of the same gates as ADD, but with control signals gating AND gates or something to allow or block carry propagation. XOR is add-without-carry.) An integer ALU in a CPU will usually be at least an adder-subtractor, so one of the inputs comes through multiple gates, with other control signals that can make it do bitwise ops.
But there are still maybe a few fewer bit-flips when doing an XOR operation than an ADD. Partly it would depend on what the previous outputs were (of whatever computation the ALU did in the previous cycle, not the value of one of the inputs to the XOR). But with carry propagation blocked by AND gates, flipping the inputs to those gates doesn't change the outputs, so less capacitance is charged or discharged.
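As a purely software illustration of the "XOR is add-without-carry" point (not a gate-level model; the helper below is something I made up, not anything a real ALU runs): addition decomposes into a carry-free XOR plus repeated carry propagation, which is exactly the part that carry-lookahead or carry-select hardware exists to speed up.

    // Illustration only, in C++: ADD really is XOR plus carry propagation.
    // Each loop iteration pushes carries one position further; a hardware
    // carry-lookahead adder collapses that worst case into a few gate delays.
    #include <cassert>
    #include <cstdint>

    uint32_t add_via_xor(uint32_t a, uint32_t b) {
        while (b != 0) {
            uint32_t carry = (a & b) << 1;  // positions that generate a carry
            a ^= b;                         // add without carry
            b = carry;                      // feed the carries back in
        }
        return a;
    }

    int main() {
        assert(add_via_xor(0xFFFFFFFFu, 1u) == 0u);   // worst-case carry ripple wraps to 0
        assert(add_via_xor(1234u, 4321u) == 5555u);
    }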
In a high-performance CPU, a lot of power is spent on pipelining and out-of-order exec, tracking instructions in flight, and writing back the results. So even the whole ALU ADD operation is a pretty minor component of total energy cost to execute the instruction. Small differences in that power due to operands are an even smaller difference. Pretty much negligible compared to how many gates flip every clock cycle just to get data and control signals sent to the right place.
Another tiny effect: if your CPU didn't do register renaming, then possibly a few fewer transistors might flip (in the register file's SRAM) when writing back the result if it's almost the same as what that register held before.
(Assuming an ISA like x86 where you do xor dst, src for dst ^= src, not a 3-operand ISA where xor dst, src1, src2 could be overwriting a different value if you didn't happen to pick the same register for dst and src1.)
If your CPU does out-of-order exec with register renaming, writes to the register file won't be overwriting the same SRAM cells as the original destination value, so it depends what other values were computed recently in registers.
If you want to see a measurable difference in power, run instructions like integer multiply, or FP mul or FMA. Or SIMD instructions, so the CPU is doing 4x or 8x 32-bit addition or shuffle in parallel. Or 8x 32-bit FMA. The max-power workload on a typical modern x86 CPU is two 256-bit FMAs per clock cycle.
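If you want a feel for what that kind of workload looks like, here is a rough sketch (my own illustration, not a calibrated power virus; the function name and constants are made up). It keeps several independent FMA dependency chains in flight so the 256-bit FMA units stay busy, and it needs a CPU with AVX2+FMA and something like -O2 -mavx2 -mfma.

    #include <immintrin.h>

    // Hypothetical FMA-heavy kernel: 8 independent accumulators hide the
    // multi-cycle FMA latency so the FMA execution ports can stay saturated.
    float fma_burn(long iters) {
        const __m256 x = _mm256_set1_ps(1.000001f);
        const __m256 y = _mm256_set1_ps(0.999999f);
        __m256 acc[8];
        for (int i = 0; i < 8; ++i)
            acc[i] = _mm256_set1_ps((float)i);

        for (long n = 0; n < iters; ++n)
            for (int i = 0; i < 8; ++i)
                acc[i] = _mm256_fmadd_ps(acc[i], x, y);   // acc = acc*x + y

        __m256 sum = acc[0];                 // reduce so the work isn't dead code
        for (int i = 1; i < 8; ++i)
            sum = _mm256_add_ps(sum, acc[i]);
        float out[8];
        _mm256_storeu_ps(out, sum);
        return out[0] + out[1] + out[2] + out[3] + out[4] + out[5] + out[6] + out[7];
    }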
See also:
Do sse instructions consume more power/energy? - Mysticial's answer is excellent, and discusses the race-to-sleep benefit of doing the same work faster and with fewer instructions, even if each one costs somewhat more power.
Why does the CPU get hotter when performing heavier calculations, compared to being idle?
How do I achieve the theoretical maximum of 4 FLOPs per cycle?
I'm designing a benchmark for a critical system operation. Ideally the benchmark can be used to detect performance regressions. I'm debating between using the total time for a large workload passed into the operation and counting the cycles taken by the operation as the measurement criterion for the benchmark.
The time to run each iteration of the operation in question is short, perhaps 300-500 nanoseconds.
A total time is much easier to measure accurately / reliably, and the measurement overhead is irrelevant. It's what I'd recommend, as long as you're sure you can stop your compiler from optimizing across iterations of whatever you're measuring. (Check the generated asm if necessary).
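A minimal sketch of that approach, assuming your operation is wrapped in a function I'll call run_op (a placeholder); the empty inline-asm statement is one GCC/Clang trick for keeping the compiler from optimizing across iterations:

    #include <chrono>
    #include <cstdio>

    // Placeholder for the operation under test; substitute the real thing.
    int run_op(int input) { return input * 3 + 1; }

    int main() {
        const long iters = 1000000;
        int sink = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (long i = 0; i < iters; ++i) {
            sink ^= run_op((int)i);
            asm volatile("" : "+r"(sink));   // keep the result "used" each iteration
        }
        auto t1 = std::chrono::steady_clock::now();
        std::chrono::duration<double, std::nano> dt = t1 - t0;
        std::printf("%.1f ns per iteration (sink=%d)\n", dt.count() / iters, sink);
    }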
If you think your runtime might be data-dependent and want to look into variation across iterations, then you might consider recording timestamps somehow. But 300 ns is only ~1k clock cycles on a 3.3GHz CPU, and recording a timestamp takes some time. So you definitely need to worry about measurement overhead.
Assuming you're on x86, raw rdtsc around each operation is pretty lightweight, but out-of-order execution can reorder the timestamps with the work. See Get CPU cycle count? and clflush to invalidate cache line via C function.
An lfence; rdtsc; lfence to stop the timing from reordering with each iteration of the workload will block out-of-order execution of the steps of the workload, distorting things. (The out-of-order execution window on Skylake is a ROB size of 224 uops. At 4 per clock that's a small fraction of 1k clock cycles, but in lower-throughput code with stalls for cache misses there could be significant overlap between independent iterations.)
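For reference, a serialized timestamp helper looks something like this (GCC/Clang on x86; __rdtsc and _mm_lfence come from the compiler's intrinsic headers, and fenced_rdtsc is just a name I made up):

    #include <x86intrin.h>

    // lfence before and after keeps rdtsc from reordering with the surrounding
    // work, at the cost of draining the out-of-order window each time.
    static inline unsigned long long fenced_rdtsc() {
        _mm_lfence();
        unsigned long long t = __rdtsc();
        _mm_lfence();
        return t;
    }

    // Usage: unsigned long long t0 = fenced_rdtsc();
    //        do_one_iteration();                              // your workload
    //        unsigned long long ticks = fenced_rdtsc() - t0;  // on modern x86 these are
    //                                                         // reference cycles, not core cycles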
Any standard timing functions like C++ std::chrono will normally call library functions that ultimately use rdtsc, but with many extra instructions. Or worse, will make an actual system call taking well over a hundred clock cycles to enter/leave the kernel, and more with Meltdown+Spectre mitigation enabled.
However, one thing that might work is using Intel-PT (https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing) to record timestamps on taken branches. Without blocking out-of-order exec at all, you can still get timestamps of when the loop branch in your repeat loop executed. This may well be independent of your workload and able to run soon after it's issued into the out-of-order part of the core, but that can only happen a limited distance ahead of the oldest not-yet-retired instruction.
I have an FPGA that is taking in serial data at a bit rate of, say, 4.8 kbps.
Now I am not sure what clock frequency my FPGA should run at to properly handle the data.
Will the clock speed simply need to be at minimum 4800 Hz?
It goes the other way round: you first have to determine how many clock cycles you need to process a single input "tick". If one cycle is enough to complete your processing, then 4800 Hz might be fine.
But if you need two cycles, then you would probably go with double speed.
This is a pretty generic answer, but your question is also pretty generic, so this is probably the best you can hope for without enhancing your input.
Will the clock speed simply need to be at minimum 4800 Hz?
Theoretical: yes, practical: no.
Theoretical.
You can receive a 4800 bps signal with a 4800 Hz clock, but only if the clock is exactly the right frequency (the incoming 4800 bps will deviate; no clock is perfect). For that you would need something like a PLL sitting in a measurement feedback loop, looking at the signal and keeping the clock in step.
Practical.
Much easier is to use, e.g., a 1 MHz FPGA clock and over-sample. Even then you have the same problems as with a dedicated clock: you need to know where the bit boundaries are. Again, some sort of clock-locking or edge-recognition mechanism is required. In fact you have to build the equivalent of a PLL, but you can do it all with registers and counters.
When running at 1MHz (which is very slow for an FPGA) you have plenty of clock cycles to process your data.
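As a rough illustration of the margin that gives you (assuming the nominal rates above): at 1 MHz and 4800 bps, each bit lasts about 1,000,000 / 4800 ≈ 208 clock cycles, so after detecting a start edge you can count roughly 104 cycles to sample near the middle of the bit and still tolerate a fair amount of clock mismatch.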
Both methods depend on the protocol you are using, which you did not mention, and they are only possible for some types of signals/serial protocols. For example, a signal that can stay low or high for many bit periods without a transition would cause problems for either method.
I'm aware of the standard methods of getting time deltas using CPU clock counters on various operating systems. My question is, how do such operating systems account for the change in CPU frequency for power-saving purposes? I initially thought this could be explained by the fact that OSes use specific calls to measure frequency, getting the corrected frequency based on which core is being used, what frequency it's currently set to, etc. But then I realized: wouldn't that make any time delta inaccurate if the CPU frequency was lowered and raised back to its original value between two clock queries?
For example take the following scenario:
Query the CPU cycles.
The operating system lowers the CPU frequency for power saving.
Some other code runs here.
The operating system raises the CPU frequency for performance.
Query the CPU cycles again.
Calculate the delta as the cycle difference divided by the frequency.
This would yield an inaccurate delta since the CPU frequency was not constant between the two queries. How is this worked around by the operating system or programs that have to work with time deltas using CPU cycles?
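To put numbers on it (invented for illustration, and assuming the counter counts actual core cycles rather than an invariant TSC): if the first query happens at 3 GHz, the CPU then spends 1 ms at 1.5 GHz running the other code, and the second query happens right after the frequency comes back up, the counter has advanced by about 1.5 million cycles. Dividing that by the 3 GHz measured at query time gives 0.5 ms, half the real elapsed time.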
See this: wrong clock cycle measurements with rdtsc.
There are several ways to deal with it:
set CPU clock to max
read the link above to see how to do it.
use PIT instead of RDTSC
The PIT is the programmable interval timer (Intel 8253/8254, if I remember correctly). It has been present on all PC motherboards since the 286 era (and maybe even before), but its base clock is only ~1.19 MHz and not every OS gives you access to it.
combine PIT and RDTSC
Just measure the CPU clock with the PIT repeatedly; when it is stable enough, start your measurement (and keep scanning for CPU clock changes). If the CPU clock changes during the measurement, throw the measurement away and start again.
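A sketch of that calibrate-and-watch idea, using std::chrono::steady_clock in place of the PIT (my substitution; names like tsc_hz are made up, and this only tells you anything on CPUs whose cycle counter actually follows the core clock rather than ticking at a fixed reference rate):

    #include <x86intrin.h>
    #include <chrono>
    #include <cmath>
    #include <thread>

    // Measure how many TSC ticks elapse per second over a short sampling window.
    double tsc_hz(std::chrono::milliseconds window = std::chrono::milliseconds(50)) {
        auto t0 = std::chrono::steady_clock::now();
        unsigned long long c0 = __rdtsc();
        std::this_thread::sleep_for(window);
        unsigned long long c1 = __rdtsc();
        auto t1 = std::chrono::steady_clock::now();
        std::chrono::duration<double> dt = t1 - t0;
        return (c1 - c0) / dt.count();
    }

    // Calibrate twice; only start measuring once consecutive readings agree,
    // and re-check afterwards: if the rate moved, discard the run and retry.
    bool clock_stable(double tolerance = 0.01) {
        double a = tsc_hz();
        double b = tsc_hz();
        return std::fabs(a - b) / a < tolerance;
    }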
I have the source code written and I want to measure efficiency as how many clock cycles it takes to complete a particular task. Where can I learn how many clock cycles different commands take? Does every command take the same amount of time on 8086?
RDTSC is the high-resolution clock fetch instruction.
Bear in mind that cache misses, context switches, instruction reordering and pipelining, and multicore contention can all interfere with the results.
Clock cycles and efficiency are not the same thing.
For efficiency of code you need to consider, in particular, how the memory is utilised, especially the differing levels of the cache. Also important is the branch-prediction behaviour of the code, etc. You want a profiler that tells you these things, ideally one that gives you profile-specific information; one example is CodeAnalyst for AMD chips.
To answer your question, particular base instructions do have a given (average) number of cycles (AMD release the approximate numbers for the basic maths functions in their maths library). These numbers are a poor place to start optimising code, however.