Do the high-resolution performance counter and the 64-bit RTC both work by counting CPU cycles since the system started? Do both use the same hardware?
Jit
It is up to the motherboard and BIOS; the HAL (Hardware Abstraction Layer) picks it up. But the counter traditionally wasn't based on CPU cycles, and it certainly isn't these days with variable CPU clock rates. The motherboard builder usually picks a frequency available in the chipset. The traditional rate was 1.1932 MHz, the NTSC color burst frequency divided by 3 and the clock source of the Intel 8253 timer chip. But that is no longer around, due to relentless cost cutting.
Always use QueryPerformanceFrequency().
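As a rough illustration of why the frequency should be queried rather than assumed, the sketch below converts a tick delta to seconds by dividing by whatever frequency the OS reports, and works out the traditional 8253 rate from the NTSC color burst. The helper name ticks_to_seconds and the sample tick values are hypothetical; on Windows the two inputs would come from QueryPerformanceCounter() and QueryPerformanceFrequency().

```python
from fractions import Fraction

# NTSC color burst frequency: exactly 315/88 MHz (about 3.579545 MHz).
NTSC_COLOR_BURST_HZ = Fraction(315_000_000, 88)

# Traditional PC timer rate: the color burst divided by 3 (about 1.1932 MHz).
PIT_HZ = NTSC_COLOR_BURST_HZ / 3


def ticks_to_seconds(start_ticks, end_ticks, frequency_hz):
    """Convert a counter delta to seconds using the reported tick frequency."""
    return (end_ticks - start_ticks) / frequency_hz


print(f"traditional timer rate: {float(PIT_HZ) / 1e6:.4f} MHz")
# Hypothetical counter readings taken at the traditional rate.
print(f"elapsed: {float(ticks_to_seconds(1_000, 1_194_182, PIT_HZ)):.6f} s")
```

The point of passing frequency_hz in as a parameter is that the same code keeps working when the platform picks a different counter source.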
Related
On recent x86, RDTSC returns some pseudo-counter that measures time instead of clock cycles.
Given this, how do I measure actual clock cycles for the current thread/program?
Platform-wise, I prefer Windows, but a Linux answer works too.
This is not simple. The behaviour is described in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Vol. 3B:
Here is the behaviour:
For Pentium M processors; for Pentium 4 processors, Intel Xeon processors; and for P6 family processors: the time-stamp counter increments
with every internal processor clock cycle. The internal processor clock cycle is determined by the current core-clock to bus-clock ratio. Intel®
SpeedStep® technology transitions may also impact the processor clock.
For Pentium 4 processors, Intel Xeon processors; for Intel Core Solo
and Intel Core Duo processors; for the Intel Xeon processor 5100 series and Intel Core 2 Duo processors; for Intel Core 2 and Intel Xeon processors; for Intel Atom processors: the time-stamp counter increments at a constant rate. That rate may be set by the maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at which the processor is booted. The maximum resolved frequency may differ from the processor base frequency. On certain processors, the TSC frequency may not be the same as the frequency in the brand string.
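On Linux, one way to see which of these behaviours a given machine reports is to look at the CPU feature flags. The sketch below parses /proc/cpuinfo-style text for the constant_tsc and nonstop_tsc flags that the Linux kernel exposes. The function name tsc_flags and the sample text are made up for illustration, and the flags only reflect what the kernel detected.

```python
def tsc_flags(cpuinfo_text):
    """Return the TSC-related feature flags listed in /proc/cpuinfo-style text."""
    wanted = {"tsc", "constant_tsc", "nonstop_tsc"}
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Only the first CPU entry is inspected (assumes all cores agree).
            return wanted & set(value.split())
    return set()


# Made-up sample; on a real Linux system read the text from /proc/cpuinfo.
SAMPLE = "processor\t: 0\nflags\t\t: fpu tsc msr constant_tsc nonstop_tsc\n"
print(sorted(tsc_flags(SAMPLE)))
```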
Here is the advice for your use case:
To determine average processor clock frequency, Intel recommends the use of performance monitoring logic to count processor core clocks over the period of time for which the average is required. See Section 18.17, “Counting Clocks on systems with Intel Hyper-Threading Technology in Processors Based on Intel NetBurst® Microarchitecture,” and Chapter 19, “Performance-Monitoring Events,” for more information.
The bad news is that, AFAIK, performance counters are often not portable between AMD and Intel processors, so you will need to check the AMD documentation for which performance counters to use. There are also complications: you cannot easily measure the number of cycles taken by arbitrary code. For example, the processor can be halted or enter a sleep state for a short period of time (see C-states), or the OS can be executing protected code that cannot be profiled without high privileges (for the sake of security). This method is fine as long as you need to measure the cycle count of numerically intensive code that runs for a relatively long time (at least several dozen cycles). On top of all that, the documentation and usage of MSRs is pretty complex and comes with some restrictions.
Performance counters like CPU_CLK_UNHALTED.THREAD and CPU_CLK_UNHALTED.REF_TSC seem a good start for what you want to measure. Using a library to read such performance counters is generally a very good idea (unless you like having a headache for at least a few days). PAPI might be enough to do the job.
Here are some interesting related posts:
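To make the relationship between those two events concrete, here is a small sketch of the arithmetic. It assumes CPU_CLK_UNHALTED.THREAD counts core clock cycles while the thread is not halted and CPU_CLK_UNHALTED.REF_TSC counts at the TSC rate over the same unhalted time, so their ratio scaled by the TSC frequency gives the average unhalted core frequency. The function name and the counter values are hypothetical; actually reading the counters (via PAPI or similar) is not shown.

```python
def average_core_hz(core_cycles, ref_tsc_cycles, tsc_hz):
    """Average core frequency while not halted, from two counter deltas."""
    if ref_tsc_cycles <= 0:
        raise ValueError("no unhalted reference cycles were counted")
    return tsc_hz * core_cycles / ref_tsc_cycles


# Hypothetical deltas: 3.6e9 core cycles while 3.2e9 reference cycles elapsed
# on a CPU whose TSC ticks at 3.2 GHz.
hz = average_core_hz(3_600_000_000, 3_200_000_000, 3_200_000_000)
print(f"{hz / 1e9:.2f} GHz")
```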
Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC
How to read performance counters by rdpmc instruction?
Larger caches usually have longer bitlines or wordlines, and thus most likely higher access latency and cycle time.
So, do L2 caches work in the same clock domain as L1 caches? How about L3 cache (slices), since they are now non-inclusive and shared among all the cores?
And related questions are:
Are all function units in a core in the same clock domain?
Are the uncore part all in the same clock domain?
Are cores in the multi-core system synchronous?
I believe clock domain crossing would introduce extra latency. Do most parts of a CPU chip work in the same clock domain?
The private L1i/d caches are always part of each core, not on a separate clock, in modern CPUs (see footnote 1). L1d is very tightly coupled with load execution units, and the L1dTLB. This is pretty universally true across architectures. (VIPT Cache: Connection between TLB & Cache?).
On CPUs with per-core private L2 cache, it's also part of the core, in the same frequency domain. This keeps L2 latency very low by keeping timing (in core clock cycles) fixed, and not requiring any async logic to transfer data across clock domains. This is true on Intel and AMD x86 CPUs, and I assume most other designs.
Footnote 1: Decades ago, when even having the L1 caches on-chip was a stretch for transistor budgets, sometimes just the comparators and maybe tags were on-chip, so that part could go fast while starting to set up the access to the data on external SRAM. (Or if not external, sometimes a separate die (piece of silicon) in the same plastic / ceramic package, so the wires could be very short and not exposed as external pins that might need ESD protection, etc).
Or for example early Pentium II ran its off-die / on-package L2 cache at half core clock speed (down from full speed in PPro). (But all the same "frequency domain"; this was before DVFS dynamic frequency/voltage for power management.) L1i/d was tightly integrated into the core like they still are today; you have to go farther back to find CPUs with off-die L1, like maybe early classic RISC CPUs.
The rest of this answer is mostly about Intel x86 CPUs, because from your mention of L3 slices I think that's what you're imagining.
How about L3 cache (slices) since they are now non-inclusive and shared among all the cores?
Of mainstream Intel CPUs (P6 / SnB-family), only Skylake-X has non-inclusive L3 cache. Intel since Nehalem has used inclusive last-level cache so its tags can be a snoop filter. See Which cache mapping technique is used in intel core i7 processor?. But SKX changed from a ring to a mesh, and made L3 non-inclusive / non-exclusive.
On Intel desktop/laptop CPUs (dual/quad), all cores (including their L1+L2 caches) are in the same frequency domain. The uncore (the L3 cache + ring bus) is in a separate frequency domain, but I think normally runs at the speed of the cores. It might clock higher than the cores if the GPU is busy but the cores are all idle.
The memory clock stays high even when the CPU clocks down. (Still, single-core bandwidth can suffer if the CPU decides to clock down from 4.0 to 2.7GHz because it's running memory-bound code on the only active core. Single-core bandwidth is limited by max_concurrency / latency, not by DRAM bandwidth itself if you have dual-channel DDR4 or DDR3. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? I think this is because of increased uncore latency.)
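The max_concurrency / latency limit is just Little's law. A minimal sketch, assuming 64-byte cache lines and illustrative values for the number of outstanding line requests and the load latency (both numbers are assumptions, not measurements):

```python
CACHE_LINE_BYTES = 64


def max_single_core_bandwidth(outstanding_lines, latency_ns):
    """Upper bound on bytes/second when a core keeps `outstanding_lines`
    cache-line requests in flight and each takes `latency_ns` to complete."""
    return outstanding_lines * CACHE_LINE_BYTES / (latency_ns * 1e-9)


# Same concurrency, two assumed latencies: higher latency means less bandwidth.
for latency_ns in (70, 90):
    gb_per_s = max_single_core_bandwidth(10, latency_ns) / 1e9
    print(f"{latency_ns} ns -> {gb_per_s:.1f} GB/s")
```

With a fixed number of requests in flight, anything that adds latency (such as a slower uncore) lowers the achievable single-core bandwidth, independent of what the DRAM itself can deliver.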
The wikipedia Uncore article mentions overclocking it separately from the cores to reduce L3 / memory latency.
On Haswell and later Xeons (E5 v3), uncore (the ring bus and L3 slices) and each individual core have separate frequency domains. (Source: Frank Denneman's NUMA Deep Dive Part 2: System Architecture. It has a typo, saying Haswell (v4) when Haswell is actually Xeon E[357]-xxxx v3. But other sources, like the paper Comparisons of core and uncore frequency scaling modes in quantum chemistry application GAMESS, confirm that Haswell does have those features.) Uncore Frequency Scaling (UFS) and Per Core Power States (PCPS) were both new in Haswell.
On Xeons before Haswell, the uncore runs at the speed of the current fastest core on that package. On a dual-socket NUMA setup, this can badly bottleneck the other socket, by making it slow keeping up with snoop requests. See John "Dr. Bandwidth" McCalpin's post on this Intel forum thread:
On the Xeon E5-26xx processors, the "uncore" (containing the L3 cache, ring interconnect, memory controllers, etc), runs at a speed that is no faster than the fastest core, so the "package C1E state" causes the uncore to also drop to 1.2 GHz. When in this state, the chip takes longer to respond to QPI snoop requests, which increases the effective local memory latency seen by the processors and DMA engines on the other chip!
... On my Xeon E5-2680 chips, the "package C1E" state increases local latency on the other chip by almost 20%
The "package C1E state" also reduces sustained bandwidth to memory located on the "idle" chip by up to about 25%, so any NUMA placement errors generate even larger performance losses.
Dr. Bandwidth ran a simple infinite-loop pinned to a core on the other socket to keep it clocked up, and was able to measure the difference.
Quad-socket-capable Xeons (E7-xxxx) have a small snoop filter cache in each socket. Dual-socket systems simply spam the other socket with every snoop request, using a good fraction of the QPI bandwidth even when they're accessing their own local DRAM after an L3 miss.
I think Broadwell and Haswell Xeon can keep their uncore clock high even when all cores are idle, exactly to avoid this bottleneck.
Dr. Bandwidth says he disables package C1E state on his Haswell Xeons, but that probably wasn't necessary. He also posted some stuff about using Uncore perf counters to measure uncore frequency to find out what your CPU is really doing, and about BIOS settings that can affect the uncore frequency decision-making.
More background: I found https://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/4 about some changes like new snoop mode options (which hop on the ring bus sends snoops to the other core), but it doesn't mention clocks.
A larger cache may have a higher access time, but still it could sustain one access per cycle per port by fully pipelining it. But it also may constrain the maximum supported frequency.
In modern Intel processors, the L1i/L1d and L2 caches and all functional units of a core are in the same frequency domain. On client processors, all cores of the same socket are also in the same frequency domain because they share the same frequency regulator. On server processors (starting with Haswell, I think), each core is in a separate frequency domain.
In modern Intel processors (since Nehalem I think), the uncore (which includes the L3) is in a separate frequency domain. One interesting case is when a socket is used in a dual NUMA nodes configuration. In this case, I think the uncore partition of each NUMA node would still both exist in the same frequency domain.
There is special circuitry used to cross frequency domains, and all cross-domain communication has to pass through it. So yes, I think it incurs a small performance overhead.
There are other frequency domains. In particular, each DRAM channel operates in its own frequency domain. I don't know whether current processors support running different channels at different frequencies.
Is it possible to query the number of execution units/ports per core and similar information on an Intel CPU?
I have an assembly program, and noticed that the performance is quite different on different CPUs. For example, on a Core i5 4570, some functions consistently take 25% more cycles to complete than on a Core i7 4970HQ. They are both Haswell based, from the same generation. No memory movement is involved in the part of the program benchmarked. So I am thinking maybe the difference comes from details such as the number of execution units, number of ports, etc. The benchmark measures single-core CPU cycles, so frequencies/HT etc. do not come into play.
Am I right to assume such an explanation for the performance difference? If yes, where can I find such information for specific CPUs? And is it possible to query it dynamically? If so, I could dispatch dynamically based on that information, distribute uops more evenly, and use similar techniques to optimize the program for multiple CPUs.
Did you time reference cycles (RDTSC) instead of core clock cycles (with perf counters)? That would explain your observations.
Turbo makes a big difference, and the ratio between max turbo and max sustained / rated clock speed (i.e. reference cycle tick rate) is different on different CPUs. e.g. see my answer on this related question
The lower the CPU's TDP, the bigger the ratio between sustained and peak. The Haswell wikipedia article has tables:
84W desktop i5 4570: sustained 3.2GHz = RDTSC frequency, max turbo 3.6GHz (the speed the core was probably actually running for most of your benchmark, if it had time to go up from low-power idle speed).
47W laptop i7-4960HQ: 2.6GHz sustained = RDTSC frequency vs. 3.8GHz max turbo.
Time your code with performance counters, and look at the "core clock cycles" count. (And lots of other neat stuff).
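As a sketch of how much this can skew a comparison, the arithmetic below uses the sustained and max-turbo figures quoted above and assumes both CPUs ran at max turbo for the whole benchmark (an assumption; real clocks vary). The function name is hypothetical.

```python
def ref_cycles_per_core_cycle(tsc_ghz, actual_core_ghz):
    """How many RDTSC reference ticks elapse per real core clock cycle."""
    return tsc_ghz / actual_core_ghz


i5_4570 = ref_cycles_per_core_cycle(3.2, 3.6)
i7_4960hq = ref_cycles_per_core_cycle(2.6, 3.8)

# The same number of core cycles shows up as different RDTSC deltas.
print(f"i5 4570:   {i5_4570:.3f} reference ticks per core cycle")
print(f"i7-4960HQ: {i7_4960hq:.3f} reference ticks per core cycle")
print(f"apparent extra cost on the i5: {i5_4570 / i7_4960hq - 1:.0%}")
```

Under those assumptions, identical code would appear to take roughly 30% more RDTSC ticks on the i5, even though the core cycle count is the same.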
Every Haswell core is identical from Core-M 5Watt CPUs to high-power quad core to 18-core Xeon (which actually has a per-core power-budget more like a laptop CPU); it's only the L3 caches, number of cores (and interconnect), and support or not for HT and/or Turbo that differ. Basically everything outside the cores themselves can be different, including the GPU. They don't disable execution ports, and even the L1/L2 caches are identical. I think disabling execution ports would require significant redesigns in the out-of-order scheduler and stuff like that.
More importantly, every port has at least one execution unit that isn't found on any other port: p0 has the divider, p1 has the integer multiply unit, p5 has the shuffle unit, and p6 is the only port that can execute predicted-taken branches. Actually, p2 and p3 are identical load ports (and can handle store-address uops)...
See Agner Fog's microarch pdf for more about Haswell internals, and also David Kanter's writeup with diagrams of the different blocks.
(However, it's not strictly true that the entire core is identical: Haswell Pentium/Celeron CPUs don't support AVX/AVX2, or BMI/BMI2. I think they do that by disabling decode of VEX prefixes in the decoders. This is still the case for Skylake Pentiums/Celerons, so thanks Intel for delaying the time when we can assume support for new instruction sets. Presumably they do this so CPUs with defects in only the upper or lower half of their vector execution units can still be sold as Celeron or Pentium, just like CPUs with a defect in some of their L3 can be sold as i5 instead of i7.)
I'm working on building a Dynamic Voltage Frequency Scaling (DVFS) algorithm for a video decoding application operating on an Intel core i7 6500U CPU (Skylake). The application is to support both software as well as hardware decoder modules and the software decoder is working as expected. It controls the operational frequency of the CPU which eventually controls the operational voltage, thereby reducing the overall energy consumption.
My question is regarding the hardware decoder which is available in the Intel skylake processor (Intel HD graphics 520) which performs the hardware decoding. The experimental results for the two decoders suggest that the energy consumption reduction is much less in the hardware decoder compared to the software decoder when using the DVFS algorithm.
Does the CPU frequency level that the software sets before passing a video frame to the hardware decoder actually have an impact on the energy consumption of the hardware decoder?
Does the Intel HD graphics 520 GPU on the same chip as the CPU have any impact on the CPU's operational frequency and the voltage level?
Why did you need to implement your own DVFS in the first place? Didn't Skylake's self-regulating mode work well? (where you let the CPU's hardware power management controller make all the frequency decisions, instead of just choosing whether to turbo or not).
Setting the CPU core clock speeds should have little to no effect on the GPU's DVFS. It's in a separate domain, and not linked to any of the cores (which can each choose their clocks individually). As you can see on Wikipedia, that SKL model can scale its GPU clocks from 300MHz to 1050MHz, and is probably doing so automatically if you're using an OS running Intel's normal graphics drivers.
For more about how Skylake power management works under the hood, see Efraim Rotem's (Lead Client Power Architect) IDF2015 talk (audio+slides, very good stuff). The title is Skylake Deep Dive: A New Architecture to Manage Power Performance and Energy Efficiency.
There's a link to the list of IDF2015 sessions in the x86 tag wiki.
I'm aware of the standard methods of getting time deltas using CPU clock counters on various operating systems. My question is, how do such operating systems account for changes in CPU frequency made for power-saving purposes? I initially thought this could be explained by the OS using specific calls to measure frequency, getting the corrected frequency based on which core is being used, what frequency it's currently set to, etc. But then I realized: wouldn't that make any time delta inaccurate if the CPU frequency was lowered and raised back to its original value in between two clock queries?
For example take the following scenario:
Query the CPU cycles.
Operating system lowers CPU frequency for power saving.
Some other code is run here.
Operating system raises CPU frequency for performance.
Query the CPU cycles.
Calculate delta as cycle difference divided by frequency.
This would yield an inaccurate delta since the CPU frequency was not constant between the two queries. How is this worked around by the operating system or programs that have to work with time deltas using CPU cycles?
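The size of the error in that scenario can be sketched with a tiny simulation. The timeline and frequencies below are invented for illustration, and the counter is assumed to tick once per core clock cycle (which, as noted elsewhere, is not how a constant-rate TSC behaves).

```python
def cycles_elapsed(segments):
    """Total cycles over (duration_seconds, frequency_hz) segments."""
    return sum(duration * hz for duration, hz in segments)


# Hypothetical timeline between the two queries: fast, slowed down, fast again.
segments = [(1.0, 3e9), (2.0, 1e9), (1.0, 3e9)]

true_seconds = sum(duration for duration, _ in segments)
# Naive delta: divide the cycle difference by the frequency seen at query time.
naive_seconds = cycles_elapsed(segments) / 3e9

print(f"true elapsed:  {true_seconds:.3f} s")
print(f"naive elapsed: {naive_seconds:.3f} s")
```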
See "wrong clock cycle measurements with rdtsc".
There are several ways to deal with it:
Set the CPU clock to max
Read the link above to see how to do it.
Use the PIT instead of RDTSC
The PIT is the programmable interval timer (Intel 8253, if I remember correctly). It has been present on all PC motherboards since the 286 (and maybe even before), but its clock is only ~1.19 MHz and not all OSes give you access to it.
Combine PIT and RDTSC
Measure the CPU clock with the PIT repeatedly; when it is stable enough, start your measurement (and keep scanning for CPU clock changes). If the CPU clock changes during the measurement, throw the measurement away and start again.
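The combined approach can be sketched as a retry loop. This is only an outline under stated assumptions: run and read_cpu_hz are caller-supplied placeholders (the second would be something like an RDTSC-against-PIT frequency estimate), the function name and tolerance are invented, and the frequency is sampled only before and after each run rather than continuously.

```python
def measure_with_stable_clock(run, read_cpu_hz, tolerance=0.01, max_tries=10):
    """Run `run()` and return (result, hz), retrying if the CPU clock moved.

    `run` performs the measurement and `read_cpu_hz` estimates the current
    CPU frequency; both are placeholders supplied by the caller.
    """
    for _ in range(max_tries):
        hz_before = read_cpu_hz()
        result = run()
        hz_after = read_cpu_hz()
        # Accept the measurement only if the frequency stayed within tolerance.
        if abs(hz_after - hz_before) <= tolerance * hz_before:
            return result, hz_before
    raise RuntimeError("CPU clock never stayed stable during a measurement")


# Fake frequency readings standing in for real PIT-based estimates:
# the first attempt sees a clock change, the second is stable.
readings = iter([3.0e9, 1.2e9, 3.0e9, 3.0e9])
cycles, hz = measure_with_stable_clock(lambda: 6_000_000, lambda: next(readings))
print(f"{cycles / hz * 1e3:.3f} ms")
```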