Instruction to get the current time on x86 - performance

Is there an x86 instruction to get the current time?
Basically... something like a replacement for clock_get_time ... something with the minimum overhead... where I don't really care about getting the time in any specific format... as long as it's a format I can use.
Basically I'm doing some work to "Detect how much PHYSICAL REAL LIFE TIME" has gone by... and I want to be able to measure time as frequently as possible!
I guess you can imagine i'm doing something like a profiling app... :)
I really need aggressively efficient access to the hardware time. So ideally... some ASM to get the time... store it somewhere... then massage it later into some format that I can actually process.
I'm not interested in _rdtsc as that measures the number of cycles gone by. I need to know how much physical time has elapsed... not cycles, which can vary due to thermal fluctuations and such..

For profiling, often it's most useful to profile in terms of CPU clock cycles, rather than wall-clock time. CPU dynamic clocking (turbo and power saving) makes it annoying to get the CPU ramped up to full speed before the start of a measurement period.
If you still need wall-clock time after that:
Recent x86 CPUs have a TSC that runs at a fixed rate, regardless of CPU frequency adjustment for power-saving. Also, the TSC doesn't stop when the CPU is halted. (i.e. no work to do, so it ran the HLT instruction to wait for an interrupt in low-power mode.)
It turned out that efficient access to a useful time-source was more useful to have in hardware than an actual clock cycle counter, so that's what RDTSC morphed into, a few CPU generations after its introduction. Now we're back to using hardware performance counters for measuring clock cycles.
In Linux, look for constant_tsc and nonstop_tsc in the CPU feature flags in /proc/cpuinfo. (There is also a CPUID bit for this: the invariant-TSC bit in leaf 0x80000007, EDX bit 8.) Or just reuse Linux's detection code for it (if you can use GPLed code).
On a CPU with those two key features, Linux uses the TSC as its clocksource, IIRC.
The lowest-overhead way to get the current time in user-space will be to work out the conversion between RDTSC ticks and real time. While profiling, you might just store 64-bit TSC snapshots, and convert to real time later (so you can handle TSC wraparound then). RDTSC only takes about 24 cycles (Agner Fog's instruction tables, Intel Haswell). I think the overhead of a system call will be an order of magnitude higher than that. (The kernel will have to do a RDTSC in there somewhere anyway).
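As a rough sketch of that approach (assuming Linux with GCC or Clang; the __rdtsc intrinsic from x86intrin.h compiles to a plain RDTSC, the ticks_per_ns helper name is made up, and the 50 ms calibration window against CLOCK_MONOTONIC is an arbitrary choice for illustration):
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>   /* __rdtsc */
/* Calibrate TSC ticks per nanosecond against CLOCK_MONOTONIC once at startup. */
static double ticks_per_ns(void)
{
    struct timespec t0, t1;
    long long ns;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t c0 = __rdtsc();
    do {    /* spin for ~50 ms to get a usable baseline */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL + (t1.tv_nsec - t0.tv_nsec);
    } while (ns < 50 * 1000000LL);
    uint64_t c1 = __rdtsc();
    return (double)(c1 - c0) / (double)ns;
}
int main(void)
{
    double ratio = ticks_per_ns();
    /* Hot path: just store raw 64-bit TSC snapshots, nothing else. */
    uint64_t start = __rdtsc();
    /* ... code being timed ... */
    uint64_t end = __rdtsc();
    /* Convert to wall-clock time later, outside the measured region. */
    printf("elapsed: %.1f ns\n", (double)(end - start) / ratio);
    return 0;
}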
Agner Fog has documented his profiling / timing methods, and has some example code. I haven't looked recently, but it might have useful stuff for this application.

Related

Are there deterministic architecture emulators available?

Does such a thing as a deterministic (as in, same result every run) architecture emulator exist? It would be used to benchmark compilers/interpreters.
I do not mean an emulator that simply runs your program on whatever simulated architecture, but something that would compute an efficiency/speed index based on an analysis of the generated code (e.g. it would assign a deterministic value to the time taken by each instruction).
I can compute benchmark statistics on a real machine, but a deterministic result would eliminate the particularities of my machine and allow me to see the effect of small optimizations.
Intel's IACA is a static analysis tool: see What is IACA and how do I use it? But it only works for a single loop and doesn't model cache effects, only the pipeline. (And it assumes nearly-ideal OoO scheduling, I think, so probably doesn't find ROB-size limits, only front-end vs. execution-port vs. loop-carried dependency latency bottlenecks.) Plus IACA has some bugs in its cost model (e.g. its unlamination rules for micro-fusion of indexed addressing modes are wrong for Haswell).
AFAIK, there are no cycle accurate x86 simulators publicly available for any modern micro-architecture. We only have emulators that don't even try to run at the same speed as any real hardware, just as fast as possible, like BOCHS and qemu. I'm sure Intel and AMD have simulator software internally to validate CPU designs and model their performance, though.
You could probably assign a cycle cost to every instruction in an interpreting emulator like BOCHS and get a deterministic number, and maybe model the cache, too (there are cache simulators). It would be the same every time you ran it, but it wouldn't correspond to the running time on any real hardware!
Being deterministic is nowhere near sufficient to be interesting for tuning software. Modern x86 CPUs have a lot of microarchitectural state for out-of-order execution. We can often predict very close to how they'll run a loop (http://agner.org/optimize/, and other performance links in the x86 tag wiki), but on a larger scale there are many things that are only known by the vendors, so we couldn't write a truly accurate simulator even if we had the time. Things like branch prediction are known in general terms, but the details have not been fully reverse-engineered. Yet branch prediction is a critical part of making a heavily pipelined CPU sustain anywhere near 3 to 4 fused-domain (front-end) uops per clock in real code.
Things get even more complicated if you want to model a multi-core machine, and SMT / HT adds lots of complexity between threads sharing a core. It's barely deterministic in the real hardware because small timing variations can lead to different threads getting farther out of sync.
To be really useful, you'd want to be able to test your code on Sandybridge, Haswell, Skylake, Bulldozer, Ryzen, and maybe Silvermont. And maybe different variants of those with different amounts of cache, and server vs. desktop where L3 / memory latency differs. (Many-core servers have significantly worse uncore latency, and lower single-threaded bandwidth even though the aggregate bandwidth is higher.)
So the whole idea of a deterministic simulator for "the x86 architecture" is weird. You could make one simply by giving each instruction a cost of 1 cycle, but that would be totally unrealistic.

Benchmarking - How to count number of instructions sent to CPU to find consumed MIPS

Consider I have a software and want to study its behavior using a black-box approach. I have a 3.0GHz CPU with 2 sockets and 4 cores. As you know, in order to find out instructions per second (IPS) we have to use the following formula:
IPS = sockets * (cores/socket) * clock * (instructions/cycle)
At first, I wanted to find the number of instructions per cycle for my specific algorithm. Then I realised it's almost impossible to count it using a black-box approach and I need to do an in-depth analysis of the algorithm.
But now, I have two questions: regardless of what kind of software is running on my machine and its CPU usage, is there any way to count the number of instructions per second sent to the CPU (millions of instructions per second, MIPS)? And is it possible to find the type of instruction set (add, compare, in, jump, etc.)?
Any piece of script or tool recommendation would be appreciated (in any language).
perf stat --all-user ./my_program on Linux will use CPU performance counters to record how many user-space instructions it ran, how many core clock cycles it took, and how much CPU time it used, and will calculate average instructions per core clock cycle for you, e.g.
3,496,129,612 instructions:u # 2.61 insn per cycle
It calculates IPC for you; this is usually more interesting than instructions per second. uops per clock is usually even more interesting in terms of how close you are to maxing out the front-end, though. You can manually calculate MIPS from instructions and task-clock. For most other events perf prints a comment with a per-second rate.
(If you don't use --all-user, you can use perf stat -e task-clock:u,instructions:u , ... to have those specific events count in user-space only, while other events can count always, including inside interrupt handlers and system calls.)
But see How to calculate MIPS using perf stat for more detail on instructions / task-clock vs. instructions / elapsed_time if you do actually want total or average MIPS across cores, and counting sleep or not.
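As a rough worked example, using the IPC from the output above and assuming (purely for illustration) that the core actually ran at the question's nominal 3.0 GHz for the whole interval: 2.61 insn/cycle * 3000 MHz = 7,830 million instructions per second on that core. Equivalently, MIPS = (instructions / 10^6) / (task-clock in seconds).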
For an example output from using it on a tiny microbenchmark loop in a static executable, see Can x86's MOV really be "free"? Why can't I reproduce this at all?
How can I get real-time information at run-time
Do you mean from within the program, to profile only part of it? There's a perf API for that: the perf_event_open(2) system call. Or use a different library for direct access to the HW perf counters.
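A minimal sketch of counting instructions for just part of a program via perf_event_open (following the pattern from the perf_event_open(2) man page; there's no glibc wrapper, so it goes through syscall(), the event choice here is just one reasonable configuration, and error handling is kept to a bare minimum):
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}
int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;
    attr.exclude_kernel = 1;   /* like instructions:u */
    attr.exclude_hv = 1;
    int fd = perf_event_open(&attr, 0, -1, -1, 0);   /* this thread, any CPU */
    if (fd < 0) { perror("perf_event_open"); return 1; }
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... region of interest ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) != sizeof(count))
        perror("read");
    printf("instructions: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}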
perf stat is great for microbenchmarking a loop that you've isolated into a stand-alone program that just runs the hot loop for a second or so.
Or maybe you mean something else. perf stat -I 1000 ... ./a.out will print counter values every 1000 ms (1 second), to see how program behaviour changes in real time with whatever time window you want (down to 10ms intervals).
sudo perf top is system-wide, slightly like Unix top
There's also perf record --timestamp to record a timestamp with each event sample. perf report -D might be useful along with this. See http://www.brendangregg.com/perf.html, he mentions something about -T (--timestamp). I haven't really used this; I mostly isolate single loops I'm tuning into a static executable I can run under perf stat.
And is it possible to find the type of instruction set (add, compare, in, jump, etc)?
Intel x86 CPUs at least have a counter for branch instructions, but other types aren't differentiated, other than FP instructions. This is probably common to most architectures that have perf counters at all.
For Intel CPUs, there's ocperf.py, a wrapper for perf with symbolic names for more microarchitectural events. (Update: plain perf now knows the names of most uarch-specific counters so you don't need ocperf.py anymore.)
perf stat -e task-clock,cycles,instructions,fp_arith_inst_retired.128b_packed_single,fp_arith_inst_retired.scalar_double,uops_executed.x87 ./my_program
These counters aren't designed to tell you which instructions are running; you can already tell that from tracing execution. Most instructions are fully pipelined, so the interesting thing is which ports have the most pressure. The exception is the divide/sqrt unit: there's a counter for arith.divider_active: "Cycles when divide unit is busy executing divide or square root operations. Accounts for integer and floating-point operations". The divider isn't fully pipelined, so a new divps or sqrtps can't always start even if no older uops are ready to execute on port 0. (http://agner.org/optimize/)
Related: linux perf: how to interpret and find hotspots for using perf to identify hotspots. Top-down profiling is especially useful: have perf sample the call-stack to see which functions make a lot of expensive child calls. (I mention this in case that's what you really wanted to know, rather than instruction mix.)
Related:
How do I determine the number of x86 machine instructions executed in a C program?
How to characterize a workload by obtaining the instruction type breakdown?
How do I monitor the amount of SIMD instruction usage
For exact dynamic instruction counts, you might use an instrumentation tool like Intel PIN, if you're on x86. https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool.
The perf stat count for the instructions:u hardware event should also be more or less exact, and in practice is very repeatable across runs of the same program doing the same work.
On recent Intel CPUs, there's HW support (Intel PT) for recording which way conditional / indirect branches went, so you can reconstruct exactly which instructions ran in which order, assuming no self-modifying code and that you can still read any JIT buffers.
Sorry I don't know what the equivalents are on AMD CPUs.

How to monitor the utilization of cores on Xeon Phi at 10Hz?

I've been trying to measure/monitor the utilization of all 60 cores on Xeon Phi (Knights Corner, in-order processors) at a relatively high frequency, say, at least every 0.1 s, which corresponds to 10 Hz.
I tried the latest PAPI library, but it only supports PAPI_TOT_INS, which counts completed instructions. This won't work because I actually need something related to the instructions issued in every 0.1 s, not the ones that finished. Several instructions issued at different cycles may finish in the same cycle. The issue of instructions is influenced by whether the core is halted or not.
Other available commands like 'top' and 'perf' operate at 1 Hz, which is too slow for my measurement. I need a higher frequency. I also need to synchronize the measurement with vital phases of my code, so Intel VTune Profiler does not work for me either.
Is there a possible way for me to monitor the issue of instructions on Xeon Phi or any other activities linked to their utilizations? I understand that those hardware counters are there, but to read them seems very challenging to me. Maybe I can deduce this utilization by measuring the CPU time of each thread?
Thanks.

Why isn't RDTSC a serializing instruction?

The Intel manuals for the RDTSC instruction warn that out of order execution can change when RDTSC is actually executed, so they recommend inserting a CPUID instruction in front of it because CPUID will serialize the instruction stream (CPUID is never executed out of order). My question is simple: if they had the ability to make instructions serializing, why didn't they make RDTSC serializing? The entire point of it appears to be to get cycle accurate timings. Is there a situation under which you would not want to precede it with a serializing instruction?
Newer Intel CPUs have a separate RDTSCP instruction that is serializing. Intel opted to introduce a separate instruction rather than change the behavior of RDTSC, which suggests to me that there has to be some situation where a potentially out of order timing is what you want. What is it?
The time stamp counter was introduced on the Pentium microarchitecture. Out-of-order execution didn't show up until the Pentium Pro. Intel could have made rdtsc serializing (architecturally or internally), but it seems that they decided to keep it non-serializing, which is OK for general-purpose time measurements, and leave it up to the programmer to add serializing instructions if necessary. This is good for reducing the overhead of the measurement.
That's actually confirmed in the document you provide, with the following comment about Pentium and Pentium/MMX (in 4.2, slightly paraphrased):
All of the rules and code samples described in section 4.1 (Pentium Pro and Pentium II) also apply to the Pentium and Pentium/MMX. The only difference is, the CPUID instruction is not necessary for serialization.
And, from Wikipedia:
The Time Stamp Counter is a 64-bit register present on all x86 processors since the Pentium.
[...]
Starting with the Pentium Pro, Intel processors have supported out-of-order execution, where instructions are not necessarily performed in the order they appear in the executable. This can cause RDTSC to be executed later than expected, producing a misleading cycle count.
One of the two uses of RDTSCP is to give you the processor ID in addition to the time stamp information (it's right there in the name Read Time-Stamp Counter *AND* Processor ID), which is useful on systems with unsynced TSCs across cores or sockets (See: How to get the CPU cycle count in x86_64 from C++?). The additional serialization properties of rdtscp makes it more convenient at the end of the region of interest (See: Is there any difference in between (rdtsc + lfence + rdtsc) and (rdtsc + rdtscp) in measuring execution time?).
If you are trying to use rdtsc to see if a branch mispredicts, the non-serializing version is what you want.
; math here (sets flags)
rdtsc                ; EDX:EAX = first timestamp; RDTSC doesn't modify flags
mov esi, eax         ; save it before the second RDTSC overwrites EDX:EAX
jz done              ; branch on the result of the math
; do some work that always takes 1 cycle
done:
rdtsc                ; EDX:EAX = second timestamp
If the branch is predicted correctly, the delta will be small (maybe even negative?). If the branch is mispredicted, the delta will be large.
With the serializing version, the branch condition will be resolved because the first rdtsc waits for the math to finish.
why didn't they make RDTSC serializing? The entire point of it appears to be to get cycle accurate timings
Well, most of the time it's to get high-resolution timestamps. At least some of the time, these timestamps are used for performance metrics. Making the instruction serializing would likely require a pipeline flush, which can be very expensive for CPU-bound applications.
Intel opted to introduce a separate instruction rather than change the behavior of RDTSC, which suggests to me that there has to be some situation where a potentially out of order timing is what you want.
Changing the behavior is almost always undesirable. Intel's customers would be disappointed to find out that RDTSC does something different on newer parts.
As paxdiablo explains, RDTSC predates the concept of "serializing" instructions because it was implemented on an in-order CPU. Adding that behavior later would change the memory access behavior of code using it, and thus be incompatible for some purposes.
Instead, more recent CPUs have a related RDTSCP instruction with stronger ordering, for exactly this reason: it is not a fully serializing instruction, but it waits until all instructions issued before it have executed (and all previous loads are globally visible) before reading the counter. Use that if you are running on modern CPUs.
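As a sketch of one common way to time a region with the intrinsics (assuming GCC or Clang and x86intrin.h; LFENCE is used here as a cheaper ordering barrier than a full CPUID, per the links above):
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence */
int main(void)
{
    unsigned aux;
    _mm_lfence();                  /* don't start timing until earlier instructions have executed */
    uint64_t t0 = __rdtsc();
    /* ... region of interest ... */
    uint64_t t1 = __rdtscp(&aux);  /* waits for the region's instructions to execute first */
    _mm_lfence();                  /* keep later instructions from starting before the read */
    printf("delta: %llu TSC ticks (core info in aux=%u)\n",
           (unsigned long long)(t1 - t0), aux);
    return 0;
}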

What is high-resolution performance counter?

In the Win32 API there is a function QueryPerformanceCounter that queries the value of a very high-resolution performance timer.
What is "high-resolution performance timer"? Is it supported by hardware? What systems does not support it?
Under Windows 7 on current-generation processors, this is a reliable high-precision timer: usually the CPU's invariant TSC, with the HPET (a timer in the chipset, not the CPU) as a fallback.
Under previous versions and on previous generations of processors, it is "something", which can mean pretty much anything. Most commonly, it is the value returned by the RDTSC instruction (or an equivalent on non-x86), which may or may not be reliable and clock-independent. Note that RDTSC originally, by definition, measured cycles rather than time, although that is no longer the case on current CPUs.
On current- and previous-generation CPUs, RDTSC is usually reliable and clock-independent (i.e. it really is measuring time); on older generations, especially on mobile or some multi-CPU rigs, it is not. The "timer" may accelerate and decelerate, and even be different on different CPUs, causing "time travel".
Edit: The invariant-TSC bit in CPUID leaf 0x80000007 (EDX bit 8) can be used to tell whether RDTSC is reliable or not (though this does not really solve the problem, because what to do if it isn't, if there is no alternative...).
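A quick way to check that bit from C (a sketch using the <cpuid.h> helper shipped with GCC and Clang; EDX bit 8 of leaf 0x80000007 is the invariant-TSC bit):
#include <cpuid.h>
#include <stdio.h>
int main(void)
{
    unsigned eax, ebx, ecx, edx;
    /* Leaf 0x80000007, EDX bit 8 = invariant TSC (constant rate, doesn't stop in C-states). */
    if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx) && (edx & (1u << 8)))
        puts("invariant TSC: yes");
    else
        puts("invariant TSC: no (or leaf not supported)");
    return 0;
}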
On yet older systems (like, 8-10 years old), some other timers may be used for QueryPerformanceCounter. Those may neither have high resolution at all, nor be terribly accurate.
High-resolution performance counters are usually pulled from the rdtsc instruction, which is an x86-specific way to fetch the number of CPU ticks that have occurred since boot. The value is very precise, usually down to 100 ns accuracy.
Compare this to GetTickCount(), which has an accuracy of roughly 16 ms.
On other architectures (which are out of the scope of Win32 APIs, since they only run on x86-based instruction sets) there may be different ways of doing this. For example, on ARM you can use the System Control Coprocessor (CP15) to do something similar.
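For completeness, typical QueryPerformanceCounter usage looks something like this (a minimal sketch; QueryPerformanceFrequency reports the counter's tick rate, which is fixed at boot, so raw ticks can be converted to seconds):
#include <windows.h>
#include <stdio.h>
int main(void)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);   /* counts per second */
    QueryPerformanceCounter(&t0);
    Sleep(100);                         /* stand-in for the work being timed */
    QueryPerformanceCounter(&t1);
    double seconds = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
    printf("elapsed: %.6f s\n", seconds);
    return 0;
}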
