MIPS System Time - performance

How would I get the current system time with the MIPS instruction set? I would like to benchmark some programs and would like to find the time in milli or nanoseconds that it takes for them to complete.
I am aware that I could run the assembly code from within C and time it with the C time libraries; however, I would like to do this in MIPS assembly alone.

Related

Benchmarking - How to count number of instructions sent to CPU to find consumed MIPS

Suppose I have some software and want to study its behavior using a black-box approach. I have a 3.0 GHz CPU with 2 sockets and 4 cores. As you know, in order to find out instructions per second (IPS) we have to use the following formula:
IPS = sockets*(cores/sockets)*clock*(instructions/cycle)
At first, I wanted to find the number of instructions per cycle for my specific algorithm. Then I realised it's almost impossible to count it using a black-box approach, and that I need to do an in-depth analysis of the algorithm.
But now I have two questions: Regardless of what kind of software is running on my machine and its CPU usage, is there any way to count the number of instructions per second sent to the CPU (millions of instructions per second, MIPS)? And is it possible to find the type of instruction set (add, compare, in, jump, etc)?
Any piece of script or tool recommendation would be appreciated (in any language).
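(To put the formula in concrete terms with the numbers from the question, and assuming 2 sockets with 4 cores each plus a placeholder figure of 2 instructions per cycle: IPS = 2 * 4 * 3.0*10^9 * 2 ≈ 48*10^9 instructions per second, i.e. about 48,000 MIPS. The instructions-per-cycle figure is exactly the unknown that the measurements below are meant to supply.)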
perf stat --all-user ./my_program on Linux will use CPU performance counters to record how many user-space instructions it ran, how many core clock cycles it took, and how much CPU time it used, and it will calculate the average instructions per core clock cycle for you, e.g.
3,496,129,612 instructions:u # 2.61 insn per cycle
It calculates IPC for you; this is usually more interesting than instructions per second. uops per clock is usually even more interesting in terms of how close you are to maxing out the front-end, though. You can manually calculate MIPS from instructions and task-clock. For most other events perf prints a comment with a per-second rate.
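For example, with a made-up task-clock figure (the output above only shows the instruction count): if those 3,496,129,612 instructions:u took 1,200 ms of task-clock, that's 3,496,129,612 / 1.2 s ≈ 2.9*10^9 instructions per second, i.e. about 2,900 MIPS.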
(If you don't use --all-user, you can use perf stat -e task-clock:u,instructions:u,... to have those specific events count in user-space only, while other events can count always, including inside interrupt handlers and system calls.)
But see How to calculate MIPS using perf stat for more detail on instructions / task-clock vs. instructions / elapsed_time if you do actually want total or average MIPS across cores, and counting sleep or not.
For an example output from using it on a tiny microbenchmark loop in a static executable, see Can x86's MOV really be "free"? Why can't I reproduce this at all?
How can I get real-time information at run-time?
Do you mean from within the program, to profile only part of it? There's a perf API where you can do perf_event_open or something. Or use a different library for direct access to the HW perf counters.
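As a rough illustration of that API, here is a minimal sketch adapted from the example in the perf_event_open(2) man page (Linux-only; the loop being measured is just a placeholder):

#include <linux/perf_event.h>   // PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS, ioctl requests
#include <sys/syscall.h>        // __NR_perf_event_open
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstring>
#include <cstdint>
#include <cstdio>

// glibc has no wrapper for this syscall, so supply one ourselves
static long perf_event_open(perf_event_attr *attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;   // retired instructions
    attr.disabled = 1;                          // start counting only when enabled below
    attr.exclude_kernel = 1;                    // user-space only, like instructions:u
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);   // this thread, any CPU
    if (fd == -1) { std::perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile std::uint64_t sum = 0;             // region being profiled (placeholder)
    for (int i = 0; i < 1000000; ++i) sum += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    std::uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) return 1;
    std::printf("instructions: %llu\n", (unsigned long long)count);
    close(fd);
}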
perf stat is great for microbenchmarking a loop that you've isolated into a stand-alone program that just runs the hot loop for a second or so.
Or maybe you mean something else. perf stat -I 1000 ... ./a.out will print counter values every 1000 ms (1 second), to see how program behaviour changes in real time with whatever time window you want (down to 10ms intervals).
sudo perf top is system-wide, slightly like Unix top
There's also perf record --timestamp to record a timestamp with each event sample. perf report -D might be useful along with this. See http://www.brendangregg.com/perf.html, he mentions something about -T (--timestamp). I haven't really used this; I mostly isolate single loops I'm tuning into a static executable I can run under perf stat.
And is it possible to find the type of instruction set (add, compare, in, jump, etc)?
Intel x86 CPUs at least have a counter for branch instructions, but other types aren't differentiated, other than FP instructions. This is probably common to most architectures that have perf counters at all.
For Intel CPUs, there's ocperf.py, a wrapper for perf with symbolic names for more microarchitectural events. (Update: plain perf now knows the names of most uarch-specific counters so you don't need ocperf.py anymore.)
perf stat -e task_clock,cycles,instructions,fp_arith_inst_retired.128b_packed_single,fp_arith_inst_retired.scalar_double,uops_executed.x87 ./my_program
It's not designed to tell you what instructions are running; you can already tell that from tracing execution. Most instructions are fully pipelined, so the interesting thing is which ports have the most pressure. The exception is the divide/sqrt unit: there's a counter for arith.divider_active: "Cycles when divide unit is busy executing divide or square root operations. Accounts for integer and floating-point operations". The divider isn't fully pipelined, so a new divps or sqrtps can't always start even if no older uops are ready to execute on port 0. (http://agner.org/optimize/)
Related: linux perf: how to interpret and find hotspots for using perf to identify hotspots. Especially using top-down profiling you have perf sample the call-stack to see which functions make a lot of expensive child calls. (I mention this in case that's what you really wanted to know, rather than instruction mix.)
Related:
How do I determine the number of x86 machine instructions executed in a C program?
How to characterize a workload by obtaining the instruction type breakdown?
How do I monitor the amount of SIMD instruction usage
For exact dynamic instruction counts, you might use an instrumentation tool like Intel PIN, if you're on x86. https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool.
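If you go the Pin route, the canonical starting point is an instruction-counting pintool, roughly like the following (sketched from the inscount0 example shipped with the Pin kit; treat the exact API spelling as approximate and check the kit's own samples):

#include "pin.H"
#include <iostream>

static UINT64 icount = 0;

// Analysis routine: executed before every dynamic instruction
VOID docount() { icount++; }

// Instrumentation routine: Pin calls this for each static instruction it
// encounters, and we ask it to insert a call to docount() in front of it
VOID Instruction(INS ins, VOID *v) {
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_END);
}

VOID Fini(INT32 code, VOID *v) {
    std::cerr << "Dynamic instruction count: " << icount << std::endl;
}

int main(int argc, char *argv[]) {
    if (PIN_Init(argc, argv)) return 1;
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();   // never returns
    return 0;
}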
perf stat counts for the instructions:u hardware event should also be more or less exact, and are in practice very repeatable across runs of the same program doing the same work.
On recent Intel CPUs, there's HW support (Intel PT) for recording which way conditional / indirect branches went, so you can reconstruct exactly which instructions ran in which order, assuming no self-modifying code and that you can still read any JIT buffers.
Sorry I don't know what the equivalents are on AMD CPUs.

Alternative to Intel C++ compiler for Windows and OSX, which provides CPU dispatching

I'm quite honestly sick of the Intel compiler now, because it's just buggy: it sometimes generates incorrect, crashing code, which is especially bad since compilation takes about 2 hours, so there's really no way to work around it by trial and error. Profile-guided optimizations, which are needed to keep the executables at least reasonably sized, currently always generate crashing code for me, so...
But it has one perk no other compiler I know of has: dispatching to different instruction sets, which is essential for my use case, signal processing. Is there any other compiler that can do that?
(For the record, I'm even OK with putting a pragma on every loop that needs the CPU dispatching; non-looped operations probably don't need it.)
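One concrete alternative worth looking at (not from the original thread; shown only to illustrate what other compilers offer): GCC, and more recently Clang, can do per-function CPU dispatching via function multi-versioning, e.g. GCC's target_clones attribute. A minimal sketch, assuming an x86 target and a reasonably recent GCC:

#include <cstddef>

// The compiler builds one clone of this function per listed ISA and emits
// an ifunc resolver that picks the best supported version at run time.
__attribute__((target_clones("avx2", "sse4.2", "default")))
void scale(float *dst, const float *src, float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;   // each clone gets auto-vectorized for its ISA
}

This dispatches whole functions rather than individual loops, so hot loops have to be factored out into their own functions, which is not far from the pragma-per-loop workflow described above.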

Execution time of a Java program

First of all, here's just something I'm curious about
I've made a little program which fills some templates with values, and I noticed that every time I run it the execution time changes a little bit; it ranges from 0.550 s to 0.600 s. My CPU is running at 2.9 GHz, if that's useful.
The instructions are always the same, so is it something to do with physics, or something more software-oriented?
It has to do with Java running on a virtual machine (class loading, JIT compilation and garbage collection all add run-to-run variation); even a C program will run slightly longer or shorter from one run to the next, because the operating system decides when a program gets resources (CPU time, memory, ...) to execute.

Intel Parallel Studio timing inconsistencies

I have some code that uses Intel TBB and I'm running on a 32 core machine. In the code, I use
parallel_for(blocked_range(2, left_image_width - 2, left_image_width / 32), ...)
to spawn 32 threads that do concurrent work; there are no race conditions, and each thread is hopefully given the same amount of work. I'm using clock_t to measure how long my program takes. For a certain image, it takes roughly 19 seconds to complete.
Then I ran my code through Intel Parallel Studio and it ran the code in 2 seconds. This is the result I was expecting, but I can't figure out why there's such a large difference between the two. Does clock_t sum the clock cycles on all the cores? Even then it doesn't make sense. Below is the snippet in question.
clock_t begin = clock();                             // clock() returns processor time in clock ticks
create_threads_and_do_work();
clock_t end = clock();
double diffticks = end - begin;
double diffms = (diffticks * 1000) / CLOCKS_PER_SEC; // convert ticks to milliseconds
cout << "And the time is " << diffms << " ms" << endl;
Any advice would be appreciated.
It isn't quite clear whether the difference in run time is the result of two different inputs (images) or simply of two different run-time measuring methods (the clock_t difference vs. the Intel software's measurement). Furthermore, you aren't showing us what goes on in create_threads_and_do_work(), and you didn't mention which tool within Intel Parallel Studio you are using. Is it VTune?
Your clock_t difference measures processor time, not elapsed wall-clock time, and on many systems (Linux included) clock() sums the CPU time of all threads in the process, not just the main thread that called it. With 32 threads keeping the cores busy, roughly 19 seconds of accumulated CPU time is entirely consistent with about 2 seconds of elapsed time. Exactly which threads get counted also depends on the platform and on whether create_threads_and_do_work() waits for the spawned threads to complete before returning; if all you do in the function is that parallel_for(), it does block until every iteration has finished. If what you want to compare against the Intel tool's number is elapsed time, use a wall-clock timer instead, as in the sketch below.
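A minimal sketch of timing the same region with a wall-clock timer (std::chrono here; tbb::tick_count would serve equally well), assuming create_threads_and_do_work() is the function from the snippet above and blocks until the parallel work is done:

#include <chrono>
#include <iostream>

auto t0 = std::chrono::steady_clock::now();
create_threads_and_do_work();                 // parallel_for() blocks until all iterations finish
auto t1 = std::chrono::steady_clock::now();
double wall_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
std::cout << "Wall-clock time: " << wall_ms << " ms" << std::endl;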
Within Intel Parallel Studio there is a profiling tool called VTune. It's a powerful tool, and when you run your program through it you can view (in a graphically pleasing way) the processing time (as well as the number of times called) of each function in your code. I'm pretty sure that after doing this you'll figure it out.
One last idea: did the program run to completion under the Intel software? I'm asking because sometimes VTune will collect data for some time and then stop without allowing the program to complete.

How to measure x86 and x86-64 assembly commands execution time in processor cycles? [duplicate]

This question already has answers here: How many CPU cycles are needed for each assembly instruction?
I want to write a bunch of optimizations for gcc using genetic algorithms.
I need to measure the execution time of assembly functions for some statistics and fitness functions.
The usual time measurement can't be used, because it is influenced by the cache.
So I need a table where I can see something like this:
command | operands | operand sizes | execution cycles
Am I misunderstanding something?
Sorry for bad English.
With modern CPUs, there are no simple tables for looking up how long an instruction will take to complete (although such tables exist for some old processors, e.g. the 486). Your best information on what each instruction does and how long it might take comes from the chip manufacturer; e.g. Intel's documentation manuals are quite good (there's also an optimisation manual on that page).
On pretty much all modern CPUs there's also the RDTSC instruction, which reads the time stamp counter of the processor on which the code is running into EDX:EAX. There are pitfalls with this too, but essentially, if the code you are profiling is representative of a real use situation and its execution doesn't get interrupted or migrated to another CPU core, then you can use this instruction to get the timings you want. That is, surround the code you are optimising with two RDTSC instructions and take the difference in TSC as the timing. (Variance in timings across different tests/situations can be large; statistics is your friend.)
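A minimal sketch of that approach using compiler intrinsics rather than hand-written assembly (GCC/Clang headers assumed; MSVC offers the same intrinsics via <intrin.h>). Bear in mind that on modern x86 the TSC ticks at a constant reference frequency rather than the current core clock, and that fences are needed so out-of-order execution doesn't blur the edges of the timed region:

#include <x86intrin.h>   // __rdtsc, _mm_lfence
#include <cstdint>
#include <cstdio>

int main() {
    _mm_lfence();                          // keep earlier work out of the timed region
    std::uint64_t start = __rdtsc();
    _mm_lfence();

    volatile std::uint64_t sum = 0;        // code being measured (placeholder)
    for (int i = 0; i < 100000; ++i) sum += i;

    _mm_lfence();
    std::uint64_t end = __rdtsc();

    std::printf("TSC delta: %llu reference cycles\n",
                (unsigned long long)(end - start));
}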
reading the system clock value?
You can instrument your code using assembly (rdtsc and friends) or using an instrumentation API like PAPI. Accurately measuring the clock cycles spent executing one particular instruction is not possible, however; you can refer to your architecture's developer manuals for the best estimates.
In both cases, you should be careful to account for effects of running in an SMP environment.
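For completeness, a rough sketch of what a PAPI-instrumented region looks like, using the classic low-level API (the counted loop is a placeholder, and whether the preset events are available depends on the hardware):

#include <papi.h>
#include <cstdio>

int main() {
    int evset = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_TOT_INS);   // total instructions retired
    PAPI_add_event(evset, PAPI_TOT_CYC);   // total cycles

    PAPI_start(evset);
    volatile long sum = 0;                 // code being measured (placeholder)
    for (long i = 0; i < 1000000; ++i) sum += i;
    PAPI_stop(evset, counts);

    std::printf("instructions: %lld  cycles: %lld\n", counts[0], counts[1]);
}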

Resources