Most relevant performance indicators for C/C++

Most relevant performance indicators for C/C++ - performance

I am looking for relevant performance indicators to benchmark and optimize my C/C++ code. For example, virtual memory usage is a simple but efficient indicator, but I know some are more specialized and help in optimizing specific domains : cache hits/misses, context switches, and so on.
I believe here is a good place to have a list of performance indicators, what they measure, and how to measure them, in order to help people who want to start optimizing their programs know where to start.

Time is the most relevant indicator.
This is why most profilers default to measuring / sampling time or core clock cycles. Understanding where your code spends its time is an essential first step to looking for speedups. First find out what's slow, then find out why it's slow.
There are 2 fundamentally different kinds of speedups you can look for, and time will help you find both of them.
Algorithmic improvements: finding ways to do less work in the first place. This is often the most important kind, and the one Mike Dunlavey's answer focuses on. You should definitely not ignore this. Caching a result that's slow to recompute can be very worth it, especially if it's slow enough that loading from DRAM is still faster.
Using data structures / algorithms that can more efficiently solve your problem on real CPUs is somewhere between these two kinds of speedups. (e.g. linked lists are in practice often slower than arrays because pointer-chasing latency is a bottleneck, unless you end up copying large arrays too often...)
Applying brute force more efficiently to do the same work in fewer cycles. (And/or more friendly to the rest of the program with smaller cache footprint and/or less branching that takes up space in the branch predictors, or whatever.)
Often involves changing your data layout to be more cache friendly, and/or manually vectorizing with SIMD. Or doing so in a smarter way. Or writing a function that handles a common special case faster than your general-case function. Or even hand-holding the compiler into making better asm for your C source.
Consider summing an array of float on modern x86-64: Going from latency-bound scalar addition to AVX SIMD with multiple accumulators can give you a speedup of 8 (elements per vector) * 8 (latency / throughput on Skylake) = 64x for a medium-sized array (still on a single core/thread), in the theoretical best case where you don't run into another bottleneck (like memory bandwidth if your data isn't hot in L1d cache). Skylake vaddps / vaddss has 4 cycle latency, and 2-per-clock = 0.5c reciprocal throughput. (https://agner.org/optimize/). Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for more about multiple accumulators to hide FP latency. But this still loses hard vs. storing the total somewhere, and maybe even updating the total with a delta when you change an element. (FP rounding error can accumulate that way, though, unlike integers.)
If you don't see an obvious algorithmic improvement, or want to know more before making changes, check whether the CPU is stalling on anything, or if it's efficiency chewing through all the work the compiler is making it do.
Instructions per clock (IPC) tells you whether the CPU is close to its max instruction throughput or not. (Or more accurately, fused-domain uops issued per clock on x86, because for example one rep movsb instruction is a whole big memcpy and decodes to many many uops. And cmp/jcc fuses from 2 instructions to 1 uop, increasing IPC but the pipeline width is still fixed.)
Work done per instruction is a factor, too, but isn't something you can measure with a profiler: if you have the expertise, look at compiler-generated asm to see if the same work with fewer instructions is possible. If the compiler didn't auto-vectorize, or did so inefficiently, you can maybe get a lot more work done per instruction by manually vectorizing with SIMD intrinsics, depending on the problem. Or by hand-holding the compiler into emitting better asm by tweaking your C source to compute things in a way that is natural for asm. e.g. What is the efficient way to count set bits at a position or lower?. And see also C++ code for testing the Collatz conjecture faster than hand-written assembly - why?
If you find low IPC, figure out why by considering possibilities like cache misses or branch misses, or long dependency chains (often a cause of low IPC when not bottlenecked on the front-end or memory).
Or you might find that it's already close to optimally applying the available brute force of the CPU (unlikely but possible for some problems). In that case your only hope is algorithmic improvements to do less work.
(CPU frequency isn't fixed, but core clock cycles is a good proxy. If your program doesn't spend time waiting for I/O, then core clock cycles is maybe more useful to measure.)
A mostly-serial portion of a multi-threaded program can be hard to detect; most tools don't have an easy way to find threads using cycles when other threads are blocked.
Time spent in a function isn't the only indicator, though. A function can make the rest of the program slow by touching a lot of memory, resulting in eviction of other useful data from cache. So that kind of effect is possible. Or having a lot of branches somewhere can maybe occupy some of the branch-prediction capacity of the CPU, resulting in more branch misses elsewhere.
But note that simply finding where the CPU is spending a lot of time executing is not the most useful, in a large codebase where functions containing hotspots can have multiple callers. e.g. lots of time spent in memcpy doesn't mean you need to speed up memcpy, it means you need to find which caller is calling memcpy a lot. And so on back up the call tree.
Use profilers that can record stack snapshots, or just hit control-C in a debugger and look at the call stack a few times. If a certain function usually appears in the call stack, it's making expensive calls.
Related: linux perf: how to interpret and find hotspots, especially Mike Dunlavey's answer there makes this point.
Algorithmic improvements to avoid doing work at all are often much more valuable than doing the same work more efficiently.
But if you find very low IPC for some work you haven't figured out how to avoid yet, then sure take a look at rearranging your data structures for better caching, or avoiding branch mispredicts.
Or if high IPC is still taking a long time, manually vectorizing a loop can help, doing 4x or more work per instruction.

#PeterCordes answers are always good. I can only add my own perspective, coming from about 40 years optimizing code:
If there is time to be saved (which there is), that time is spent doing something unnecessary, that you can get rid of if you know what it is.
So what is it? Since you don't know what it is, you also don't know how much time it takes, but it does take time. The more time it takes, the more worthwhile it is to find, and the easier it is to find it. Suppose it takes 30% of the time. That means a random-time snapshot has a 30% chance of showing you what it is.
I take 5-10 random snapshots of the call stack, using a debugger and the "pause" function.
If I see it doing something on more than one snapshot, and that thing can be done faster or not at all, I've got a substantial speedup, guaranteed.
Then the process can be repeated to find more speedups, until I hit diminishing returns.
The important thing about this method is - no "bottleneck" can hide from it. That sets it apart from profilers which, because they summarize, speedups can hide from them.

Related

Optimizing double->long conversion with memory operand using perf?

I have an x86-64 Linux program which I am attempting to optimize via perf. The perf report shows the hottest instructions are scalar conversions from double to long with a memory argument, for example:
cvttsd2si (%rax, %rdi, 8), %rcx
which corresponds to C code like:
extern double *array;
long val = (long)array[idx];
(This is an unusual bottleneck but the code itself is very unusual.)
To inform optimizations I want to know if these instructions are hot because of the load from memory, or because of the arithmetic conversion itself. What's the best way to answer this question? What other data should I collect and how should I proceed to optimize this?
Some things I have looked at already. CPU counter results show 1.5% cache misses per instruction:
30686845493287      cache-references
     2140314044614      cache-misses           #    6.975 % of all cache refs
          52970546      page-faults
  1244774326560850      instructions
   194784478522820      branches
     2573746947618      branch-misses          #    1.32% of all branches
          52970546      faults
Top-down performance monitors show we are primarily backend-bound:
frontend bound retiring bad speculation backend bound
10.1% 25.9% 4.5% 59.5%
Ad-hoc measurement with top shows all CPUs pegged at 100% suggesting we are not waiting on memory.
A final note of interest: when run on AWS EC2, the code is dramatically slower (44%) on AMD vs Intel with the same core count. (Tested on Ice Lake 8375C vs EPYC 7R13). What could explain this discrepancy?
Thank you for any help.

To inform optimizations I want to know if these instructions are hot because of the load from memory, or because of the arithmetic conversion itself. What's the best way to answer this question?
I think there is two main reason for this instruction to be slow. 1. There is a dependency chain and the latency of this instruction is a problem since the processor is waiting on it to execute other instructions. 2. There is a cache miss (saturating the memory with such instruction is improbable unless many cores are doing memory-based operations).
First of all, tracking what is going on on a specific instruction is hard (especially if the instruction is not executed a lot of time). You need to use precise events to track the root of the problem, that is, events for which the exact instruction addresses that caused the event are available. Only a (small) subset of all events are precise one.
Regarding (1), the latency of the instruction should be about 12 cycles on both architecture (although it might be slightly more on the AMD processor, I do not expect a 44% difference). The target processor are able to execute multiple instruction at the same time in a given cycle. Instructions are executed on different port and are also pipelined. The port usage matters to understand what is going on. This means all the instruction in the hot loop matters. You cannot isolate this specific instruction. Modern processors are insanely complex so a basic analysis can be tricky. On Ice Lake processors, you can measure the average port usage with events like UOPS_DISPATCHED.PORT_XXX where XXX can be 0, 1, 2_3, 4_9, 5, 6, 7_8. Only the first three matters for this instruction. The EXE_ACTIVITY.XXX events may also be useful. You should check if a port is saturated and which one. AFAIK, none of these events are precise so you can only analyse a block of code (typically the hot loop). On Zen 3, the ports are FP23 and FP45. IDK what are the useful events on this architecture (I am not very familiar with it).
On Ice Lake, you can check the FRONTEND_RETIRED.LATENCY_GE_XXX events where XXX is a power of two integer (which should be precise one so you can see if this instruction is the one impacting the events). This help you to see whether the front-end or the back-end is the limiting factor.
Regarding (2), you can check the latency of the memory accesses as well as the number of L1/L2/L3 cache hits/misses. On Ice Lake, you can use events like MEM_LOAD_RETIRED.XXX where XXX can be for example L1_MISS L1_HIT, L2_MISS, L2_HIT, L3_MISS and L3_HIT. Still on Ice Lake, t may be useful to track the latency of the memory operation with MEM_TRANS_RETIRED.LOAD_LATENCY_GT_XXX where XXX is again a power of two integer.
You can also use LLVM-MCA to simulate the scheduling of the loop instruction statically on the target architecture (do not consider branches). This is very useful to understand deeply what the scheduler can do pretty easily.
What could explain this discrepancy?
The latency and reciprocal throughput should be about the same on the two platform or at least close. That being said, for the same core count, the two certainly do not operate at the same frequency. If this is not coming from that, then I doubt this instruction is actually the problem alone (tricky scheduling issues, wrong/inaccurate profiling results, etc.).
CPU counter results show 1.5% cache misses per instruction
The thing is the cache-misses event is certainly not very informative here. Indeed, it references the last-level cache (L3) misses. Thus, it does not give any information about the L1/L2 misses (previous events do).
how should I proceed to optimize this?
If the code is latency bound, the solution is to first break any dependency chain in this loop. Unrolling the loop dans rewriting it so to make it more SIMD-friendly can help a lot to improve performance (the reciprocal throughput of this instruction is about 1 cycle as opposed to 12 for the latency so there is a room for improvements in this case).
If the code is memory bound, they you should care about data locality. Data should fit in the L1 cache if possible. There are many tips to do so but it is hard to guide you without more context. This includes for example sorting data, reordering loop iterations, using smaller data types.
There are many possible source of weird unusual unexpected behaviours that can occurs. If such a thing happens, then it is nearly impossible to understand what is going on without the exact code executed. All details matter in this case.

why is the cache-misses to instructions ratio a better indicator of cache performance compared to the cache-references to cache-misses ratio?

I am using perf to test the code of a theoretically proven to be cache friendly algorithm.
According to this article the cache-misses to instructions is a good indicator of cache performance.
The ratio of cache-misses to instructions will give an indication how
well the cache is working; the lower the ratio the better. In this
example the ratio is 1.26% (6,605,955 cache-misses/525,543,766
instructions). Because of the relatively large difference in cost
between the RAM memory and cache access (100’s cycles vs <20 cycles)
even small improvements of cache miss rate can significantly improve
performance. If the cache miss rate per instruction is over 5%,
further investigation is required.
However when I run perf like this:
perf stat -B -e cache-references,cache-misses,instructions ./td 1.txt 2.txt
Perf will print the following:
Performance counter stats for './td 1.txt 2.txt':
93,497,101 cache-references
56,452,246 cache-misses # 60.379 % of all cache refs
8,115,626,200 instructions
2.509309040 seconds time elapsed
So it focuses more on the cache-references to cache-misses ratio instead of the one suggested in the article.
The cache-misses to cache-references ratio seems very bad, 60%, which means 60% of the time my application accesses the cache I get a cache miss. On the other hand the cache-misses to instructions ratio is only 0.6%.
I am not sure what to get out of this. Which ratio should I aim to optimize?

They're ultimately both misleading, but in different ways.
Misses to hits is interesting to know, however you can "soak" some misses by having lots of arithmetic to do while the miss is being handled. Misses to #instructions would tell you something about that, in that a very low number of misses/instruction suggests that you're in such a case. It doesn't mean you actually are though, for example if the address of the next load which misses is calculated by a long calculation that itself depends on the previous miss, then it all becomes serialized and the misses/instruction becomes a bit misleading. Even so, if it's low enough, then the total time mostly depends on the arithmetic so the misses wouldn't be a big problem.
It shouldn't necessarily be optimized, since you can cheat by doing useless arithmetic work. Or, more reasonably, do a trade-off that costs more (useful) arithmetic to lose some misses, which sounds good except that you can go too far with it. Obviously if it starts taking more actual time (or in general it starts performing worse in whatever you really care about), it doesn't matter that you're improved some fairly artificial metric (unless that really was what you cared about, which you might in a synthetic benchmark).
Misses/reference obviously tells you something about your access patterns, but doing lots of cache hitting memory references isn't an indicator of good code: maybe there are just a lot of unnecessary memory references. Or to put it an other way, if data is only needed once, then touching it again (even when that generates no miss) is still a waste. It really depends on the problem. For example, if you're just summing up an array, putting the accumulator(s) in memory looks really good from the point of view of this metric, but it's obviously a really bad thing to do.
So,
Which ratio should I aim to optimize?
Neither, unless you're going for something purely synthetic. Use them to get a clue about how the code is performing, then optimize elapsed time (or power or whatever, depends on your goals). These metics being "suspiciously good" can be as much of a clue as them being "bad", so perhaps that should be just called low and high.

Algorithmic Complexity Analysis: practically using Knuth's Ordinary Operations (oops) and Memory Operations (mems) method

In implementing most algorithms (sort, search, graph traversal, etc.), there is frequently a trade-off that can be made in reducing memory accesses at the cost of additional ordinary operations.
Knuth has a useful method for comparing the complexity of various algorithm implementations by abstracting it from particular processors and only distinguishing between ordinary operations (oops) and memory operations (mems).
In compiled programs, one typically lets the compiler organise the low level operations, and hopes that the operating system will handle the question of whether data is held in cache memory (faster) or in virtual memory (slower). Furthermore, the exact number / cost of instructions is encapsulated by the compiler.
With Forth, there is no longer such encapsulation, and one is much closer to the machine, albeit perhaps to a stack machine running on top of a register processor.
Ignoring the effect of an operating system (so no memory stalls, etc.), and assuming for the moment a simple processor,
(1) Can anyone advise on how the ordinary stack operations in Forth (e.g. dup, rot, over, swap, etc.) compare with the cost of Forth's memory access fetch (#) or store (!) ?
(2) Is there a rule of thumb I can use to decide how many ordinary operations to trade-off against saving a memory access?
What I'm looking for is something like 'memory access costs as much as 50 ordinary ops, or 500 ordinary ops, or 5 ordinary ops' Ballpark is absolutely fine.
I'm trying to get a sense of the relative expense of fetch and store vs. rot, swap, dup, drop, over, correct to an order of magnitude.

This article How much time does it take to fetch one word from memory? talks about main memory stall times, with some rule of thumb type numbers, but basically you can do lots of instructions while stalling for main memory. As others have said, the numbers vary a lot between systems.
Main memory stalls is a big area of interest, especially as CPUs have more cores, but typically not much faster memory bandwidth. There is some research going on around compressing data in main memory too, so that the CPU can take advantage of 'spare' cycles and tightly packed cache lines http://oai.cwi.nl/oai/asset/15564/15564B.pdf
For those who are really interested in the details, most CPU manufacturers publish in depth guides on memory optimisations etc. mostly aimed at high end and compiler writers, but readable by all 2gl and 3gl programmers.
Ps. Go Forth.

A comparison between memory fetches and register operations is okay for assembler programs, as it is for the output of c-compilers, which is in fact an assembler program.
In Forth this question hardly makes sense. In the first place Forth is an interpreter and in using Forth one foregoes the ultimate in speed. Of course one could add an optimiser on top of Forth but then the question makes even less sense, because the output of a c-optimiser and a Forth optimiser converge to -- you guessed it -- an optimal solution.
Let's look at an elementary operation in Forth like AND.
This is implemented as
> CODE AND
> POP AX
> POP BX
> AND AX, BX
> PUSH AX
> NEXT
So we see already three memory operations for something that looks like an elementary calculation operation. It appears the Knuth metric is not applicable. Also Forth seems to be loosing big time.That is however not true. Those memory operations are all onto the L1 cache of a typical processor. That is about as efficient as local variables in small c functions,
We can compare stack operations with memory operations using VARIABLE's and the stack. The answer is simple. A VARIABLE risks a memory stall. A stack operation will almost certainly be a L1 cache hit. This is the single most important point of consideration. However the question explicitly asks not to consider it!
So there.

How to compare two implementations of the same algorithm? (by examine their Assembly code)

Assume I have two implementations of the same algorithm in assembly. I would like to know by examining the two snippets codes which one is faster.
The parameters I thought one might take into account are: number of op-codes, number of branches, number of function frames.
My questions are:
Can I assume each opcode execution is one cycle ?
What is the overhead of branch which break the pipeline ?
What are the effects and overhead of calling a function ?
Is there a difference in the analysis between ARM and x86 ?
The question is theoretical since I have two implementations; one 130 instructions long and one is 184 instructions long.
And I would like to know if it is definitely true to say the 130 instructions long snippet is faster than the 184 instructions long implementation?
"BETTER == FASTER"

Without wanting to be flippant, the answers are
no
that depends on your hardware
that depends on your hardware
yes
You would really need to test things on your target hardware, or have a simulator that understands your hardware fully, in order to answer your question the way you meant to...
For the last part of your question, you need to define "better"…better.

Since you asked about a Cortex A9, the data sheet has instruction cycle counts in appendix B. These counts generally assume that the memory bus is fast enough to keep the CPU busy. In reality this is rarely the case. Many video/audio algorithms will have a big win in how they access memory.
One cycle per op
Of course you can't assume this if you want an exact count. However, if you are deciding which algorithm to choose, you can get a feel for the best algorithm by looking at the instructions in the inner loop. Here, your cache should allow the code to execute as per the instruction counts in the data sheet. If the counts are close, then you probably need to look at each instruction. Load/stores are more expensive and usually multiples, etc. Some algorithms, especially crytographic, will have big wins by using assembler that doesn't map well to C. For example, clz, ror, using the carry for multi-word arithmetic, etc.
Branch overhead
Look in Appendix B, or whatever data sheet has cycle counts for your processor. For an ARM926 it is about 3 cycles. The compiler only generates two conditional opcodes in a row to avoid branching, otherwise, it branches. If the algorithm is large, the branch may disrupt the cache. A hard answer depends on your CPU, cache, and memory. According to the Cortex A9 datasheet (B.5), there is only one cycle overhead to a fixed branch.
Function overhead
This is much the same as the branch overhead. However, the compiler will also have an influence. noted by Jim Does it cache align functions. Does the compiler perform leaf function optimizations, etc. With modern gcc versions, if all the functions are static, the compiler will generally in-line when it is advantageous. If the algorithms are particularly large, a register spill may be advantageous. However, with your example of 130/184 instructions, this seems unlikely. The compiler options will obviously effect the overhead. You can use objdump -S to examine the prologue/epilogue and then determine the number of cycles for your hardware.
ARM verus x86
Of course there is a technical difference in the cycle counts. The CISC x86 also has variable instruction size. This complicates the analysis. It is slightly easier on the ARM.
Normally, you want to ball park things and then actually run them with a profiler. The estimates can help guide development of the algorithms. Loop/memory tuning, etc for your hardware. Something like instruction emulation, page or alignment faults, etc may be dominant and make all the cycle count analysis meaningless. If the algorithm is in user space, per-emption, may negate cache wins from run to run. It is possible that one algorithm will work better in a little loaded system and the other will work better under a higher load.
A note on cycle counts
See the post-process objdump for some complications in getting cycle counts. Basically a typical CPU is several phases (a pipe line) and different conditions can cause stalls. As CPU's become more complex, the pipe line typically gets longer, meaning there are more conditions or phases which can stall. However, cycle count estimates can be helpful in guiding development of an algorithm and evaluating them. Things like memory timing or branch prediction can be just as important, depending on the algorithm. Ie, cycle counts are not completely useless, but they are not complete either. Profiling should confirm actual algorithm times. If they diverge, instruction re-ordering, pre-fetching and other techniques may bring them closer. The fact that cycle counts and active profiling diverge can be helpful in itself.

It is definitely not true to say that the 130 instruction code is faster than the 184 instruction code. it is very easy to have 1000 instructions run faster than 100 and vice versa on either of these platforms.
1 Can I assume each opcode execution is one cycle ?
Start by looking at the advertised mips/mhz, although a marketing number it gives a rough idea of what is possible. If the number is greater than one then more than one instruction per clock is possible.
2 What is the overhead of branch which break the pipeline ?
Anywhere from absolutely no affect to a very dramatic affect, on either system. one clock to hundreds are the potential penalty.
3 What are the effects and overhead of calling a function ?
Depends heavily on the function, and the function calling the function. Depending on the calling convention you might have to save registers to the stack, or rearrange the contents of registers to prepare for the parameters for the function to be called. If passing a struct by value a copy of the struct may need to be made on the stack, the bigger the struct passed the bigger the copy. once in the function a stack frame may need to be prepared, etc, etc. There are many factors involved. This question and answer are also independent of platform.
4 Is there a difference in the analysis between ARM and x86 ?
yes and no, both systems use all the modern tricks of pipelining, branch prediction, etc to keep the mips/mhz up. ARM is going to give a better mips per mhz than x86, x86 being variable instruction length might give more instructions per unit cache. How you analyze the cache, and memory and peripheral systems in the systems side of the analysis is roughly the same. The comparison of the instructions and core are similar and different depending on what aspects you are analyzing. The arm is not microcoded, the x86 likely is so you dont really see how many registers there really are, things like that. at the same time the x86 you can get a better look at the memory system with the arm, since they are generally not system on a chip. Depending on what ARM chip you buy you may lose a lot of the visibility in the boundaries of the chip, might not see all the memory and peripheral busses, for example. (x86 is changing that by putting pcie on chip now for example) in the case of something in the cortex-a class you mentioned you would have similar edge of chip visibility as those would use larger/cheaper dram based memory off chip rather than microcontroller like on chip resources.
Bottom line your final question:
"And I would like to know if it is definitely true to say the 130 instructions long snippet is faster than the 184 instructions long implementation?"
It is definitely NOT TRUE to say the 130 instruction snippet is faster than the 184 instruction snippet. It might be faster it might be slower and it might be about the same. With a lot more information we might be able to make a pretty good statement or it may still be non-deterministic. it is easy to choose 100 instructions that execute faster than 1000 instructions and likewise easy to choose 1000 instructions that execute faster than 100 instructions (even if I were to add no branching and no loops, just linear execution)

Your question is almost entirely meaningless: It probably depends on your input.
Most CPUs have something resembling a branch misprediction penalty (e.g. traditional ARM which throws away an instruction fetch/decode on any taken branch, IIRC). ARM and x86 also allow conditional execution, which can be faster than branching. If either of these are dependent on input data, then different inputs will follow different code paths.
Perhaps one version heavily uses conditional execution, which is wasteful when the condition is false. Perhaps another was compiled using some profiling information that performs no branches (except the return at the end) for a specific case. There are many, many reason why a compiler can take the same source and produce an "optimized" output which is faster for one input and slower for another.
Many optimizations have this characteristic — for example, aligning the start of a loop to 16 bytes helps on some processors, but not when the loop is only executed once.

Some text book answer to this question from Cortex
™
-A Series Programmer’s Guide, chapter 17.
Although cycle timing information can be found in the Technical Reference Manual (TRM) for the processor that you are using, it is very difficult to work out how many cycles even a trivial piece of code will take to execute. The movement of instructions through the pipeline is dependent on the progress of the surrounding instructions and can be significantly affected by memory system
activity. Pending loads or instruction fetches which miss in the cache can stall code for tens of cycles. Standard data processing instructions (logical and arithmetic) will take only one or two cycles to execute, but this does not give the full picture. Instead, we must use profiling tools, or the system performance monitor built-in to the processor, to extract useful information about performance.
Also read under 17.4 Cortex-A9 micro-architecture optimizations which answers your question very very much.

measuring real running time of an algorithm

Approximately, how many physical instructions of MIPS does an abstract algorithm operation amortize to? As for an abstract algorithm operation, I means a basic operation, such as add, divide, etc.
I see this is not a strict measuring technique :-)
Kejia

There is a list of the basic MIPS instructions here. Most of the "basic operations" that you mentioned are a single MIPS instruction or perhaps two, which probably holds true on most current CPU families.
However this does not take into account at all the architecture and performance characteristics of any of the modern CPUs. Different instructions often have diffrent completion times. Current CPUs usually implement branch prediction, instruction pipelines, memory caching, parallelisation and a whole list of other techniques to make the code execution faster.
Therefore just having the assembly code implementation of an algorithm says nothing about its execution speed. You would have to measure and profile the code on the actual hardware to obtain comparable results. In fact, some algorithms may be far more effective on certain CPUs, even within the same CPU family.
A common and rather understandable example is the effect of the instruction cache. Unrolling a loop will eliminate a number of branch operations, which intuitively makes code faster. If you run that code on a CPU of the same family with very little instruction cache memory, though, the added accesses to the main memory can make it far slower than the simple branch-based loop.

Computers are complicated. If you want to get down to this level you need to start considering what kind of CPU you are using, how well your compiler can use this CPU's instruction set, what variables are being kept in what registers, what are their bit-level representations, etc. Even then, the number of instructions not always easily maps to the actual running time. Different instructions can take different ammounts of clock cycles to execute and this is not even thinking about OS threading and your program's cache miss rate.
In the end, there is a good reason we use big-O notatoin in the first place :)
BTW, most simple operations (add, subtract) on integers should map to a single machine instruction, in case you are worried.

It depends on the CPU architecture. Some processors requires several cycles for a single instruction such as divivide, while others manage to execute all machine code instructions in a single cycle each.
It is sometimes relevant to measure an algorithm in how many floating point operations it requires. However this does not take I/O (such as reading memory) into consideration.
The speed of a CPU is sometimes provided in FLOPS (Floating Point OPerations per Second) which could help to give you a time estimate. Again, not taking I/O into consideration - and not multi-threading issues (also a very important measuring factor).

Donald Knuth addressed this very problem in writing Volume 1 of "The Art of Computer Programming".
In the preface he gives a lengthy justification for presenting algorithms in the assembly code for an imaginary machine -
... To avoid this dilemma, I have
attempted to design an "ideal"
computer called "MIX," with very
simple rules of operation ...
That way, one can talk sensibly about how many "cycles" an algorithm would take, without having to care about differences between machines, caching, latency, pipelines, or any of the other ways computers have been optimized to save time, at the expense of knowing how long they will take.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio