Why not just predict both branches? - cpu

CPU's use branch prediction to speed up code, but only if the first branch is actually taken.
Why not simply take both branches? That is, assume both branches will be hit, cache both sides, and the take the proper one when necessary. The cache does not need to be invalidated. While this requires the compiler to load both branches before hand(more memory, proper layout, etc), I imagine that proper optimization could streamline both so that one can get near optimal results from a single predictor. That is, one would require more memory for loading both branches(which is exponential for N branches), the majority of the time one should be able to "recache" the failed branch with new code quickly enough before it has finished executing the branch taken.
if (x) Bl else Br;
Instead of assuming Bl is taken, assume that both Bl and Br are taken(some type of parallel processing or special interleaving) and after the branch is actually determined, one branch is then invalid and the cache could then be freed for use(maybe some type of special technique would be required to fill and use it properly).
In fact, no prediction circuitry is required and all the design used for that could be, instead, used to handle both branches.
Any ideas if this is feasible?

A Historical Perspective on Fetching Instructions from both Paths
The first similar proposal (to my knowledge) was discussed in this 1968 patent. I understand that you are only asking about fetching instructions from both branches, but bear with me a little. In that patent, three broad strategies were laid out, one of them is following both paths (the fall-through path and the branch path). That is, not just fetching instructions from both paths, but also executing both paths. When the conditional branch instruction is resolved, one of the paths is discarded. It was only mentioned as an idea in the introduction of the patent, but the patent itself was about another invention.
Later in 1977, a commercial processor was released from IBM, called the IBM 3033 processor. That is the first processor (to my knowledge) to implement exactly what you are proposing. I'm surprised to see that the Wikipedia page did not mention that the processor fetched instructions from both paths. The paper that describes the IBM 3033 is titled "The IBM 3033: An inside look". Unfortunately, I'm not able to find the paper. But the paper on the IBM 3090 does mention that fact. So what you're proposing did make sense and was implemented in real processors about half a decade ago.
A patent was filed in 1981 and granted in 1984 about processor with two memories and instructions can be fetched from both memories simultaneously. I quote from the abstract of the patent:
A dual fetch microsequencer having two single-ported microprogram
memories wherein both the sequential and jump address
microinstructions of a binary conditional branch can be simultaneously
prefetched, one from each memory. The microprogram is assembled so
that the sequential and jump addresses of each branch have opposite
odd/even polarities. Accordingly, with all odd addresses in one memory
and even in the other, the first instruction of both possible paths
can always be prefetched simultaneously. When a conditional branch
microinstruction is loaded into the execution register, its jump
address or a value corresponding to it is transferred to the address
register for the appropriate microprogram memory. The address of the
microinstruction in the execution register is incremented and
transferred to the address register of the other microprogram memory.
Prefetch delays are thereby reduced. Also, when a valid conditional
jump address is not provided, that microprogram memory may be
transparently overlayed during that microcycle.
A Historical Perspective on Fetching and Executing Instructions from both Paths
There is a lot of research published in the 80s and 90s about proposing and evaluating techniques by which instructions from both paths are not only fetched but also executed, even for multiple conditional branches. This will have the potential additional overhead of fetching data required by both paths. The idea of branch prediction confidence was proposed in this paper in 1996 and was used to improve such techniques by being more selective regarding which paths to fetch and execute. Another paper (Threaded Multiple Path Execution) published in 1998 proposes an architecture that exploits simultaneous multithreading (SMT) to run multiple paths following conditional branches. Another paper (Dual Path Instruction Processing) published in 2002 proposes to fetch, decode, and rename, but not execute, instructions from both paths.
Discussion
Fetching instructions from both paths into one or more of the caches reduces the effective capacity of the caches in general, because, typically, one of the paths will be executed much more frequently than the other (in some, potentially highly irregular, pattern). Imagine fetching into the L3 cache, which practically always shared between all the cores and holds both instructions and data. This can have negative impact on the ability of the L3 cache to hold useful data. Fetching into the much smaller L2 cache can even lead to a substantially worse performance, especially when the L3 is inclusive. Fetching instructions from both paths across multiple conditional branches for all the cores may cause hot data held in the caches to be frequently evicted and brought back. Therefore, extreme variants of the technique you are proposing would reduce the overall performance of modern architectures. However, less aggressive variants can be beneficial.
I'm not aware of any real modern processors that fetch instructions on both paths when they see a conditional branch (perhaps some do, but it's not publicly disclosed). But instruction prefetching has been extensively researched and still is. An important question here that needs to be addressed is: what is the probability that a sufficient number of instructions from the other path are already present in the cache when the predicted path turns out to be the wrong path? If the probability is high, then there would be little motivation to fetch instructions from both paths. Otherwise, there is indeed an opportunity. According to an old paper from Intel (Wrong-Path Instruction Prefetching), on the benchmarks tested, over 50% of instructions accessed on mispredicted paths were later accessed during correct path execution. The answer to this question certainly depends on the target domain of the processor being designed.

Related

Optimizing double->long conversion with memory operand using perf?

I have an x86-64 Linux program which I am attempting to optimize via perf. The perf report shows the hottest instructions are scalar conversions from double to long with a memory argument, for example:
cvttsd2si (%rax, %rdi, 8), %rcx
which corresponds to C code like:
extern double *array;
long val = (long)array[idx];
(This is an unusual bottleneck but the code itself is very unusual.)
To inform optimizations I want to know if these instructions are hot because of the load from memory, or because of the arithmetic conversion itself. What's the best way to answer this question? What other data should I collect and how should I proceed to optimize this?
Some things I have looked at already. CPU counter results show 1.5% cache misses per instruction:
30686845493287      cache-references
     2140314044614      cache-misses           #    6.975 % of all cache refs
          52970546      page-faults
  1244774326560850      instructions
   194784478522820      branches
     2573746947618      branch-misses          #    1.32% of all branches
          52970546      faults
Top-down performance monitors show we are primarily backend-bound:
frontend bound retiring bad speculation backend bound
10.1% 25.9% 4.5% 59.5%
Ad-hoc measurement with top shows all CPUs pegged at 100% suggesting we are not waiting on memory.
A final note of interest: when run on AWS EC2, the code is dramatically slower (44%) on AMD vs Intel with the same core count. (Tested on Ice Lake 8375C vs EPYC 7R13). What could explain this discrepancy?
Thank you for any help.
To inform optimizations I want to know if these instructions are hot because of the load from memory, or because of the arithmetic conversion itself. What's the best way to answer this question?
I think there is two main reason for this instruction to be slow. 1. There is a dependency chain and the latency of this instruction is a problem since the processor is waiting on it to execute other instructions. 2. There is a cache miss (saturating the memory with such instruction is improbable unless many cores are doing memory-based operations).
First of all, tracking what is going on on a specific instruction is hard (especially if the instruction is not executed a lot of time). You need to use precise events to track the root of the problem, that is, events for which the exact instruction addresses that caused the event are available. Only a (small) subset of all events are precise one.
Regarding (1), the latency of the instruction should be about 12 cycles on both architecture (although it might be slightly more on the AMD processor, I do not expect a 44% difference). The target processor are able to execute multiple instruction at the same time in a given cycle. Instructions are executed on different port and are also pipelined. The port usage matters to understand what is going on. This means all the instruction in the hot loop matters. You cannot isolate this specific instruction. Modern processors are insanely complex so a basic analysis can be tricky. On Ice Lake processors, you can measure the average port usage with events like UOPS_DISPATCHED.PORT_XXX where XXX can be 0, 1, 2_3, 4_9, 5, 6, 7_8. Only the first three matters for this instruction. The EXE_ACTIVITY.XXX events may also be useful. You should check if a port is saturated and which one. AFAIK, none of these events are precise so you can only analyse a block of code (typically the hot loop). On Zen 3, the ports are FP23 and FP45. IDK what are the useful events on this architecture (I am not very familiar with it).
On Ice Lake, you can check the FRONTEND_RETIRED.LATENCY_GE_XXX events where XXX is a power of two integer (which should be precise one so you can see if this instruction is the one impacting the events). This help you to see whether the front-end or the back-end is the limiting factor.
Regarding (2), you can check the latency of the memory accesses as well as the number of L1/L2/L3 cache hits/misses. On Ice Lake, you can use events like MEM_LOAD_RETIRED.XXX where XXX can be for example L1_MISS L1_HIT, L2_MISS, L2_HIT, L3_MISS and L3_HIT. Still on Ice Lake, t may be useful to track the latency of the memory operation with MEM_TRANS_RETIRED.LOAD_LATENCY_GT_XXX where XXX is again a power of two integer.
You can also use LLVM-MCA to simulate the scheduling of the loop instruction statically on the target architecture (do not consider branches). This is very useful to understand deeply what the scheduler can do pretty easily.
What could explain this discrepancy?
The latency and reciprocal throughput should be about the same on the two platform or at least close. That being said, for the same core count, the two certainly do not operate at the same frequency. If this is not coming from that, then I doubt this instruction is actually the problem alone (tricky scheduling issues, wrong/inaccurate profiling results, etc.).
CPU counter results show 1.5% cache misses per instruction
The thing is the cache-misses event is certainly not very informative here. Indeed, it references the last-level cache (L3) misses. Thus, it does not give any information about the L1/L2 misses (previous events do).
how should I proceed to optimize this?
If the code is latency bound, the solution is to first break any dependency chain in this loop. Unrolling the loop dans rewriting it so to make it more SIMD-friendly can help a lot to improve performance (the reciprocal throughput of this instruction is about 1 cycle as opposed to 12 for the latency so there is a room for improvements in this case).
If the code is memory bound, they you should care about data locality. Data should fit in the L1 cache if possible. There are many tips to do so but it is hard to guide you without more context. This includes for example sorting data, reordering loop iterations, using smaller data types.
There are many possible source of weird unusual unexpected behaviours that can occurs. If such a thing happens, then it is nearly impossible to understand what is going on without the exact code executed. All details matter in this case.

Drhystone benchmark on 32 bit micrcontroller

Currently I am doing a performance comparison on two 32bit microcontrollers. I used Dhrystone benchmark to run on both microcontrollers. One microcontroller has 4KB I-cache while second coontroller has 8KB of I-cache. Both microcontrollers are using same tool chain. As much as possible I kept same static and run-time settings onboth microcontrollers. But microcontroller with 4KB cache are faster than 8KB cache microcontroller. Both microcontroller are from same vendor and based on same CPU.
Could anyone provide some information why microcontroller with 4KB cache is faster than other?
Benchmarks are in general useless. dhrystone being one of the oldest ones and it may have had a little bit of value back then, before pipelines and too much compiler optimization. I think I gave up on dhrystone 15 years ago roughly about the time I started using it.
It is trivial to demonstrate that this code
.globl ASMDELAY
ASMDELAY:
sub r0,r0,#1
bne ASMDELAY
bx lr
Which is primarily two INSTRUCTIONS, can vary WIDELY in execution time on the same chip if you understand how modern processors work. The simple trick to seeing this is turn off the caches and prefetchers and such, and place this code at offset 0x0000, call it with some value. place it at 0x0004 repeat, then at 0x0008, repeat. keep doing this. You can put one, two, etc nops between the subtract and branch. try it at various offsets.
THEN, turn on and of caches for each of these alignments, if you have a prefetch outside the processor for the flash, turn that on and off.
THEN, vary your clocks, esp for those MCUs where you have to adjust the wait states based on clock rate.
On a SINGLE MCU, you are going to see those essentially two instructions vary in execution time by a very large amount. Lets just say 20 times longer in some cases than others.
Now take a small program or a small fraction of the dhrystone program. Compile that for your mcu, how many instructions do you see? Make minor to major optimization and other variations on the compile command line, how much does the code change. If two instructions can vary by lets call it 20 times in execution time, how bad can it get for 200 instructions or 2000 instructions? It can get pretty bad.
If you take the dhrystone programs you have right now with the compiler options you have right now, go into your bootstrap, add one nop (causing the entire binary to shift by one instruction in flash) run again. Add two, three, four. You are still not comparing different mcus just running your benchmark on one system.
Run with and without d cache, with and without i cache if you have each of those. turn on and off the flash prefetch if you have one and if you have a write buffer you can turn on and of try that. Still remaining on the same compiler same options same mcu.
Take the different functions in the dhrystone source code, and re-arrange them in the source code. instead of Proc_1, Proc_2, Proc_3 make it Proc_1, Proc_3, Proc_2. Do all of the above again. Re-arrange again, repeat.
Before leaving this mcu you should now see that the execution time of the same source code which is completely unmodified (other than perhaps re-arranging functions) can and will have vastly different execution times.
if you then start changing compiler options or keep the same source and change compilers, you will see even more of a difference in execution time.
How is it possible that dhrystone benchmarks today or from way back had a single result for each platform? Simple, that result was just one of a wide range, not really representing the platform.
So then if you try to compare two different hardware platforms be it the same arm core inside a different mcu from the same vendor or different vendors. the arm core, assuming (which is not a safe assumption) it is the same source with the same compile/build options, even assuming the same verilog compile and synthesis was used. you can have that same core change based on arm provided options. Anyway, how the vendor be it the same vendor in two instances or two different vendors wraps that same core, you will see variations. Then take a completely different core be it another arm or a mips, etc. How could you gain any value comparing those using a program like this that itself varies widely on each platform?
You cant. What you can do is use benchmarks to give the illusion of one thing being better than another, one computer is faster than another, one compiler is faster than another. In order to sell computers or compilers. Sprints coverage is within one percent of Verizons...does that tell us anything useful? Nope.
If you were to eliminate the compiler from the equation, and if these are truly the same "CPU", same rev of source from ARM, built the same way, then they should fetch the same, but the size of the cache is part of that so it may already be a different cpu implementation as the width or depth of the cache can affect things. In software it is like needing a 32 bit pointer rather than a 16 bit pointer (17 bit instead of 16, but you cant have a 17 bit generally in logic you can).
Anyway, if you compile the code under test one time for an address space that is common to both platforms, use that same binary exactly for that space, can attach different bootstrap code as needed, note the C library calls strcpy, etc also have to be the same in the same space between platforms to eliminate the compiler and alignment from messing you up. this may or may not level the playing field.
If you want to believe these are the same cpu, then turn the caches off, eliminate the compiler variations by doing the above. See if they execute the same. Copy the program to ram and run in ram, eliminate the flash issues. I assume you have them both clocked the same with the same wait states in the flash?
if they are the same cpu and the chip vendor has with these two chips made the memory system take the same number of clocks say for ram accesses, and it is really the same cpu, you should be able to get the same time by eliminating optimizations (caching, flash prefetching, alignment).
what you are probably seeing is some form of alignment with how the code lies in memory from the compiler vs cache lines, or it could be much much simpler it could be just the differences in the caches, how the hits and misses work and the 4KB is just more lucky than the 8KB for this particular program compiled a certain way, aligned in memory a certain way, etc.
With the simple two instruction loop above it is easy to see some of the reasons why performance varies on the same system, if your modern cpu fetches 8 instructions at a time and your loop gets too close to the tail end of that fetch, the prefetch may think it needs to fetch another 8 beyond that costing you those clock cycles. Certainly as you exactly straddle two "fetch lines" as I call them with those two instructions it is going to cost you a lot more cycles per loop even with a cache. Same problem happens when those two instructions approach a cache line (as you vary their alignment per test) eventually it takes two cache line reads instead of one to fetch those two instructions. At least the first time through there is an extra cache line read. The extra clocks for that first time through is something you can see using a simple benchmark like that while playing with alignment.
Michael Abrash, Zen of Assembly Language. There is an epub/etc you can build from github of this book. the 8088 was obsolete when this book came out if that is all you see 8088 stuff, then you are completely missing the point. It applies to the most modern processors today, how to view the problem, how to test, how to time the test, how to interpret the results. All the stuff I have mentioned so far, and all the things I know about this that I have not mentioned, all came from that books knowledge applied for however many decades I have been doing this.
So again if you have truly eliminated the compiler, alignment, the cpu, the memory system tied to that cpu, etc and it is down to only the size of the cache varies. Then it is probably related to how the cache line fetches hit and miss differently based on alignment of the code relative to the cache lines for the two caches. One is hitting more and missing less and/or evicting better for this particular binary. You can try rearranging the functions, can add nops, or if you cant get at the bootstrap then add whole functions or more code (another printf, etc) at a lower address in the binary causing the linker to slide the code under test through to different addresses changing how the program lines up with the cache lines. Since the functions in the code under test are so big (more than a few instructions to implement) you would have to start modifying the program in order to get a finer grained adjustment of binary relative to cache lines.
You most definitely should see execution time differences on the two platforms if you adjust alignment and or wholesale re-arrange the binary based on functions being re-arranged.
Bottom line benchmarks dont really tell you much, the results have more of a negative stink to them than a positive joy. Without a re-write a particular benchmark or application may just do better on one platform (be it everything is the same but the size of the cache or two completely different architectures) even when you try to vary the results due to alignment, turning on and off prefetching, write buffering, branch prediction, etc. Pulling out all the tricks you can come up with one may vary from x to y the other from n to m and maybe there is some overlap in the ranges. For these two platforms that you say are similiar except for cache size I would hope you could find a combination where sometimes A is faster than B and at least one combination where B is faster than A, with the same features turned on and off for both in that comparision. If/when you switch to some other application/benchmark this all starts over again, no reason to assume dhrystone results predict any other code under test. The only program that matters, esp on an MCU is the final build of your application. Just remember changing a single line of code, or even adding a single nop in the bootstrap can have dramatic results in performance sometimes several TIMES slower or faster from a single nop.

Out-of-order instruction execution: is commit order preserved?

On the one hand, Wikipedia writes about the steps of the out-of-order execution:
Instruction fetch.
Instruction dispatch to an instruction queue (also called instruction buffer or reservation stations).
The instruction waits in the queue until its input operands are available. The instruction is then allowed to leave the queue before
earlier, older instructions.
The instruction is issued to the appropriate functional unit and executed by that unit.
The results are queued.
Only after all older instructions have their results written back to the register file, then this result is written back to the register file. This is called the graduation or retire stage.
The similar information can be found in the "Computer Organization and Design" book:
To make programs behave as if they were running on a simple in-order
pipeline, the instruction fetch and decode unit is required to issue
instructions in order, which allows dependences to be tracked, and the
commit unit is required to write results to registers and memory in
program fetch order. This conservative mode is called in-order
commit... Today, all dynamically scheduled pipelines use in-order commit.
So, as far as I understand, even if the instructions execution is done in the out-of-order manner, the results of their executions are preserved in the reorder buffer and then committed to the memory/registers in a deterministic order.
On the other hand, there is a known fact that modern CPUs can reorder memory operations for the performance acceleration purposes (for example, two adjacent independent load instructions can be reordered). Wikipedia writes about it here.
Could you please shed some light on this discrepancy?
TL:DR: memory ordering is not the same thing as out of order execution. It happens even on in-order pipelined CPUs.
In-order commit is necessary1 for precise exceptions that can roll-back to exactly the instruction that faulted, without any instructions after that having already retired. The cardinal rule of out-of-order execution is don't break single-threaded code. If you allowed out-of-order commit (retirement) without any kind of other mechanism, you could have a page-fault happen while some later instructions had already executed once, and/or some earlier instructions hadn't executed yet. This would make restarting execution after handing a page-fault impossible the normal way.
(In-order issue/rename and dependency-tracking takes care of correct execution in the normal case of no exceptions.)
Memory ordering is all about what other cores see. Also notice that what you quoted is only talking about committing results to the register file, not to memory.
(Footnote 1: Kilo-instruction Processors: Overcoming the Memory Wall is a theoretical paper about checkpointing state to allow rollback to a consistent machine state at some point before an exception, allowing much larger out-of-order windows without a gigantic ROB of that size. AFAIK, no mainstream commercial designs have used that, but it shows that there are in theory approaches other than strictly in-order retirement to building a usable CPU.
Apple's M1 reportedly has a significantly larger out-of-order window than its x86 contemporaries, but I haven't seen any definite info that it uses anything other than a very large ROB.)
Since each core's private L1 cache is coherent with all the other data caches in the system, memory ordering is a question of when instructions read or write cache. This is separate from when they retire from the out-of-order core.
Loads become globally visible when they read their data from cache. This is more or less when they "execute", and definitely way before they retire (aka commit).
Stores become globally visible when their data is committed to cache. This has to wait until they're known to be non-speculative, i.e. that no exceptions or interrupts will cause a roll-back that has to "undo" the store. So a store can commit to L1 cache as early as when it retires from the out-of-order core.
But even in-order CPUs use a store queue or store buffer to hide the latency of stores that miss in L1 cache. The out-of-order machinery doesn't need to keep tracking a store once it's known that it will definitely happen, so a store insn/uop can retire even before it commits to L1 cache. The store buffer holds onto it until L1 cache is ready to accept it. i.e. when it owns the cache line (Exclusive or Modified state of the MESI cache coherency protocol), and the memory-ordering rules allow the store to become globally visible now.
See also my answer on Write Allocate / Fetch on Write Cache Policy
As I understand it, a store's data is added to the store queue when it "executes" in the out-of-order core, and that's what a store execution unit does. (Store-address writing the address, and store-data writing the data into the store-buffer entry reserved for it at allocation/rename time, so either of those parts can execute first on CPUs where those parts are scheduled separately, e.g. Intel.)
Loads have to probe the store queue so that they see recently-stored data.
For an ISA like x86, with strong ordering, the store queue has to preserve the memory-ordering semantics of the ISA. i.e. stores can't reorder with other stores, and stores can't become globally visible before earlier loads. (LoadStore reordering isn't allowed (nor is StoreStore or LoadLoad), only StoreLoad reordering).
David Kanter's article on how TSX (transactional memory) could be implemented in different ways than what Haswell does provides some insight into the Memory Order Buffer, and how it's a separate structure from the ReOrder Buffer (ROB) that tracks instruction/uop reordering. He starts by describing how things currently work, before getting into how it could be modified to track a transaction that can commit or abort as a group.

How to compare two implementations of the same algorithm? (by examine their Assembly code)

Assume I have two implementations of the same algorithm in assembly. I would like to know by examining the two snippets codes which one is faster.
The parameters I thought one might take into account are: number of op-codes, number of branches, number of function frames.
My questions are:
Can I assume each opcode execution is one cycle ?
What is the overhead of branch which break the pipeline ?
What are the effects and overhead of calling a function ?
Is there a difference in the analysis between ARM and x86 ?
The question is theoretical since I have two implementations; one 130 instructions long and one is 184 instructions long.
And I would like to know if it is definitely true to say the 130 instructions long snippet is faster than the 184 instructions long implementation?
"BETTER == FASTER"
Without wanting to be flippant, the answers are
no
that depends on your hardware
that depends on your hardware
yes
You would really need to test things on your target hardware, or have a simulator that understands your hardware fully, in order to answer your question the way you meant to...
For the last part of your question, you need to define "better"…better.
Since you asked about a Cortex A9, the data sheet has instruction cycle counts in appendix B. These counts generally assume that the memory bus is fast enough to keep the CPU busy. In reality this is rarely the case. Many video/audio algorithms will have a big win in how they access memory.
One cycle per op
Of course you can't assume this if you want an exact count. However, if you are deciding which algorithm to choose, you can get a feel for the best algorithm by looking at the instructions in the inner loop. Here, your cache should allow the code to execute as per the instruction counts in the data sheet. If the counts are close, then you probably need to look at each instruction. Load/stores are more expensive and usually multiples, etc. Some algorithms, especially crytographic, will have big wins by using assembler that doesn't map well to C. For example, clz, ror, using the carry for multi-word arithmetic, etc.
Branch overhead
Look in Appendix B, or whatever data sheet has cycle counts for your processor. For an ARM926 it is about 3 cycles. The compiler only generates two conditional opcodes in a row to avoid branching, otherwise, it branches. If the algorithm is large, the branch may disrupt the cache. A hard answer depends on your CPU, cache, and memory. According to the Cortex A9 datasheet (B.5), there is only one cycle overhead to a fixed branch.
Function overhead
This is much the same as the branch overhead. However, the compiler will also have an influence. noted by Jim Does it cache align functions. Does the compiler perform leaf function optimizations, etc. With modern gcc versions, if all the functions are static, the compiler will generally in-line when it is advantageous. If the algorithms are particularly large, a register spill may be advantageous. However, with your example of 130/184 instructions, this seems unlikely. The compiler options will obviously effect the overhead. You can use objdump -S to examine the prologue/epilogue and then determine the number of cycles for your hardware.
ARM verus x86
Of course there is a technical difference in the cycle counts. The CISC x86 also has variable instruction size. This complicates the analysis. It is slightly easier on the ARM.
Normally, you want to ball park things and then actually run them with a profiler. The estimates can help guide development of the algorithms. Loop/memory tuning, etc for your hardware. Something like instruction emulation, page or alignment faults, etc may be dominant and make all the cycle count analysis meaningless. If the algorithm is in user space, per-emption, may negate cache wins from run to run. It is possible that one algorithm will work better in a little loaded system and the other will work better under a higher load.
A note on cycle counts
See the post-process objdump for some complications in getting cycle counts. Basically a typical CPU is several phases (a pipe line) and different conditions can cause stalls. As CPU's become more complex, the pipe line typically gets longer, meaning there are more conditions or phases which can stall. However, cycle count estimates can be helpful in guiding development of an algorithm and evaluating them. Things like memory timing or branch prediction can be just as important, depending on the algorithm. Ie, cycle counts are not completely useless, but they are not complete either. Profiling should confirm actual algorithm times. If they diverge, instruction re-ordering, pre-fetching and other techniques may bring them closer. The fact that cycle counts and active profiling diverge can be helpful in itself.
It is definitely not true to say that the 130 instruction code is faster than the 184 instruction code. it is very easy to have 1000 instructions run faster than 100 and vice versa on either of these platforms.
1 Can I assume each opcode execution is one cycle ?
Start by looking at the advertised mips/mhz, although a marketing number it gives a rough idea of what is possible. If the number is greater than one then more than one instruction per clock is possible.
2 What is the overhead of branch which break the pipeline ?
Anywhere from absolutely no affect to a very dramatic affect, on either system. one clock to hundreds are the potential penalty.
3 What are the effects and overhead of calling a function ?
Depends heavily on the function, and the function calling the function. Depending on the calling convention you might have to save registers to the stack, or rearrange the contents of registers to prepare for the parameters for the function to be called. If passing a struct by value a copy of the struct may need to be made on the stack, the bigger the struct passed the bigger the copy. once in the function a stack frame may need to be prepared, etc, etc. There are many factors involved. This question and answer are also independent of platform.
4 Is there a difference in the analysis between ARM and x86 ?
yes and no, both systems use all the modern tricks of pipelining, branch prediction, etc to keep the mips/mhz up. ARM is going to give a better mips per mhz than x86, x86 being variable instruction length might give more instructions per unit cache. How you analyze the cache, and memory and peripheral systems in the systems side of the analysis is roughly the same. The comparison of the instructions and core are similar and different depending on what aspects you are analyzing. The arm is not microcoded, the x86 likely is so you dont really see how many registers there really are, things like that. at the same time the x86 you can get a better look at the memory system with the arm, since they are generally not system on a chip. Depending on what ARM chip you buy you may lose a lot of the visibility in the boundaries of the chip, might not see all the memory and peripheral busses, for example. (x86 is changing that by putting pcie on chip now for example) in the case of something in the cortex-a class you mentioned you would have similar edge of chip visibility as those would use larger/cheaper dram based memory off chip rather than microcontroller like on chip resources.
Bottom line your final question:
"And I would like to know if it is definitely true to say the 130 instructions long snippet is faster than the 184 instructions long implementation?"
It is definitely NOT TRUE to say the 130 instruction snippet is faster than the 184 instruction snippet. It might be faster it might be slower and it might be about the same. With a lot more information we might be able to make a pretty good statement or it may still be non-deterministic. it is easy to choose 100 instructions that execute faster than 1000 instructions and likewise easy to choose 1000 instructions that execute faster than 100 instructions (even if I were to add no branching and no loops, just linear execution)
Your question is almost entirely meaningless: It probably depends on your input.
Most CPUs have something resembling a branch misprediction penalty (e.g. traditional ARM which throws away an instruction fetch/decode on any taken branch, IIRC). ARM and x86 also allow conditional execution, which can be faster than branching. If either of these are dependent on input data, then different inputs will follow different code paths.
Perhaps one version heavily uses conditional execution, which is wasteful when the condition is false. Perhaps another was compiled using some profiling information that performs no branches (except the return at the end) for a specific case. There are many, many reason why a compiler can take the same source and produce an "optimized" output which is faster for one input and slower for another.
Many optimizations have this characteristic — for example, aligning the start of a loop to 16 bytes helps on some processors, but not when the loop is only executed once.
Some text book answer to this question from Cortex
™
-A Series Programmer’s Guide, chapter 17.
Although cycle timing information can be found in the Technical Reference Manual (TRM) for the processor that you are using, it is very difficult to work out how many cycles even a trivial piece of code will take to execute. The movement of instructions through the pipeline is dependent on the progress of the surrounding instructions and can be significantly affected by memory system
activity. Pending loads or instruction fetches which miss in the cache can stall code for tens of cycles. Standard data processing instructions (logical and arithmetic) will take only one or two cycles to execute, but this does not give the full picture. Instead, we must use profiling tools, or the system performance monitor built-in to the processor, to extract useful information about performance.
Also read under 17.4 Cortex-A9 micro-architecture optimizations which answers your question very very much.

measuring real running time of an algorithm

Approximately, how many physical instructions of MIPS does an abstract algorithm operation amortize to? As for an abstract algorithm operation, I means a basic operation, such as add, divide, etc.
I see this is not a strict measuring technique :-)
Kejia
There is a list of the basic MIPS instructions here. Most of the "basic operations" that you mentioned are a single MIPS instruction or perhaps two, which probably holds true on most current CPU families.
However this does not take into account at all the architecture and performance characteristics of any of the modern CPUs. Different instructions often have diffrent completion times. Current CPUs usually implement branch prediction, instruction pipelines, memory caching, parallelisation and a whole list of other techniques to make the code execution faster.
Therefore just having the assembly code implementation of an algorithm says nothing about its execution speed. You would have to measure and profile the code on the actual hardware to obtain comparable results. In fact, some algorithms may be far more effective on certain CPUs, even within the same CPU family.
A common and rather understandable example is the effect of the instruction cache. Unrolling a loop will eliminate a number of branch operations, which intuitively makes code faster. If you run that code on a CPU of the same family with very little instruction cache memory, though, the added accesses to the main memory can make it far slower than the simple branch-based loop.
Computers are complicated. If you want to get down to this level you need to start considering what kind of CPU you are using, how well your compiler can use this CPU's instruction set, what variables are being kept in what registers, what are their bit-level representations, etc. Even then, the number of instructions not always easily maps to the actual running time. Different instructions can take different ammounts of clock cycles to execute and this is not even thinking about OS threading and your program's cache miss rate.
In the end, there is a good reason we use big-O notatoin in the first place :)
BTW, most simple operations (add, subtract) on integers should map to a single machine instruction, in case you are worried.
It depends on the CPU architecture. Some processors requires several cycles for a single instruction such as divivide, while others manage to execute all machine code instructions in a single cycle each.
It is sometimes relevant to measure an algorithm in how many floating point operations it requires. However this does not take I/O (such as reading memory) into consideration.
The speed of a CPU is sometimes provided in FLOPS (Floating Point OPerations per Second) which could help to give you a time estimate. Again, not taking I/O into consideration - and not multi-threading issues (also a very important measuring factor).
Donald Knuth addressed this very problem in writing Volume 1 of "The Art of Computer Programming".
In the preface he gives a lengthy justification for presenting algorithms in the assembly code for an imaginary machine -
... To avoid this dilemma, I have
attempted to design an "ideal"
computer called "MIX," with very
simple rules of operation ...
That way, one can talk sensibly about how many "cycles" an algorithm would take, without having to care about differences between machines, caching, latency, pipelines, or any of the other ways computers have been optimized to save time, at the expense of knowing how long they will take.

Resources