LPC1768 / ARM Cortex-M3 microsecond delay - gcc

I'm trying to implement a microsecond delay in a bare metal ARM environment (LPC1768) with GCC. I've seen examples that use the SysTick timer to generate an interrupt that then does some counting in C, which is used as a time base:
https://bitbucket.org/jpc/lpc1768/src/dea43fb213ff/main.c
However, at a 12 MHz system clock I don't think that will scale very well to microsecond delays; the processor would spend essentially all its time servicing the interrupt.
Is it possible to query the value of SYSTICK_GetCurrentValue in a loop, determine how many ticks make up a microsecond, and bail out of the loop once the elapsed ticks exceed that number?
I'd rather not use a separate hardware timer for this (but will if there is no other choice).

One way is just to use a loop to create the delay, something like the one shown below. You need to calibrate the factor yourself; a more general-purpose approach is to calculate it at startup against some known timebase.
#include <stdint.h>

#define CAL_FACTOR ( 100 )

void delay(uint32_t interval)
{
    uint32_t iterations = interval / CAL_FACTOR;
    for (uint32_t i = 0; i < iterations; ++i)
    {
        __asm__ volatile // gcc-ish syntax, don't know what compiler is used
        (
            "nop\n\t"
            "nop\n\t"
            :::
        );
    }
}
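One way to avoid hand-tuning CAL_FACTOR is to derive the iteration count from the core clock instead. A minimal sketch, assuming a CMSIS environment where SystemCoreClock holds the core clock in Hz, and with CYCLES_PER_ITERATION as a placeholder that still has to be verified once against a scope or a known-good timer:
#include <stdint.h>

extern uint32_t SystemCoreClock;        // provided by CMSIS startup code

// Placeholder: clocks consumed by one pass of the loop below (nop + branch).
// Verify this once against an oscilloscope or a hardware timer.
#define CYCLES_PER_ITERATION 4u

void delay_us(uint32_t us)
{
    uint32_t iterations = (SystemCoreClock / 1000000u) * us / CYCLES_PER_ITERATION;
    for (uint32_t i = 0; i < iterations; ++i)
    {
        __asm__ volatile ("nop");       // keeps the compiler from optimizing the loop away
    }
}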

First off, interrupts are not needed for this sort of thing; you can poll the timer, no need for the overkill of an interrupt. Yes, there are reasons why those examples use interrupts, but that doesn't mean it is the only way to use a timer.
Guy Sirton's answer is sound, but I prefer assembler as I can control it exactly to the clock cycle (so long as there are no interrupts or other things that get in the way). A timer is usually easier though, as the code is a bit more portable (change the processor's clock frequency and you have to re-tune the loop; with a timer, sometimes all you have to do is change the init code to use a different prescaler, or change the one line looking for the computed count), and it tolerates interrupts and such things in the system.
In this case though you are talking about 12 MHz and one microsecond; that is 12 instructions, yes? Put in 12 nops. Or branch to some assembler with 10 nops or 8, whatever it comes out to, to compensate for the pipeline flush on the two branches. A timer and interrupt are going to burn more than 12 instruction cycles in overhead. Even polling the timer in a loop is going to be sloppy. A counter loop would work too, but you need to understand the branch costs and tune for that:
delay_one_us:
    mov r0,#3
wait:
    sub r0,#1   @ cortex-m3 means thumb/thumb2, and gas complains about subs
    bne wait
    nop         @ might need some nops to tune the loop accurately
    nop
    bx lr
Call this function, say, 30 million times in a loop, toggling a GPIO LED or UART output, and check with a stopwatch that the blinks are 30 seconds apart.
    ldr r4,=uart_tx_register_address
    mov r5,#0x55
again:
    ldr r6,=24000000
    str r5,[r4]
top:
    bl delay_one_us
    sub r6,#1
    bne top
    str r5,[r4]
    b again
Actually, since I assumed 2 clocks per branch, the test loop has 3 clocks and the delay is assumed to be 12 clocks total, so 15 clocks per loop. 30 seconds is 30,000,000 microseconds, so ideally 30 million loops, but I needed 12/15ths that number of loops to compensate. This is far easier if you have an oscilloscope whose timebase is somewhat accurate, or at least as accurate as you want this delay to be.
I have not studied ARM's branch costs myself, otherwise I would comment on them; they are likely two or three clocks. So the mov is one, the sub is one times the number of loops, the bne is let's say two times the number of loops, plus two for the branch to get here and two for the return: 5 + (3*loops) + nops = 12, so (3*loops) + nops = 7, loops is 2 and nops is 1, yes? I think stringing a number of nops together is far easier:
delay_one_us:
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    bx lr
You might have to burn a few more instructions temporarily disabling interrupts, if you use them. If you are looking for "at least" one microsecond then don't worry about it.

You can use the SysTick timer inside the ARM core for that. Just program it to count every 1 µs, or finer if you have enough clock speed, and spin in a do-while loop until your delay value expires.
Like this:
void WaitUs(int us) {
    unsigned int cnt;
    while (us-- > 0) {
        cnt = STK_VAL;                 // sample the SysTick counter, ticking every 500 ns here
        while ((cnt - STK_VAL) < 2);   // SysTick counts down, so spin until 2 ticks have elapsed
    }
}
Bear in mind that this is just an example; you'll need to adjust it for counter roll-over, among other things.
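For the LPC1768 specifically, here is a minimal sketch of the polling approach the question asks about, written against the CMSIS register names (SysTick->VAL, SystemCoreClock). It assumes SysTick is already running from the core clock with the full 24-bit reload value, and it handles roll-over by doing the subtraction modulo 2^24:
#include <stdint.h>
#include "LPC17xx.h"                    // CMSIS device header: SysTick, SystemCoreClock

// Busy-wait roughly 'us' microseconds by polling the SysTick down-counter.
// Assumes SysTick->LOAD == 0x00FFFFFF and the counter is enabled.
void delay_us(uint32_t us)
{
    uint32_t ticks_per_us = SystemCoreClock / 1000000u;   // 12 at a 12 MHz core clock
    uint32_t target  = us * ticks_per_us;
    uint32_t start   = SysTick->VAL;
    uint32_t elapsed = 0;

    while (elapsed < target)
    {
        uint32_t now = SysTick->VAL;
        elapsed += (start - now) & 0x00FFFFFFu;  // down-counter, 24-bit modular difference
        start = now;
    }
}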

Related

Loop takes less than 1 cycle despite dependency between iterations

I wanted to benchmark the time needed to do a single addition on my Skylake (i5-6500) CPU. C is low-level enough for me, so I wrote the following code:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>   // __rdtscp

int main(void) {
    // Initializing stuffs
    int a = rand();
    int b = rand();
    const unsigned long loop_count = 1000000000;
    unsigned int ignored; // used for __rdtscp

    // Warming up whatever needs to be warmed up
    for (int i = 0; i < 100000; i++) {
        asm volatile("" : "+r" (a)); // prevents Clang from replacing the loop with a multiplication
        a += b;
    }

    // The actual measurement
    uint64_t timer = __rdtscp(&ignored);
    for (unsigned long i = 0; i < loop_count; i++) {
        asm volatile("" : "+r" (a)); // prevents Clang from replacing the loop with a multiplication
        a += b;
    }
    timer = __rdtscp(&ignored) - timer;

    printf("%.2f cycles/iteration\n", (double)timer / loop_count);
    return 0;
}
Compiling with Clang 7.0.0 -O3, I get the following assembly (for the loop only):
# %bb.2:
rdtscp
movq %rdx, %rdi
movl %ecx, 4(%rsp)
shlq $32, %rdi
orq %rax, %rdi
movl $1000000000, %eax # imm = 0x3B9ACA00
.p2align 4, 0x90
.LBB0_3: # =>This Inner Loop Header: Depth=1
#APP
#NO_APP
addl %esi, %ebx
addq $-1, %rax
jne .LBB0_3
# %bb.4:
rdtscp
And running this code outputs
0.94 cycles/iteration
(or a number pretty much always between 0.93 and 0.96)
I'm surprised that this loop can execute in less than 1 cycle/iteration, since there is a data dependency on a that should prevent parallel execution of a += b.
IACA also confirms that the expected throughput is 0.96 cycles. llvm-mca on the other hand predicts a total of 104 cycles to execute 100 iterations of the loop. (I can edit in the traces if needed; let me know)
I observe a similar behavior when I use SSE registers rather than general purpose ones.
I can imagine that the CPU is smart enough to notice that b is constant and, since addition is commutative, it could unroll the loop and optimize the additions somehow. However, I've never heard nor read anything about this. And furthermore, if this were what's going on, I'd expect better performance (i.e. fewer cycles/iteration) than 0.94 cycles/iteration.
What is going on? How is this loop able to execute in less than 1 cycle per iteration?
Some background, for completeness. Ignore the remaining of the question if you're not interested in why I'm trying to benchmark a single addition.
I know that there are tools (llvm-exegesis, for instance) designed to benchmark a single instruction and that I should use them instead (or just look at Agner Fog's docs). However, I'm actually trying to compare three different additions: one doing a single addition in a loop (the object of my question); one doing 3 additions per loop (on SSE registers, which should maximize port usage and not be limited by data dependencies); and one where the addition is implemented as a circuit in software. While the results are mostly as I expected, the 0.94 cycles/iteration for the version with a single addition in a loop left me puzzled.
The core frequency and the TSC frequency can be different. Your loop is expected to run at 1 core cycle per iteration. If the core frequency happens to be twice the TSC frequency for the duration of the loop execution, the throughput would be 0.5 TSC cycles per iteration, which is equivalent to 1 core cycle per iteration.
In your case, it appears that the average core frequency was slightly higher than the TSC frequency. If you don't want to take dynamic frequency scaling into account when doing experiments, it's easier to just fix the core frequency to be equal to the TSC frequency so that you don't have to convert the numbers. Otherwise, you'd have to measure the average core frequency as well.
On processors that support per-core frequency scaling, you have to either fix the frequency on all the cores or pin the experiments to a single core with fixed frequency. Alternatively, instead of measuring in TSC cycles, you can use a tool like perf to easily measure the time in core cycles or seconds.
See also: How to get the CPU cycle count in x86_64 from C++?.
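If you do end up measuring the average core frequency separately, converting the TSC measurement to core cycles is just a ratio. A trivial sketch (the frequency numbers are placeholders you'd fill in for your own machine):
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    double tsc_hz  = 3.4e9;          // placeholder: your TSC frequency
    double core_hz = 3.6e9;          // placeholder: measured average core frequency

    uint64_t tsc_ticks = 940000000;  // e.g. measured over 1e9 iterations
    double core_cycles = (double)tsc_ticks * core_hz / tsc_hz;

    printf("%.2f core cycles/iteration\n", core_cycles / 1e9);
    return 0;
}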

What's up with the "half fence" behavior of rdtscp?

For many years x86 CPUs supported the rdtsc instruction, which reads the "time stamp counter" of the current CPU. The exact definition of this counter has changed over time, but on recent CPUs it is a counter that increments at a fixed frequency with respect to wall clock time, so it is very useful as a building block for a fast, accurate clock or for measuring the time taken by small segments of code.
One important fact about the rdtsc instruction is that it isn't ordered in any special way with the surrounding code. Like most instructions, it can be freely reordered with respect to other instructions that it isn't in a dependency relationship with. This is actually "normal" and for most instructions it's just a mostly invisible way of making the CPU faster (that's just a long-winded way of saying out-of-order execution).
For rdtsc it is important because it means you might not be timing the code you expect to be timing. For example, given the following sequence1:
rdtsc
mov ecx, eax
mov rdi, [rdi]
mov rdi, [rdi]
rdtsc
You might expect rdtsc to measure the latency of the two pointer-chasing loads mov rdi, [rdi]. In practice, however, even if both of these loads take a long time (100s of cycles if they miss in the cache), you'll get a fairly small reading for the rdtsc pair. The problem is that the second rdtsc doesn't wait for the loads to finish; it just executes out of order, so you aren't timing the interval you think you are. Perhaps both rdtsc instructions even execute before the first load starts, depending on how rdi was calculated in the code prior to this example.
So far, this is sounding more like an answer to a question nobody asked than a real question, but I'm getting there.
You have two basic use-cases for rdtsc:
As a quick timestamp, in which case you usually don't care exactly how it reorders with the surrounding code, since you probably don't have an instruction-level concept of where the timestamp should be taken anyway.
As a precise timing mechanism, e.g., in a micro-benchmark. In this case you'll usually protect your rdtsc from re-ordering with the lfence instruction. For the example above, you might do something like:
lfence
rdtsc
lfence
mov ecx, eax
...
lfence
rdtsc
This ensures the timed instructions (...) don't escape outside of the timed region, and also ensures instructions from outside the timed region don't come in (probably less of a problem, but they may compete for resources with the code you want to measure).
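In C you can express the same fencing pattern with intrinsics instead of raw asm; a minimal sketch (the loop in the middle is just placeholder work standing in for the code being timed):
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>                  // __rdtsc, _mm_lfence

int main(void)
{
    volatile uint64_t sink = 0;

    _mm_lfence();
    uint64_t start = __rdtsc();
    _mm_lfence();

    for (int i = 0; i < 1000; i++)      // placeholder for the timed region
        sink += (uint64_t)i;

    _mm_lfence();
    uint64_t end = __rdtsc();

    printf("elapsed: %llu TSC ticks\n", (unsigned long long)(end - start));
    return 0;
}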
Years later, Intel looked down upon us poor programmers and came up with a new instruction: rdtscp. Like rdtsc it returns a reading of the time stamp counter, and this guy does something more: it reads a core-specific MSR value atomically with the timestamp reading. On most OSes this contains a core ID value. I think the idea is that this value can be used to properly adjust the returned value to real time on CPUs that may have different TSC offsets per core.
Great.
The other thing rdtscp introduced was half-fencing in terms of out-of-order execution:
From the manual:
The RDTSCP instruction is not a serializing instruction, but it does wait until all previous instructions have executed and all previous loads are globally visible. But it does not wait for previous stores to be globally visible, and subsequent instructions may begin execution before the read operation is performed.
So it's like putting an lfence before the rdtscp, but not after. What is the point of this half-fencing behavior? If you want a general timestamp and don't care about instruction ordering, the unfenced behavior is what you want. If you want to use this for timing short code sections, the half-fencing behavior is useful only for the second (final) reading, but not for the initial reading, since the fence is on the "wrong" side (in practice you want fences on both sides, but having them on the inside is probably the most important).
What purpose does such half-fencing serve?
1 I'm ignoring the upper 32-bits of the counter in this case.

Why are loops always compiled into "do...while" style (tail jump)?

When trying to understand assembly (with compiler optimization on), I see this behavior:
A very basic loop like this
outside_loop;
while (condition) {
statements;
}
Is often compiled into (pseudocode)
; outside_loop
jmp loop_condition ; unconditional
loop_start:
loop_statements
loop_condition:
condition_check
jmp_if_true loop_start
; outside_loop
However, if the optimization is not turned on, it compiles to normally understandable code:
loop_condition:
condition_check
jmp_if_false loop_end
loop_statements
jmp loop_condition ; unconditional
loop_end:
According to my understanding, the compiled code better resembles this:
goto condition;
do {
statements;
condition:
}
while (condition_check);
I can't see a huge performance boost or code readability boost, so why is this often the case? Is there a name for this loop style, for example "trailing condition check"?
Related: asm loop basics: While, Do While, For loops in Assembly Language (emu8086)
Terminology: Wikipedia says "loop inversion" is the name for turning a while(x) into if(x) do{}while(x), putting the condition at the bottom of the loop where it belongs.
Fewer instructions / uops inside the loop = better. Structuring the code outside the loop to achieve this is very often a good idea.
Sometimes this requires "loop rotation" (peeling part of the first iteration so the actual loop body has the conditional branch at the bottom). So you do some of the first iteration and maybe skip the loop entirely, then fall into the loop. Sometimes you also need some code after the loop to finish the last iteration.
Sometimes loop rotation is extra useful if the last iteration is a special case, e.g. a store you need to skip. This lets you implement a while(1) {... ; if(x)break; ...; } loop as a do-while, or put one of the conditions of a multiple-condition loop at the bottom.
Some of these optimizations are related to or enable software pipelining, e.g. loading something for the next iteration. (OoO exec on x86 makes SW pipelining not very important these days but it's still useful for in-order cores like many ARM. And unrolling with multiple accumulators is still very valuable for hiding loop-carried FP latency in a reduction loop like a dot product or sum of an array.)
do{}while() is the canonical / idiomatic structure for loops in asm on all architectures, get used to it. IDK if there's a name for it; I would say such a loop has a "do while structure". If you want names, you could call the while() structure "crappy unoptimized code" or "written by a novice". :P Loop-branch at the bottom is universal, and not even worth mentioning as a Loop Optimization. You always do that.
This pattern is so widely used that on CPUs that use static branch prediction for branches without an entry in the branch-predictor caches, unknown forward conditional branches are predicted not-taken, unknown backwards branches are predicted taken (because they're probably loop branches). See Static branch prediction on newer Intel processors on Matt Godbolt's blog, and Agner Fog's branch-prediction chapter at the start of his microarch PDF.
This answer ended up using x86 examples for everything, but much of this applies across the board for all architectures. I wouldn't be surprised if other superscalar / out-of-order implementations (like some ARM, or POWER) also have limited branch-instruction throughput whether they're taken or not. But fewer instructions inside the loop is nearly universal when all you have is a conditional branch at the bottom, and no unconditional branch.
If the loop might need to run zero times, compilers more often put a test-and-branch outside the loop to skip it, instead of jumping to the loop condition at the bottom. (i.e. if the compiler can't prove the loop condition is always true on the first iteration).
BTW, this paper calls transforming while() to if(){ do{}while; } an "inversion", but loop inversion usually means inverting a nested loop. (e.g. if the source loops over a row-major multi-dimensional array in the wrong order, a clever compiler could change for(i) for(j) a[j][i]++; into for(j) for(i) a[j][i]++; if it can prove it's correct.) But I guess you can look at the if() as a zero-or-one iteration loop. Fun fact, compiler devs teaching their compilers how to invert a loop (to allow auto-vectorization) for a (very) specific case is why SPECint2006's libquantum benchmark is "broken". Most compilers can't invert loops in the general case, just ones that look almost exactly like the one in SPECint2006...
You can help the compiler make more compact asm (fewer instructions outside the loop) by writing do{}while() loops in C when you know the caller isn't allowed to pass size=0 or whatever else guarantees a loop runs at least once.
(Actually 0 or negative for signed loop bounds. Signed vs. unsigned loop counters is a tricky optimization issue, especially if you choose a narrower type than pointers; check your compiler's asm output to make sure it isn't sign-extending a narrow loop counter inside the loop every time you use it as an array index. But note that signed can actually be helpful, because the compiler can assume that i++ <= bound will eventually become false, because signed overflow is UB but unsigned overflow isn't. So with unsigned, while(i++ <= bound) is infinite if bound = UINT_MAX.) I don't have a blanket recommendation for when to use signed vs. unsigned; size_t is often a good choice for looping over arrays, but if you want to avoid the x86-64 REX prefixes in the loop overhead (for a trivial saving in code size) and convince the compiler not to waste any instructions zero- or sign-extending, it can be tricky.
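For example, a sketch of what that looks like in C, assuming the caller guarantees n >= 1 so the compiler can emit just the bottom-of-loop branch:
#include <stddef.h>

// Sum an array the caller promises is non-empty (n >= 1).
long sum_nonempty(const int *a, size_t n)
{
    long total = 0;
    size_t i = 0;
    do {                    // no test before the first iteration
        total += a[i];
        i++;
    } while (i < n);        // single conditional branch at the bottom
    return total;
}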
I can't see a huge performance boost
Here's an example where that optimization will give speedup of 2x on Intel CPUs before Haswell, because P6 and SnB/IvB can only run branches on port 5, including not-taken conditional branches.
Required background knowledge for this static performance analysis: Agner Fog's microarch guide (read the Sandybridge section). Also read his Optimizing Assembly guide, it's excellent. (Occasionally outdated in places, though.) See also other x86 performance links in the x86 tag wiki. See also Can x86's MOV really be "free"? Why can't I reproduce this at all? for some static analysis backed up by experiments with perf counters, and some explanation of fused vs. unfused domain uops.
You could also use Intel's IACA software (Intel Architecture Code Analyzer) to do static analysis on these loops.
; sum(int []) using SSE2 PADDD (dword elements)
; edi = pointer, esi = end_pointer.
; scalar cleanup / unaligned handling / horizontal sum of XMM0 not shown.
; NASM syntax
ALIGN 16 ; not required for max performance for tiny loops on most CPUs
.looptop: ; while (edi<end_pointer) {
cmp edi, esi ; 32-bit code so this can macro-fuse on Core2
jae .done ; 1 uop, port5 only (macro-fused with cmp)
paddd xmm0, [edi] ; 1 micro-fused uop, p1/p5 + a load port
add edi, 16 ; 1 uop, p015
jmp .looptop ; 1 uop, p5 only
; Sandybridge/Ivybridge ports each uop can use
.done: ; }
This is 4 total fused-domain uops (with macro-fusion of the cmp/jae), so it can issue from the front-end into the out-of-order core at one iteration per clock. But in the unfused domain there are 4 ALU uops and Intel pre-Haswell only has 3 ALU ports.
More importantly, port5 pressure is the bottleneck: This loop can execute at only one iteration per 2 cycles because cmp/jae and jmp both need to run on port5. Other uops stealing port5 could reduce practical throughput somewhat below that.
Writing the loop idiomatically for asm, we get:
ALIGN 16
.looptop: ; do {
paddd xmm0, [edi] ; 1 micro-fused uop, p1/p5 + a load port
add edi, 16 ; 1 uop, p015
cmp edi, esi ; 1 uop, port5 only (macro-fused with jb)
jb .looptop ; } while(edi < end_pointer);
Notice right away, independent of everything else, that this is one fewer instruction in the loop. This loop structure is at least slightly better on everything from simple non-pipelined 8086 through classic RISC (like early MIPS), especially for long-running loops (assuming they don't bottleneck on memory bandwidth).
Core2 and later should run this at one iteration per clock, twice as fast as the while(){}-structured loop, if memory isn't a bottleneck (i.e. assuming L1D hits, or at least L2 actually; this is only SSE2 16-bytes per clock).
This is only 3 fused-domain uops, so can issue at better than one per clock on anything since Core2, or just one per clock if issue groups always end with a taken branch.
But the important part is that port5 pressure is vastly reduced: only cmp/jb needs it. The other uops will probably be scheduled to port5 some of the time and steal cycles from loop-branch throughput, but this will be a few % instead of a factor of 2. See How are x86 uops scheduled, exactly?.
Most CPUs that normally have a taken-branch throughput of one per 2 cycles can still execute tiny loops at 1 per clock. There are some exceptions, though. (I forget which CPUs can't run tight loops at 1 per clock; maybe Bulldozer-family? Or maybe just some low-power CPUs like VIA Nano.) Sandybridge and Core2 can definitely run tight loops at one per clock. They even have loop buffers; Core2 has a loop buffer after instruction-length decode but before regular decode. Nehalem and later recycle uops in the queue that feeds the issue/rename stage. (Except on Skylake with microcode updates; Intel had to disable the loop buffer because of a partial-register merging bug.)
However, there is a loop-carried dependency chain on xmm0: Intel CPUs have 1-cycle latency paddd, so we're right up against that bottleneck, too. add edi, 16 is also 1 cycle latency. On Bulldozer-family, even integer vector ops have 2c latency, so that would bottleneck the loop at 2c per iteration. (AMD since K8 and Intel since SnB can run two loads per clock, so we need to unroll anyway for max throughput.) With floating point, you definitely want to unroll with multiple accumulators. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators).
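As a C-level sketch of the multiple-accumulator idea (n is assumed to be a multiple of 4, and with strict FP semantics you have to reassociate by hand like this, or use -ffast-math, for the compiler to do it for you):
#include <stddef.h>

// Four independent accumulators so the loop isn't serialized on a single
// loop-carried FP addition; they are combined once after the loop.
float sum4(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}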
If I'd used an indexed addressing mode, like paddd xmm0, [edi + eax], I could have used sub eax, 16 / jnc at the loop condition. SUB/JNC can macro-fuse on Sandybridge-family, but the indexed load would un-laminate on SnB/IvB (but stay fused on Haswell and later, unless you use the AVX form).
; index relative to the end of the array, with an index counting up towards zero
add rdi, rsi ; rdi = end_pointer
xor eax, eax
sub eax, esi ; eax = -length, so [rdi+rax] = first element
.looptop: ; do {
paddd xmm0, [rdi + rax]
add eax, 16
jl .looptop ; } while(idx+=16 < 0); // or JNC still works
(It's usually better to unroll some to hide the overhead of pointer increments instead of using indexed addressing modes, especially for stores, partly because indexed stores can't use the port7 store AGU on Haswell+.)
On Core2/Nehalem add/jl don't macro-fuse, so this is 3 fused-domain uops even in 64-bit mode, without depending on macro-fusion. Same for AMD K8/K10/Bulldozer-family/Ryzen: no fusion of the loop condition, but PADDD with a memory operand is 1 m-op / uop.
On SnB, paddd un-laminates from the load, but add/jl macro-fuse, so again 3 fused-domain uops. (But in the unfused domain, only 2 ALU uops + 1 load, so probably fewer resource conflicts reducing throughput of the loop.)
On HSW and later, this is 2 fused-domain uops because an indexed load can stay micro-fused with PADDD, and add/jl macro-fuses. (Predicted-taken branches run on port 6, so there are never resource conflicts.)
Of course, the loops can only run at best 1 iteration per clock because of taken branch throughput limits even for tiny loops. This indexing trick is potentially useful if you had something else to do inside the loop, too.
But all of these loops had no unrolling
Yes, that exaggerates the effect of loop overhead. But gcc doesn't unroll by default even at -O3 (unless it decides to fully unroll). It only unrolls with profile-guided optimization to let it know which loops are hot. (-fprofile-use). You can enable -funroll-all-loops, but I'd only recommend doing that on a per-file basis for a compilation unit you know has one of your hot loops that needs it. Or maybe even on a per-function basis with an __attribute__, if there is one for optimization options like that.
So this is highly relevant for compiler-generated code. (But clang does default to unrolling tiny loops by 4, or small loops by 2, and extremely importantly, using multiple accumulators to hide latency.)
Benefits with very low iteration count:
Consider what happens when the loop body should run once or twice: There's a lot more jumping with anything other than do{}while.
For do{}while, execution is a straight-line with no taken branches and one not-taken branch at the bottom. This is excellent.
For an if() { do{}while; } that might run the loop zero times, it's two not-taken branches. That's still very good. (Not-taken is slightly cheaper for the front-end than taken when both are correctly predicted).
For a jmp-to-the-bottom jmp; do{}while(), it's one taken unconditional branch, one taken loop condition, and then the loop branch is not-taken. This is kinda clunky but modern branch predictors are very good...
For a while(){} structure, this is one not-taken loop exit, one taken jmp at the bottom, then one taken loop-exit branch at the top.
With more iterations, each loop structure does one more taken branch. while(){} also does one more not-taken branch per iteration, so it quickly becomes obviously worse.
The latter two loop structures have more jumping around for small trip counts.
Jumping to the bottom also has a disadvantage for non-tiny loops that the bottom of the loop might be cold in L1I cache if it hasn't run for a while. Code fetch / prefetch is good at bringing code to the front-end in a straight line, but if prediction didn't predict the branch early enough, you might have a code miss for the jump to the bottom. Also, parallel decode will probably have (or could have) decoded some of the top of the loop while decoding the jmp to the bottom.
Conditionally jumping over a do{}while loop avoids all that: you only jump forwards into code that hasn't been run yet in cases where the code you're jumping over shouldn't run at all. It often predicts very well because a lot of code never actually takes 0 trips through the loop. (i.e. it could have been a do{}while, the compiler just didn't manage to prove it.)
Jumping to the bottom also means the core can't start working on the real loop body until after the front-end chases two taken branches.
There are cases with complicated loop conditions where it's easiest to write it this way, and the performance impact is small, but compilers often avoid it.
Loops with multiple exit conditions:
Consider a memchr loop, or a strchr loop: they have to stop at the end of the buffer (based on a count) or the end of an implicit-length string (0 byte). But they also have to break out of the loop if they find a match before the end.
So you'll often see a structure like
do {
if () break;
blah blah;
} while(condition);
Or just two conditions near the bottom. Ideally you can test multiple logical conditions with the same actual instruction (e.g. 5 < x && x < 25 using sub eax, 6 / cmp eax, 18 / ja .outside_range, the unsigned-compare trick for range checking, or combine that with an OR to check for alphabetic characters of either case in 4 instructions) but sometimes you can't, and just need to use an if()break style loop-exit branch as well as a normal backwards taken branch.
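A concrete C sketch of that shape (a simplified memchr-style search, not the library implementation):
#include <stddef.h>

// Return a pointer to the first occurrence of c in buf[0..n), or NULL.
const unsigned char *find_byte(const unsigned char *buf, size_t n, unsigned char c)
{
    if (n == 0)
        return NULL;            // conditionally skip the loop entirely
    do {
        if (*buf == c)
            return buf;         // early exit: the if () break; case
        buf++;
    } while (--n != 0);         // normal backwards taken loop branch
    return NULL;
}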
Further reading:
Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” for good ways to look at compiler output (e.g. what kind of inputs give interesting output, and a primer on reading x86 asm for beginners). related: How to remove "noise" from GCC/clang assembly output?
Modern Microprocessors: A 90-Minute Guide! A detailed look at superscalar pipelined CPUs, mostly architecture neutral. Very good. Explains instruction-level parallelism and stuff like that.
Agner Fog's x86 optimization guide and microarch pdf. This will take you from being able to write (or understand) correct x86 asm to being able to write efficient asm (or see what the compiler should have done).
other links in the x86 tag wiki, including Intel's optimization manuals. Also several of my answers (linked in the tag wiki) have things that Agner missed in his testing on more recent microarchitectures (like un-lamination of micro-fused indexed addressing modes on SnB, and partial register stuff on Haswell+).
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators): how to use multiple accumulators to hide latency of a reduction loop (like an FP dot product).
Lecture 7: Loop Transformations (also on archive.org). Lots of cool stuff that compilers do to loops, using C syntax to describe the asm.
https://en.wikipedia.org/wiki/Loop_optimization
Sort of off topic:
Memory bandwidth is almost always important, but it's not widely known that a single core on most modern x86 CPUs can't saturate DRAM, and not even close on many-core Xeons where single-threaded bandwidth is worse than on a quad-core with dual channel memory controllers.
How much of ‘What Every Programmer Should Know About Memory’ is still valid? (my answer has commentary on what's changed and what's still relevant in Ulrich Drepper's well-known excellent article.)

Deterministic execution time on x86-64

Is there an x64 instruction(s) that takes a fixed amount of time, regardless of the micro-architectural state such as caches, branch predictors, etc.?
For instance, if a hypothetical add or increment instruction always takes n cycles, then I can implement a timer in my program by performing that add instruction multiple times. Perhaps an increment instruction with register operands may work, but it's not clear to me whether Intel's spec guarantees that it would take a deterministic number of cycles. Note that I am not interested in the current time, but only in a primitive / instruction sequence that takes a fixed number of cycles.
Assume that I have a way to force atomic execution i.e. no context switches during timer's execution i.e. only my program gets to run.
On a related note, I also cannot use system services to keep track of time, because I am working in a setting where my program is a user-level program running on an untrusted OS.
The x86 ISA documents don't guarantee anything about what takes a certain amount of cycles. The ISA allows things like Transmeta's Crusoe that JIT-compiled x86 instructions to an internal VLIW instruction set. It could conceivably do optimizations between adjacent instructions.
The best you can do is write something that will work on as many known microarchitectures as possible. I'm not aware of any x86-64 microarchitectures that are "weird" like Transmeta, only the usual superscalar decode-to-uops designs like Intel and AMD use.
Simple integer ALU instructions like ADD are almost all 1c latency, and tiny loops that don't touch memory are almost totally unaffected by anything, and are very predictable. If they run a lot of iterations, they're also almost totally unaffected by the impact of surrounding code on the out-of-order core, and they recover very quickly from disruptions like timer interrupts.
On nearly every Intel microarchitecture, this loop will run at one iteration per clock:
mov ecx, 1234567 ; or use a 64-bit register for higher counts.
ALIGN 16
.loop:
sub ecx, 1 ; not dec because of Pentium 4.
jnz .loop
Agner Fog's microarch guide and instruction tables say that VIA Nano3000 has a taken-branch throughput of one per 3 cycles, so this loop would only run at one iteration per 3 clocks there. AMD Bulldozer-family and Jaguar similarly have a max throughput of one taken JCC per 2 clocks.
See also other performance links in the x86 tag wiki.
If you want a more power-efficient loop, you could use PAUSE in the loop, but it waits ~100 cycles on Skylake, up from ~5 cycles on previous microarchitectures. (You can make cycle-accurate predictions for more complicated loops that don't touch memory, but that depends on microarchitectural details.)
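For reference, a power-friendly spin written with the pause intrinsic might look like the sketch below; as noted above, the time per pause varies wildly across microarchitectures, so this is not a calibrated delay:
#include <x86intrin.h>          // _mm_pause

// Execute roughly 'n' pause instructions; ~5 cycles each before Skylake,
// ~100 cycles each on Skylake, so wall-clock time is CPU-dependent.
void spin_pause(unsigned long n)
{
    while (n--)
        _mm_pause();
}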
You could make a more reliable loop that's less likely to have different bottlenecks on different CPUs by making a longer dependency chain within each iteration. Since each instruction depends on the previous one, it can still only run at one instruction per cycle (not counting the branch), drastically reducing the number of branches per cycle.
# one add/sub per clock, limited by latency
# should run one iteration per 6 cycles on every CPU listed in Agner Fog's tables
# And should be the same on all future CPUs unless they do magic inter-instruction optimizations.
# Or it could be slower on CPUs that always have a bubble on taken branches, but it seems unlikely anyone would design one.
ALIGN 16
.loop:
add ecx, 1
sub ecx, 1 ; net result ecx+0
add ecx, 1
sub ecx, 1 ; net result ecx+0
add ecx, 1
sub ecx, 2 ; net result ecx-1
jnz .loop
Unrolling like this ensures that front-end effects are not a bottleneck. It gives the frontend decoders plenty of time to queue up the 6 add/sub insns and the jcc before the next branch.
Using add/sub instead of dec/inc avoids a partial-flag false dependency on Pentium 4. (Although I don't think that would be an issue anyway.)
Pentium4's double-clocked ALUs can each run two ADDs per clock, but the latency is still one cycle. i.e. apparently it can't forward a result internally to chew through this dependency chain twice as fast as any other CPU.
And yes, Prescott P4 is an x86-64 CPU, so we can't quite ignore P4 if we need a general purpose answer.

Can I atomically increment a 16 bit counter on x86/x86_64?

I want to save memory by converting an existing 32 bit counter to a 16 bit counter. This counter is atomically incremented/decremented. If I do this:
1. What instructions do I use for atomic_inc(uint16_t x) on x86/x86_64?
2. Is this reliable in multi-processor x86/x86_64 machines?
3. Is there a performance penalty to pay on any of these architectures for doing this?
4. If yes for (3), what's the expected performance penalty?
Thanks for your comments!
Here's one that uses GCC assembly extensions, as an alternative to Steve's Delphi answer:
#include <stdint.h>

uint16_t atomic_inc(uint16_t volatile* ptr)
{
    uint16_t value = 1;
    __asm__ volatile("lock xadd %w0, %1" : "+r" (value), "+m" (*ptr));
    return ++value;
}
Change the 1 to -1, and the ++ to --, for decrement.
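On current compilers you often don't need inline asm at all; a hedged alternative using C11 <stdatomic.h> (which typically compiles to the same lock xadd on x86 when the return value is used) would be:
#include <stdatomic.h>
#include <stdint.h>

uint16_t atomic_inc16(_Atomic uint16_t *ptr)
{
    // atomic_fetch_add returns the old value; add 1 to match the
    // "return the incremented value" behavior of the asm version above.
    return (uint16_t)(atomic_fetch_add(ptr, 1) + 1);
}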
Here is a Delphi function that works:
function LockedInc( var Target :WORD ) :WORD;
asm
mov ecx, eax
mov ax, 1
Lock xadd [ecx], ax
Inc eax
end;
I guess you could convert it to whichever language you require.
The simplest way to perform an atomic increase is as follows (this is inline ASM):
asm
lock inc dword ptr Counter;
end;
where Counter is an integer variable. This will directly increase Counter in its memory location.
I have tested this with brute force and it works 100%.
To answer the other three questions:
2. Yes, this is reliable in a multiprocessor environment.
3. Yes, there is a performance penalty.
4. The expected penalty is roughly as follows:
The "lock" prefix locks down the busses, not only for the processor, but for any external hardware, which may want to access the bus via DMA (mass storage, graphics...). So it is slow, typically ~100 clock cycles, but it may be more costly. But if you have "megabytes" of counters, chances are, you will be facing a cache miss, and in this case you will have to wait about ~100 clocks anyway (the memory access time), in case of a page miss, several hundred, so the overhead from lock might not matter.
