Why aren't out of order CPUs troublesome? - cpu

i've recently learned about out-of-order execution CPUs in this link https://en.wikipedia.org/wiki/Out-of-order_execution
There is something that i can't quite understand. Why aren't these kind of CPUs troublesome? I mean, if i have instructions executing out of order, even if they apply to different data, wont i be able to reach a situation where data is not updated according to the program order?
I mean, if i have something like:
x = 1;
y = 2;
x = x+y;
print x;
print y;
what prevents the "print y" instruction from being executed before the "print x"?
Maybe i'm getting something wrong about this kind of CPU, can you explain it to me?
Thanks in advance

In an out-of-order processor, the instructions are executed out-of-order but committed in order. This means that externally visible state becomes visible in order. Writes to memory (including I/O operations) are buffered locally so that they can be released to the external memory system in order, similarly register results are stored locally using register renaming. This allows the local processor to use the early speculative values without violating externally visible ordering. On incorrect speculation, a rollback mechanism is used to restore state back to a previous known-valid state.
Note that technically commitment of results to processor-core-external state does not have to be in order as long as the out-of-order result is non-speculative and does not violate ordering guarantees. With weak memory consistency models, this could (in theory) allow values to become externally visible out-of-order. (I/O is required to be in-order, so the print example would still be required to commit in order.) Furthermore, if other cores know that values are speculative in nature or order, the values could be made externally visible out-of-order (again, in theory) and consumers of the values would rollback on incorrect speculation. (This is extending "externally visible" from external to the single processor core to external to some larger system component which is aware of speculation and supports rollback.)
(In very impractical theory, extending speculative realization to the human computer interface would be possible if the definition of the interface allowed such glitches (i.e., the human corrects for wrongly speculated values and ordering). However, since humans are even slower than most I/O devices (speculations would be resolved on a shorter time scale than is significant) and such an extension of speculation would be extremely complex and generally undesirable, it is unlikely ever to be broadly used.)

#paul-a-clayton already gave a great answer. I would like to mention one thing more.
Earlier out-of-order architectures were notoriously troublesome. The fundamental problem was that they couldn't guarantee precise interrupts. This changed in the late 80s/early 90s due to the solutions proposed in Smith & Pleszkun's seminal paper. They introduced the idea of a reorder buffer which allows instructions to issue out of order but commit in order.

Related

Can CPU Out-of-Order-Execution cause memory reordering?

I know store buffer and invalidate queues are reasons that cause memory reordering. What I don't know is if Out-of-Order-Execution can cause memory reordering.
In my opinion, Out-of-Order-Execution can't cause reordering because the results are always retired in-order as mentioned in this question.
To make my question more clear, let's say we have such an relax memory consistency architecture:
It doesn't have store buffer and invalidate queues
It can do Out-of-Order-Execution
Can memory reordering still happen in this architecture?
Does memory barrier has two functions, one is forbidding the Out-of-Order execution, the other is flushing invalidation queue and draining store buffer?
Yes, out of order execution can definitely cause memory reordering, such as load/load re-ordering
It is not so much a question of the loads being retired in order, as of when the load value is bound to the load instruction. Eg Load1 may precede Load2 in program order, Load2 gets its value from memory before Load1 does, and eg if there is an intervening store to the location read by Load2, then Load/load reordering has occurred.
However, certain systems, such as Intel P6 family systems, have additional mechanisms to detect such conditions to obtain stronger memory order models.
In these systems all loads are buffered until retirement, and if a possible store is detected to such a buffered but not yet retired load, then the load and program order instructions are “nuked”, and execution is resumed art, e.g., Load2.
I call this Freye’s Rule snooping, after I learned that Brad Freye at IBM had invented it many years before I thought I had. I believe the standard academic reference is Gharachorloo.
I.e. it is not so much buffering loads until retirement, as it is providing such a detection and correction mechanism associated with buffering loads until retirement. Many CPUs provide buffering until retirement but do not provide this detection mechanism.
Note also that this requires something like snoop based cache coherence. Many systems, including Intel systems that have such mechanisms also support noncoherent memory, e.g. memory that may be cached but which is managed by software. If speculative loads are allowed to such cacheable but non-coherent memory regions, the Freye’s Rule mechanism will not work and memory will be weakly ordered.
Note: I said “buffer until retirement”, but if you think about it you can easily come up with ways of buffering not quite until retirement. E.g. you can stop this snooping when all earlier loads have them selves been bound, and there is no longer any possibility of an intervening store being observed even transitively.
This can be important, because there is quite a lot of performance to be gained by “early retirement“, removing instructions such as loads from buffering and repair mechanisms before all earlier instructions have retired. Early retirement can greatly reduce the cost of out of order hardware mechanisms.

Which std::sync::atomic::Ordering to use?

All the methods of std::sync::atomic::AtomicBool take a memory ordering (Relaxed, Release, Acquire, AcqRel, and SeqCst), which I have not used before. Under what circumstances should these values be used? The documentation uses confusing “load” and “store” terms which I don’t really understand. For example:
A producer thread mutates some state held by a Mutex, then calls AtomicBool::compare_and_swap(false, true, ordering) (to coalesce invalidations), and if it swapped, posts an “invalidate” message to a concurrent queue (e.g. mpsc or a winapi PostMessage). A consumer thread resets the AtomicBool, reads from the queue, and reads the state held by the Mutex. Can the producer use Relaxed ordering because it is preceded by a mutex, or must it use Release? Can the consumer use store(false, Relaxed), or must it use compare_and_swap(true, false, Acquire) to receive the changes from the mutex?
What if the producer and consumer share a RefCell instead of a Mutex?
I'm not an expert on this, and it's really complicated, so please feel free to critique my post. As pointed out by mdh.heydari, cppreference.com has much better documentation of orderings than Rust (C++ has an almost identical API).
For your question
You'd need to use "release" ordering in your producer and "acquire" ordering in your consumer. This ensures that the data mutation occurs before the AtomicBool is set to true.
If your queue is asynchronous, then the consumer will need to keep trying to read from it in a loop, since the producer could get interrupted between setting the AtomicBool and putting something in the queue.
If the producer code might run multiple times before client runs, then you can't use RefCell because they could mutate the data while the client is reading it. Otherwise it's fine.
There are other better and simpler ways to implement this pattern, but I assume you were just giving it as an example.
What are orderings?
The different orderings have to do with what another thread sees happen when an atomic operation occurs. Compilers and CPUs are normally both allowed to reorder instructions in order to optimize code, and the orderings effect how much they're allowed to reorder instructions.
You could just always use SeqCst, which basically guarantees everyone will see that instruction as having occurred wherever you put it relative to other instructions, but in some cases if you specify a less restrictive ordering then LLVM and the CPU can better optimize your code.
You should think of these orderings as applying to a memory location (instead of applying to an instruction).
Ordering Types
Relaxed Ordering
There are no constraints besides any modification to the memory location being atomic (so it either happens completely or not at all). This is fine for something like a counter if the values retrieved by/set by individual threads don't matter as long as they're atomic.
Acquire Ordering
This constraint says that any variable reads that occur in your code after "acquire" is applied can't be reordered to occur before it. So, say in your code you read some shared memory location and get value X, which was stored in that memory location at time T, and then you apply the "acquire" constraint. Any memory locations that you read from after applying the constraint will have the value they had at time T or later.
This is probably what most people would expect to happen intuitively, but because a CPU and optimizer are allowed to reorder instructions as long as they don't change the result, it isn't guaranteed.
In order for "acquire" to be useful, it has to be paired with "release", because otherwise there's no guarantee that the other thread didn't reorder its write instructions that were supposed to occur at time T to an earlier time.
Acquire-reading the flag value you're looking for means you won't see a stale value somewhere else that was actually changed by a write before the release-store to the flag.
Release Ordering
This constraint says that any variable writes that occur in your code before "release" is applied can't be reordered to occur after it. So, say in your code you write to a few shared memory locations and then set some memory location t at time T, and then you apply the "release" constraint. Any writes that appear in your code before "release" is applied are guaranteed to have occurred before it.
Again, this is what most people would expect to happen intuitively, but it isn't guaranteed without constraints.
If the other thread trying to read value X doesn't use "acquire", then it isn't guaranteed to see the new value with respect to changes in other variable values. So it could get the new value, but it might not see new values for any other shared variables. Also keep in mind that testing is hard. Some hardware won't in practice show re-ordering with some unsafe code, so problems can go undetected.
Jeff Preshing wrote a nice explanation of acquire and release semantics, so read that if this isn't clear.
AcqRel Ordering
This does both Acquire and Release ordering (ie. both restrictions apply). I'm not sure when this is necessary - it might be helpful in situations with 3 or more threads if some Release, some Acquire, and some do both, but I'm not really sure.
SeqCst Ordering
This is most restrictive and, therefore, slowest option. It forces memory accesses to appear to occur in one, identical order to every thread. This requires an MFENCE instruction on x86 on all writes to atomic variables (full memory barrier, including StoreLoad), while the weaker orderings don't. (SeqCst loads don't require a barrier on x86, as you can see in this C++ compiler output.)
Read-Modify-Write accesses, like atomic increment, or compare-and-swap, are done on x86 with locked instructions, which are already full memory barriers. If you care at all about compiling to efficient code on non-x86 targets, it makes sense to avoid SeqCst when you can, even for atomic read-modify-write ops. There are cases where it's needed, though.
For more examples of how atomic semantics turn into ASM, see this larger set of simple functions on C++ atomic variables. I know this is a Rust question, but it's supposed to have basically the same API as C++. godbolt can target x86, ARM, ARM64, and PowerPC. Interestingly, ARM64 has load-acquire (ldar) and store-release (stlr) instructions, so it doesn't always have to use separate barrier instructions.
By the way, x86 CPUs are always "strongly ordered" by default, which means they always act as if at least AcqRel mode was set. So for x86 "ordering" only affects how LLVM's optimizer behaves. ARM, on the other hand, is weakly ordered. Relaxed is set by default, to allow the compiler full freedom to reorder things, and to not require extra barrier instructions on weakly-ordered CPUs.

How to understand acquire and release semantics?

I found out three function from MSDN , below:
1.InterlockedDecrement().
2.InterlockedDecrementAcquire().
3.InterlockedDecrementRelease().
I knew those fucntion use to decrement a value as an atomic operation, but i don't know distinction between the three function
(um... but don't ask me what does it mean exactly)
I'll take a stab at that.
Something to remember is that the compiler, or the CPU itself, may reorder memory reads and writes if they appear to not deal with each other.
This is useful, for instance, if you have some code that, maybe is updating a structure:
if ( playerMoved ) {
playerPos.X += dx;
playerPos.Y += dy;
// Keep the player above the world's surface.
if ( playerPos.Z + dz > 0 ) {
playerPos.Z += dz;
}
else {
playerPos.Z = 0;
}
}
Most of above statements may be reordered because there's no data dependency between them, in fact, a superscalar CPU may execute most of those statements simultaneously, or maybe would start working on the Z section sooner, since it doesn't affect X or Y, but might take longer.
Here's the problem with that - lets say that you're attempting lock-free programming. You want to perform a whole bunch of memory writes to, maybe, fill in a shared queue. You signal that you're done by finally writing to a flag.
Well, since that flag appears to have nothing to do with the rest of the work being done, the compiler and the CPU may reorder those instructions, and now you may set your 'done' flag before you've actually committed the rest of the structure to memory, and now your "lock-free" queue doesn't work.
This is where Acquire and Release ordering semantics come into play. I set that I'm doing work by setting a flag or so with an Acquire semantic, and the CPU guarantees that any memory games I play after that instruction stay actually below that instruction. I set that I'm done by setting a flag or so with a Release semantic, and the CPU guarantees that any memory games I had done just before the release actually stay before the release.
Normally, one would do this using explicit locks - mutexes, semaphores, etc, in which the CPU already knows it has to pay attention to memory ordering. The point of attempting to create 'lock free' data structures is to provide data structures that are thread safe (for some meaning of thread safe), that don't use explicit locks (because they are very slow).
Creating lock-free data structures is possible on a CPU or compiler that doesn't support acquire/release ordering semantics, but it usually means that some slower memory ordering semantic is used. For instance, you could issue a full memory barrier - everything that came before this instruction has to actually be committed before this instruction, and everything that came after this instruction has to be committed actually after this instruction. But that might mean that I wait for a bunch of actually irrelevant memory writes from earlier in the instruction stream (perhaps function call prologue) that has nothing to do with the memory safety I'm trying to implement.
Acquire says "only worry about stuff after me". Release says "only worry about stuff before me". Combining those both is a full memory barrier.
http://preshing.com/20120913/acquire-and-release-semantics/
Acquire semantics is a property which can only apply to operations
which read from shared memory, whether they are read-modify-write
operations or plain loads. The operation is then considered a
read-acquire. Acquire semantics prevent memory reordering of the
read-acquire with any read or write operation which follows it in
program order.
Release semantics is a property which can only apply to operations
which write to shared memory, whether they are read-modify-write
operations or plain stores. The operation is then considered a
write-release. Release semantics prevent memory reordering of the
write-release with any read or write operation which precedes it in
program order.
(um... but don't ask me what does it mean exactly)

Seeking articles on shared memory locking issues

I'm reviewing some code and feel suspicious of the technique being used.
In a linux environment, there are two processes that attach multiple
shared memory segments. The first process periodically loads a new set
of files to be shared, and writes the shared memory id (shmid) into
a location in the "master" shared memory segment. The second process
continually reads this "master" location and uses the shmid to attach
the other shared segments.
On a multi-cpu host, it seems to me it might be implementation dependent
as to what happens if one process tries to read the memory while it's
being written by the other. But perhaps hardware-level bus locking prevents
mangled bits on the wire? It wouldn't matter if the reading process got
a very-soon-to-be-changed value, it would only matter if the read was corrupted
to something that was neither the old value nor the new value. This is an edge case: only 32 bits are being written and read.
Googling for shmat stuff hasn't led me to anything that's definitive in this
area.
I suspect strongly it's not safe or sane, and what I'd really
like is some pointers to articles that describe the problems in detail.
It is legal -- as in the OS won't stop you from doing it.
But is it smart? No, you should have some type of synchronization.
There wouldn't be "mangled bits on the wire". They will come out either as ones or zeros. But there's nothing to say that all your bits will be written out before another process tries to read them. And there are NO guarantees on how fast they'll be written vs how fast they'll be read.
You should always assume there is absolutely NO relationship between the actions of 2 processes (or threads for that matter).
Hardware level bus locking does not happen unless you get it right. It can be harder then expected to make your compiler / library / os / cpu get it right. Synchronization primitives are written to makes sure it happens right.
Locking will make it safe, and it's not that hard to do. So just do it.
#unknown - The question has changed somewhat since my answer was posted. However, the behavior you describe is defiantly platform (hardware, os, library and compiler) dependent.
Without giving the compiler specific instructions, you are actually not guaranteed to have 32 bits written out in one shot. Imagine a situation where the 32 bit word is not aligned on a word boundary. This unaligned access is acceptable on x86, and in the case of the x68, the access is turned into a series of aligned accesses by the cpu.
An interrupt can occurs between those operations. If a context switch happens in the middle, some of the bits are written, some aren't. Bang, You're Dead.
Also, lets think about 16 bit cpus or 64 bit cpus. Both of which are still popular and don't necessarily work the way you think.
So, actually you can have a situation where "some other cpu-core picks up a word sized value 1/2 written to". You write you code as if this type of thing is expected to happen if you are not using synchronization.
Now, there are ways to preform your writes to make sure that you get a whole word written out. Those methods fall under the category of synchronization, and creating synchronization primitives is the type of thing that's best left to the library, compiler, os, and hardware designers. Especially if you are interested in portability (which you should be, even if you never port your code)
The problem's actually worse than some of the people have discussed. Zifre is right that on current x86 CPUs memory writes are atomic, but that is rapidly ceasing to be the case - memory writes are only atomic for a single core - other cores may not see the writes in the same order.
In other words if you do
a = 1;
b = 2;
on CPU 2 you might see location b modified before location 'a' is. Also if you're writing a value that's larger than the native word size (32 bits on an x32 processor) the writes are not atomic - so the high 32 bits of a 64 bit write will hit the bus at a different time from the low 32 bits of the write. This can complicate things immensely.
Use a memory barrier and you'll be ok.
You need locking somewhere. If not at the code level, then at the hardware memory cache and bus.
You are probably OK on a post-PentiumPro Intel CPU. From what I just read, Intel made their later CPUs essentially ignore the LOCK prefix on machine code. Instead the cache coherency protocols make sure that the data is consistent between all CPUs. So if the code writes data that doesn't cross a cache-line boundary, it will work. The order of memory writes that cross cache-lines isn't guaranteed, so multi-word writes are risky.
If you are using anything other than x86 or x86_64 then you are not OK. Many non-Intel CPUs (and perhaps Intel Itanium) gain performance by using explicit cache coherency machine commands, and if you do not use them (via custom ASM code, compiler intrinsics, or libraries) then writes to memory via cache are not guaranteed to ever become visible to another CPU or to occur in any particular order.
So just because something works on your Core2 system doesn't mean that your code is correct. If you want to check portability, try your code also on other SMP architectures like PPC (an older MacPro or a Cell blade) or an Itanium or an IBM Power or ARM. The Alpha was a great CPU for revealing bad SMP code, but I doubt you can find one.
Two processes, two threads, two cpus, two cores all require special attention when sharing data through memory.
This IBM article provides an excellent overview of your options.
Anatomy of Linux synchronization methods
Kernel atomics, spinlocks, and mutexes
by M. Tim Jones (mtj#mtjones.com), Consultant Engineer, Emulex
http://www.ibm.com/developerworks/linux/library/l-linux-synchronization.html
I actually believe this should be completely safe (but is depends on the exact implementation). Assuming the "master" segment is basically an array, as long as the shmid can be written atomically (if it's 32 bits then probably okay), and the second process is just reading, you should be okay. Locking is only needed when both processes are writing, or the values being written cannot be written atomically. You will never get a corrupted (half written values). Of course, there may be some strange architectures that can't handle this, but on x86/x64 it should be okay (and probably also ARM, PowerPC, and other common architectures).
Read Memory Ordering in Modern Microprocessors, Part I and Part II
They give the background to why this is theoretically unsafe.
Here's a potential race:
Process A (on CPU core A) writes to a new shared memory region
Process A puts that shared memory ID into a shared 32-bit variable (that is 32-bit aligned - any compiler will try to align like this if you let it).
Process B (on CPU core B) reads the variable. Assuming 32-bit size and 32-bit alignment, it shouldn't get garbage in practise.
Process B tries to read from the shared memory region. Now, there is no guarantee that it'll see the data A wrote, because you missed out the memory barrier. (In practise, there probably happened to be memory barriers on CPU B in the library code that maps the shared memory segment; the problem is that process A didn't use a memory barrier).
Also, it's not clear how you can safely free the shared memory region with this design.
With the latest kernel and libc, you can put a pthreads mutex into a shared memory region. (This does need a recent version with NPTL - I'm using Debian 5.0 "lenny" and it works fine). A simple lock around the shared variable would mean you don't have to worry about arcane memory barrier issues.
I can't believe you're asking this. NO it's not safe necessarily. At the very least, this will depend on whether the compiler produces code that will atomically set the shared memory location when you set the shmid.
Now, I don't know Linux, but I suspect that a shmid is 16 to 64 bits. That means it's at least possible that all platforms would have some instruction that could write this value atomically. But you can't depend on the compiler doing this without being asked somehow.
Details of memory implementation are among the most platform-specific things there are!
BTW, it may not matter in your case, but in general, you have to worry about locking, even on a single CPU system. In general, some device could write to the shared memory.
I agree that it might work - so it might be safe, but not sane.
The main question is if this low-level sharing is really needed - I am not an expert on Linux, but I would consider to use for instance a FIFO queue for the master shared memory segment, so that the OS does the locking work for you. Consumer/producers usually need queues for synchronization anyway.
Legal? I suppose. Depends on your "jurisdiction". Safe and sane? Almost certainly not.
Edit: I'll update this with more information.
You might want to take a look at this Wikipedia page; particularly the section on "Coordinating access to resources". In particular, the Wikipedia discussion essentially describes a confidence failure; non-locked access to shared resources can, even for atomic resources, cause a misreporting / misrepresentation of the confidence that an action was done. Essentially, in the time period between checking to see whether or not it CAN modify the resource, the resource gets externally modified, and therefore, the confidence inherent in the conditional check is busted.
I don't believe anybody here has discussed how much of an impact lock contention can have over the bus, especially on bus bandwith constrained systems.
Here is an article about this issue in some depth, they discuss some alternative schedualing algorythems which reduse the overall demand on exclusive access through the bus. Which increases total throughput in some cases over 60% than a naieve scheduler (when considering the cost of an explicit lock prefix instruction or implicit xchg cmpx..). The paper is not the most recent work and not much in the way of real code (dang academic's) but it worth the read and consideration for this problem.
More recent CPU ABI's provide alternative operations than simple lock whatever.
Jeffr, from FreeBSD (author of many internal kernel components), discusses monitor and mwait, 2 instructions added for SSE3, where in a simple test case identified an improvement of 20%. He later postulates;
So this is now the first stage in the
adaptive algorithm, we spin a while,
then sleep at a high power state, and
then sleep at a low power state
depending on load.
...
In most cases we're still idling in
hlt as well, so there should be no
negative effect on power. In fact, it
wastes a lot of time and energy to
enter and exit the idle states so it
might improve power under load by
reducing the total cpu time required.
I wonder what would be the effect of using pause instead of hlt.
From Intel's TBB;
ALIGN 8
PUBLIC __TBB_machine_pause
__TBB_machine_pause:
L1:
dw 090f3H; pause
add ecx,-1
jne L1
ret
end
Art of Assembly also uses syncronization w/o the use of lock prefix or xchg. I haven't read that book in a while and won't speak directly to it's applicability in a user-land protected mode SMP context, but it's worth a look.
Good luck!
If the shmid has some type other than volatile sig_atomic_t then you can be pretty sure that separate threads will get in trouble even on the very same CPU. If the type is volatile sig_atomic_t then you can't be quite as sure, but you still might get lucky because multithreading can do more interleaving than signals can do.
If the shmid crosses cache lines (partly in one cache line and partly in another) then while the writing cpu is writing you sure find a reading cpu reading part of the new value and part of the old value.
This is exactly why instructions like "compare and swap" were invented.
Sounds like you need a Reader-Writer Lock : http://en.wikipedia.org/wiki/Readers-writer_lock.
The answer is - it's absolutely safe to do reads and writes simultaneously.
It is clear that the shm mechanism
provides bare-bones tools for the
user. All access control must be taken
care of by the programmer. Locking and
synchronization is being kindly
provided by the kernel, this means the
user have less worries about race
conditions. Note that this model
provides only a symmetric way of
sharing data between processes. If a
process wishes to notify another
process that new data has been
inserted to the shared memory, it will
have to use signals, message queues,
pipes, sockets, or other types of IPC.
From Shared Memory in Linux article.
The latest Linux shm implementation just uses copy_to_user and copy_from_user calls, which are synchronised with memory bus internally.

How can you insure your code runs with no variability in execution time due to cache?

In an embedded application (written in C, on a 32-bit processor) with hard real-time constraints, the execution time of critical code (specially interrupts) needs to be constant.
How do you insure that time variability is not introduced in the execution of the code, specifically due to the processor's caches (be it L1, L2 or L3)?
Note that we are concerned with cache behavior due to the huge effect it has on execution speed (sometimes more than 100:1 vs. accessing RAM). Variability introduced due to specific processor architecture are nowhere near the magnitude of cache.
If you can get your hands on the hardware, or work with someone who can, you can turn off the cache. Some CPUs have a pin that, if wired to ground instead of power (or maybe the other way), will disable all internal caches. That will give predictability but not speed!
Failing that, maybe in certain places in the software code could be written to deliberately fill the cache with junk, so whatever happens next can be guaranteed to be a cache miss. Done right, that can give predictability, and perhaps could be done only in certain places so speed may be better than totally disabling caches.
Finally, if speed does matter - carefully design the software and data as if in the old day of programming for an ancient 8-bit CPU - keep it small enough for it all to fit in L1 cache. I'm always amazed at how on-board caches these days are bigger than all of RAM on a minicomputer back in (mumble-decade). But this will be hard work and takes cleverness. Good luck!
Two possibilities:
Disable the cache entirely. The application will run slower, but without any variability.
Pre-load the code in the cache and "lock it in". Most processors provide a mechanism to do this.
It seems that you are referring to x86 processor family that is not built with real-time systems in mind, so there is no real guarantee for constant time execution (CPU may reorder micro-instructions, than there is branch prediction and instruction prefetch queue which is flushed each time when CPU wrongly predicts conditional jumps...)
This answer will sound snide, but it is intended to make you think:
Only run the code once.
The reason I say that is because so much will make it variable and you might not even have control over it. And what is your definition of time? Suppose the operating system decides to put your process in the wait queue.
Next you have unpredictability due to cache performance, memory latency, disk I/O, and so on. These all boil down to one thing; sometimes it takes time to get the information into the processor where your code can use it. Including the time it takes to fetch/decode your code itself.
Also, how much variance is acceptable to you? It could be that you're okay with 40 milliseconds, or you're okay with 10 nanoseconds.
Depending on the application domain you can even further just mask over or hide the variance. Computer graphics people have been rendering to off screen buffers for years to hide variance in the time to rendering each frame.
The traditional solutions just remove as many known variable rate things as possible. Load files into RAM, warm up the cache and avoid IO.
If you make all the function calls in the critical code 'inline', and minimize the number of variables you have, so that you can let them have the 'register' type.
This should improve the running time of your program. (You probably have to compile it in a special way since compilers these days tend to disregard your 'register' tags)
I'm assuming that you have enough memory not to cause page faults when you try to load something from memory. The page faults can take a lot of time.
You could also take a look at the generated assembly code, to see if there are lots of branches and memory instuctions that could change your running code.
If an interrupt happens in your code execution it WILL take longer time. Do you have interrupts/exceptions enabled?
Understand your worst case runtime for complex operations and use timers.

Resources