"Well-parallelized" algorithm not sped up by multiple threads - parallel-processing

I'm sorry to ask a question one a topic that I know so little about, but this idea has really been bugging me and I haven't been able to find any answers on the internet.
Background:
I was talking to one of my friends who is in computer science research. I'm in mostly ad-hoc development, so my understanding of a majority of CS concepts is at a functional level (I know how to use them rather than how they work). He was saying that converting a "well-parallelized" algorithm that had been running on a single thread into one that ran on multiple threads didn't result in the processing speed increase that he was expecting.
Reasoning:
I asked him what the architecture of the computer he was running this algorithm on was, and he said 16-core (non-virtualized). According to what I know about multi-core processors, the processing speed increase of an algorithm running on multiple cores should be roughly proportional to how well it is parallelized.
Question:
How can an algorithm that is "well-parallelized" and programmed correctly to run on a true multi-core processor not run several times more quickly? Is there some information that I'm missing here, or is it more likely a problem with the implementation?
Other stuff: I asked if the threads were possibly taking up more power than any individual core had available and apparently each core runs at 3.4 GHz. This is much more than the algorithm should need, and when diagnostics are run the cores aren't maxed out during runtime.

It is likely sharing something. What is being shared may not be obvious.
One of the most common non-obvious shared resources is CPU cache. If the threads are updating the same cache line that cache line has to bounce between CPUs, slowing everything down.
That can happen because of accessing (even read-only) variables which are near to each other in memory. If all accesses are read-only it is OK, but if even one CPU is writing to that cache line it will force a bounce.
A brute-force method of fixing this is to put shared variables into structures that look like:
struct var_struct {
int value;
char padding[128];
};
Instead of hard-coding 128 you could research what system parameter or preprocessor macros define the cache-line size for your system type.
Another place that sharing can take place is inside system calls. Even seemingly innocent functions might be taking global locks. I seem to recall reading about Linux fixing an issue like this a while back with locks on the functions that return process and thread identifiers and parent identifiers.

Performance versus number of cores is often a S-like curve - first it obviously increases but as locking, shared cache and the like take they debt the further cores do not add so much and even may degrade. Hence nothing mysterious. If we would know more details about the algorithm it may be possible to find an idea to speed it up.

Related

many-core CPU's: Programming techniques to avoid disappointing scalability

We've just bought a 32-core Opteron machine, and the speedups we get are a little disappointing: beyond about 24 threads we see no speedup at all (actually gets slower overall) and after about 6 threads it becomes significantly sub-linear.
Our application is very thread-friendly: our job breaks down into about 170,000 little tasks which can each be executed separately, each taking 5-10 seconds. They all read from the same memory-mapped file of size about 4Gb. They make occasional writes to it, but it might be 10,000 reads to each write - we just write a little bit of data at the end of each of the 170,000 tasks. The writes are lock-protected. Profiling shows that the locks are not a problem. The threads use a lot of JVM memory each in non-shared objects and they make very little access to shared JVM objects and of that, only a small percentage of accesses involve writes.
We're programming in Java, on Linux, with NUMA enabled. We have 128Gb RAM. We have 2 Opteron CPU's (model 6274) of 16 cores each. Each CPU has 2 NUMA nodes. The same job running on an Intel quad-core (i.e. 8 cores) scaled nearly linearly up to 8 threads.
We've tried replicating the read-only data to have one-per-thread, in the hope that most lookups can be local to a NUMA node, but we observed no speedup from this.
With 32 threads, 'top' shows the CPU's 74% "us" (user) and about 23% "id" (idle). But there are no sleeps and almost no disk i/o. With 24 threads we get 83% CPU usage. I'm not sure how to interpret 'idle' state - does this mean 'waiting for memory controller'?
We tried turning NUMA on and off (I'm referring to the Linux-level setting that requires a reboot) and saw no difference. When NUMA was enabled, 'numastat' showed only about 5% of 'allocation and access misses' (95% of cache misses were local to the NUMA node). [Edit:] But adding "-XX:+useNUMA" as a java commandline flag gave us a 10% boost.
One theory we have is that we're maxing out the memory controllers, because our application uses a lot of RAM and we think there are a lot of cache misses.
What can we do to either (a) speed up our program to approach linear scalability, or (b) diagnose what's happening?
Also: (c) how do I interpret the 'top' result - does 'idle' mean 'blocked on memory controllers'? and (d) is there any difference in the characteristics of Opteron vs Xeon's?
I also have a 32 core Opteron machine, with 8 NUMA nodes (4x6128 processors, Mangy Cours, not Bulldozer), and I have faced similar issues.
I think the answer to your problem is hinted at by the 2.3% "sys" time shown in top. In my experience, this sys time is the time the system spends in the kernel waiting for a lock. When a thread can't get a lock it then sits idle until it makes its next attempt. Both the sys and idle time are a direct result of lock contention. You say that your profiler is not showing locks to be the problem. My guess is that for some reason the code causing the lock in question is not included in the profile results.
In my case a significant cause of lock contention was not the processing I was actually doing but the work scheduler that was handing out the individual pieces of work to each thread. This code used locks to keep track of which thread was doing which piece of work. My solution to this problem was to rewrite my work scheduler avoiding mutexes, which I have read do not scale well beyond 8-12 cores, and instead use gcc builtin atomics (I program in C on Linux). Atomic operations are effectively a very fine grained lock that scales much better with high core counts. In your case if your work parcels really do take 5-10s each it seems unlikely this will be significant for you.
I also had problems with malloc, which suffers horrible lock issues in high core count situations, but I can't, off the top of my head, remember whether this also led to sys & idle figures in top, or whether it just showed up using Mike Dunlavey's debugger profiling method (How can I profile C++ code running in Linux?). I suspect it did cause sys & idle problems, but I draw the line at digging through all my old notes to find out :) I do know that I now avoid runtime mallocs as much as possible.
My best guess is that some piece of library code you are using implements locks without your knowledge, is not included in your profiling results, and is not scaling well to high core-count situations. Beware memory allocators!
I'm sure the answer will lie in a consideration of the hardware architecture. You have to think of multi core computers as if they were individual machines connected by a network. In fact that's all that Hypertransport and QPI are.
I find that to solve these scalability problems you have to stop thinking in terms of shared memory and start adopting the philosophy of Communicating Sequential Processes. It means thinking very differently, ie imagine how you would write the software if your hardware was 32 single core machines connected by a network. Modern (and ancient) CPU architectures are not designed to give unfettered scaling of the sort you're after. They are designed to allow many different processes to get on with processing their own data.
Like everything else in computing these things go in fashions. CSP dates back to the 1970s, but the very modern and Java derived Scala is a popular embodiment of the concept. See this section on Scala concurrency on Wikipedia.
What the philosophy of CSP does is force you to design a data distribution scheme that fits your data and the problem you're solving. That's not necessarily easy, but if you manage it then you have a solution that will scale very well indeed. Scala may make it easier to develop.
Personally I do everything in CSP and in C. It's allowed me to develop a signal processing application that scales perfectly linearly from 8 cores to several thousand cores (the limit being how big my room is).
The first thing you're going to have to do is actually use NUMA. It isn't a magic setting that you turn on, you have to exploit it in your software's architecture. I don't know about Java, but in C one would bind a memory allocation to a specific core's memory controller (aka memory affinity), and similarly for threads (core affinity) in cases where the OS doesn't get the hint.
I presume that your data doesn't break down into 32 neat, discrete chunks? It's difficult to give advice without knowing exactly the data flows implicit in your program. But think about it in terms of data flow. Draw it out even; Data Flow Diagrams are useful for this (another ancient graphical formal notation). If your picture shows all your data going through a single object (eg through a single memory buffer) then it's going to be slow...
I assume you have optimized your locks, and synchronization made a minimum. In such a case, it still depends a lot on what libraries you are using to program in parallel.
One issue that can happen even if you have no synchronization issue, is memory bus congestion. This is very nasty and difficult to get rid of.
All I can suggest is somehow make your tasks bigger and create fewer tasks. This depends highly on the nature of your problem. Ideally you want as many tasks as the number of cores/threads, but this is not easy (if possible) to achieve.
Something else that can help is to give more heap to your JVM. This will reduce the need to run Garbage Collector frequently, and speeds up a little.
does 'idle' mean 'blocked on memory controllers'
No. You don't see that in top. I mean if the CPU is waiting for memory access, it will be shown as busy. If you have idle periods, it is either waiting for a lock, or for IO.
I'm the Original Poster. We think we've diagnosed the issue, and it's not locks, not system calls, not memory bus congestion; we think it's level 2/3 CPU cache contention.
To reiterate, our task is embarrassingly parallel so it should scale well. However, one thread has a large amount of CPU cache it can access, but as we add more threads, the amount of CPU cache each process can access gets lower and lower (the same amount of cache divided by more processes). Some levels on some architectures are shared between cores on a die, some are even shared between dies (I think), and it may help to get "down in the weeds" with the specific machine you're using, and optimise your algorithms, but our conclusion is that there's not a lot we can do to achieve the scalability we thought we'd get.
We identified this as the cause by using 2 different algorithms. The one which accesses more level 2/3 cache scales much worse than the one which does more processing with less data. They both make frequent accesses to the main data in main memory.
If you haven't tried that yet: Look at hardware-level profilers like Oracle Studio has (for CentOS, Redhat, and Oracle Linux) or if you are stuck with Windows: Intel VTune. Then start looking at operations with suspiciously high clocks per instruction metrics. Suspiciously high mean a lot higher than the same code on a single-numa, single-L3-cache machine (like current Intel desktop CPUs).

Performing numerical calculations in a virtualized setting?

I'm wondering what kind of performance hit numerical calculations will have in a virtualized setting? More specifically, what kind of performance loss can I expect from running CPU-bound C++ code in a virtualized windows OS as opposed to a native Linux one, on rather fast x86_64 multi-core machines?
I'll be happy to add precisions as needed, but as I don't know much about virtualization, I don't know what info is relevant.
Processes are just bunches of threads which are streams of instructions executing in a sequential fashion. In modern virtualisation solutions, as far as the CPU is concerned, the host and the guest processes execute together and differ only in that the I/O of the latter is being trapped and virtualised. Memory is also virtualised but that occurs more or less in the hardware MMU. Guest instructions are directly executed by the CPU otherwise it would not be virtualisation but rather emulation and as long as they do not access any virtualised resources they would execute just as fast as the host instructions. At the end it all depends on how well the CPU could cope with the increased number of running processes.
There are lightweight virtualisation solutions like zones in Solaris that partition the process space in order to give the appearance of multiple copies of the OS but it all happens under the umbrella of a single OS kernel.
The performance hit for pure computational codes is very small, often under 1-2%. The catch is that in reality all programs read and write data and computational codes usually read and write lots of data. Virtualised I/O is usually much slower than direct I/O even with solutions like Intel VT-* or AMD-V.
Exact numbers depend heavily on the specific hardware.
Goaded by #Mitch Wheat's unarguable assertion that my original post here was not an answer, here's an attempt to recast it as an answer:
I work mostly on HPC in the energy sector. Some of the computations that my scientist colleagues run take O(10^5) CPU-hours, we're seriously thinking about O(10^6) CPU-hours jobs in the near future.
I get well paid to squeeze every last drop of performance out of our codes, I'd think it was a good day's work if I could knock 1% off the run-time of some of our programs. Sometimes it has taken me a month to get that sort of performance improvement, sure I may be slow, but it's still cost-effective for our scientists.
I shudder therefore, when bright salespeople offering the latest and best in data center software (of which virtualization is one aspect) which will only, as I see it, shackle my codes to a pile of anchor chain from a 250,00dwt tanker (that was a metaphor).
I have read the question carefully and understand that OP is not proposing that virtualization would help, I'm offering the perspective of a practitioner. If this is still too much of a comment, do the SO thing and vote to close, I promise I won't be offended !

Are there practical limits to the number of cores accessing the same memory?

Will the current trend of adding cores to computers continue? Or is there some theoretical or practical limit to the number of cores that can be served by one set of memory?
Put another way: is the high powered desktop computer of the future apt to have 1024 cores using one set of memory, or is it apt to have 32 sets of memory, each accessed by 32 cores?
Or still another way: I have a multi-threaded program that runs well on a 4-core machine, using a significant amount of the total CPU. As this program grows in size and does more work, can I be reasonably confident more powerful machines will be available to run it? Or should I be thinking seriously about running multiple sessions on multiple machines (or at any rate multiple sets of memory) to get the work done?
In other words, is a purely multithreaded approach to design going to leave me in a dead end? (As using a single threaded approach and depending on continued improvements in CPU speed years back would have done?) The program is unlikely to be run on a machine costing more than, say $3,000. If that machine cannot do the work, the work won't get done. But if that $3,000 machine is actually a network of 32 independent computers (though they may share the same cooling fan) and I've continued my massively multithreaded approach, the machine will be able to do the work, but the program won't, and I'm going to be in an awkward spot.
Distributed processing looks like a bigger pain than multithreading was, but if that might be in my future, I'd like some warning.
Will the current trend of adding cores to computers continue?
Yes, the GHz race is over. It's not practical to ramp the speed any more on the current technology. Physics has gotten in the way. There may be a dramatic shift in the technology of fabricating chips that allows us to get round this, but it's not obviously 'just around the corner'.
If we can't have faster cores, the only way to get more power is to have more cores.
Or is there some theoretical or practical limit to the number of cores that can be served by one set of memory?
Absolutely there's a limit. In a shared memory system the memory is a shared resource and has a limited amount of bandwidth.
Max processes = (Memory Bandwidth) / (Bandwidth required per process)
Now - that 'Bandwidth per process' figure will be reduced by caches, but caches become less efficient if they have to be coherent with one another because everyone is accessing the same area of memory. (You can't cache a memory write if another CPU may need what you've written)
When you start talking about huge systems, shared resources like this become the main problem. It might be memory bandwidth, CPU cycles, hard drive access, network bandwidth. It comes down to how the system as a whole is structured.
You seem to be really asking for a vision of the future so you can prepare. Here's my take.
I think we're going to see a change in the way software developers see parallelism in their programs. At the moment, I would say that a lot of software developers see the only way of using multiple threads is to have lots of them doing the same thing. The trouble is they're all contesting for the same resources. This then means lots of locking needs to be introduced, which causes performance issues, and subtle bugs which are infuriating and time consuming to solve.
This isn't sustainable.
Manufacturing worked out at the beginning of the 20th Century, the fastest way to build lots of cars wasn't to have lots of people working on one car, and then, when that one's done, move them all on to the next car. It was to split the process of building the car down into lots of small jobs, with the output of one job feeding the next. They called it assembly lines. In hardware design it's called pipe-lining, and I'll think we'll see software designs move to it more and more, as it minimizes the problem of shared resources.
Sure - There's still a shared resource on the output of one stage and the input of the next, but this is only between two threads/processes and is much easier to handle. Standard methods can also be adopted on how these interfaces are made, and message queueing libraries seem to be making big strides here.
There's not one solution for all problems though. This type of pipe-line works great for high throughput applications that can absorb some latency. If you can't live with the latency, you have no option but to go the 'many workers on a single task' route. Those are the ones you ideally want to be throwing at SIMD machines/Array processors like GPUs, but it only really excels with a certain type of problem. Those problems are ones where there's lots of data to process in the same way, and there's very little or no dependency between data items.
Having a good grasp of message queuing techniques and similar for pipelined systems, and utilising fine grained parallelism on GPUs through libraries such as OpenCL, will give you insight at both ends of the spectrum.
Update: Multi-threaded code may run on clustered machines, so this issue may not be as critical as I thought.
I was carefully checking out the Java Memory Model in the JLS, chapter 17, and found it does not mirror the typical register-cache-main memory model of most computers. There were opportunities there for a multi-memory machine to cleanly shift data from one memory to another (and from one thread running on one machine to another running on a different one). So I started searching for JVMs that would run across multiple machines. I found several old references--the idea has been out there, but not followed through. However, one company, Terracotta, seems to have something, if I'm reading their PR right.
At any rate, it rather seems that when PC's typically contain several clustered machines, there's likely to be a multi-machine JVM for them.
I could find nothing outside the Java world, but Microsoft's CLR ought to provide the same opportunities. C and C++ and all the other .exe languages might be more difficult. However, Terracotta's websites talk more about linking JVM's rather than one JVM on multiple machines, so their tricks might work for executable langauges also (and maybe the CLR, if needed).

What to avoid for performance reasons in multithreaded code?

I'm currently reviewing/refactoring a multithreaded application which is supposed to be multithreaded in order to be able to use all the available cores and theoretically deliver a better / superior performance (superior is the commercial term for better :P)
What are the things I should be aware when programming multithreaded applications?
I mean things that will greatly impact performance, maybe even to the point where you don't gain anything with multithreading at all but lose a lot by design complexity. What are the big red flags for multithreading applications?
Should I start questioning the locks and looking to a lock-free strategy or are there other points more important that should light a warning light?
Edit: The kind of answers I'd like are similar to the answer by Janusz, I want red warnings to look up in code, I know the application doesn't perform as well as it should, I need to know where to start looking, what should worry me and where should I put my efforts. I know it's kind of a general question but I can't post the entire program and if I could choose one section of code then I wouldn't be needing to ask in the first place.
I'm using Delphi 7, although the application will be ported / remake in .NET (c#) for the next year so I'd rather hear comments that are applicable as a general practice, and if they must be specific to either one of those languages
One thing to definitely avoid is lots of write access to the same cache lines from threads.
For example: If you use a counter variable to count the number of items processed by all threads, this will really hurt performance because the CPU cache lines have to synchronize whenever the other CPU writes to the variable.
One thing that decreases performance is having two threads with much hard drive access. The hard drive would jump from providing data for one thread to the other and both threads would wait for the disk all the time.
Something to keep in mind when locking: lock for as short a time as possible. For example, instead of this:
lock(syncObject)
{
bool value = askSomeSharedResourceForSomeValue();
if (value)
DoSomethingIfTrue();
else
DoSomtehingIfFalse();
}
Do this (if possible):
bool value = false;
lock(syncObject)
{
value = askSomeSharedResourceForSomeValue();
}
if (value)
DoSomethingIfTrue();
else
DoSomtehingIfFalse();
Of course, this example only works if DoSomethingIfTrue() and DoSomethingIfFalse() don't require synchronization, but it illustrates this point: locking for as short a time as possible, while maybe not always improving your performance, will improve the safety of your code in that it reduces surface area for synchronization problems.
And in certain cases, it will improve performance. Staying locked for long lengths of time means that other threads waiting for access to some resource are going to be waiting longer.
More threads then there are cores, typically means that the program is not performing optimally.
So a program which spawns loads of threads usually is not designed in the best fashion. A good example of this practice are the classic Socket examples where every incoming connection got it's own thread to handle of the connection. It is a very non scalable way to do things. The more threads there are, the more time the OS will have to use for context switching between threads.
You should first be familiar with Amdahl's law.
If you are using Java, I recommend the book Java Concurrency in Practice; however, most of its help is specific to the Java language (Java 5 or later).
In general, reducing the amount of shared memory increases the amount of parallelism possible, and for performance that should be a major consideration.
Threading with GUI's is another thing to be aware of, but it looks like it is not relevant for this particular problem.
What kills performance is when two or more threads share the same resources. This could be an object that both use, or a file that both use, a network both use or a processor that both use. You cannot avoid these dependencies on shared resources but if possible, try to avoid sharing resources.
Run-time profilers may not work well with a multi-threaded application. Still, anything that makes a single-threaded application slow will also make a multi-threaded application slow. It may be an idea to run your application as a single-threaded application, and use a profiler, to find out where its performance hotspots (bottlenecks) are.
When it's running as a multi-threaded aplication, you can use the system's performance-monitoring tool to see whether locks are a problem. Assuming that your threads would lock instead of busy-wait, then having 100% CPU for several threads is a sign that locking isn't a problem. Conversely, something that looks like 50% total CPU utilitization on a dual-processor machine is a sign that only one thread is running, and so maybe your locking is a problem that's preventing more than one concurrent thread (when counting the number of CPUs in your machine, beware multi-core and hyperthreading).
Locks aren't only in your code but also in the APIs you use: e.g. the heap manager (whenever you allocate and delete memory), maybe in your logger implementation, maybe in some of the O/S APIs, etc.
Should I start questioning the locks and looking to a lock-free strategy
I always question the locks, but have never used a lock-free strategy; instead my ambition is to use locks where necessary, so that it's always threadsafe but will never deadlock, and to ensure that locks are acquired for a tiny amount of time (e.g. for no more than the amount of time it takes to push or pop a pointer on a thread-safe queue), so that the maximum amount of time that a thread may be blocked is insignificant compared to the time it spends doing useful work.
You don't mention the language you're using, so I'll make a general statement on locking. Locking is fairly expensive, especially the naive locking that is native to many languages. In many cases you are reading a shared variable (as opposed to writing). Reading is threadsafe as long as it is not taking place simultaneously with a write. However, you still have to lock it down. The most naive form of this locking is to treat the read and the write as the same type of operation, restricting access to the shared variable from other reads as well as writes. A read/writer lock can dramatically improve performance. One writer, infinite readers. On an app I've worked on, I saw a 35% performance improvement when switching to this construct. If you are working in .NET, the correct lock is the ReaderWriterLockSlim.
I recommend looking into running multiple processes rather than multiple threads within the same process, if it is a server application.
The benefit of dividing the work between several processes on one machine is that it is easy to increase the number of servers when more performance is needed than a single server can deliver.
You also reduce the risks involved with complex multithreaded applications where deadlocks, bottlenecks etc reduce the total performance.
There are commercial frameworks that simplifies server software development when it comes to load balancing and distributed queue processing, but developing your own load sharing infrastructure is not that complicated compared with what you will encounter in general in a multi-threaded application.
I'm using Delphi 7
You might be using COM objects, then, explicitly or implicitly; if you are, COM objects have their own complications and restrictions on threading: Processes, Threads, and Apartments.
You should first get a tool to monitor threads specific to your language, framework and IDE. Your own logger might do fine too (Resume Time, Sleep Time + Duration). From there you can check for bad performing threads that don't execute much or are waiting too long for something to happen, you might want to make the event they are waiting for to occur as early as possible.
As you want to use both cores you should check the usage of the cores with a tool that can graph the processor usage on both cores for your application only, or just make sure your computer is as idle as possible.
Besides that you should profile your application just to make sure that the things performed within the threads are efficient, but watch out for premature optimization. No sense to optimize your multiprocessing if the threads themselves are performing bad.
Looking for a lock-free strategy can help a lot, but it is not always possible to get your application to perform in a lock-free way.
Threads don't equal performance, always.
Things are a lot better in certain operating systems as opposed to others, but if you can have something sleep or relinquish its time until it's signaled...or not start a new process for virtually everything, you're saving yourself from bogging the application down in context switching.

How can you insure your code runs with no variability in execution time due to cache?

In an embedded application (written in C, on a 32-bit processor) with hard real-time constraints, the execution time of critical code (specially interrupts) needs to be constant.
How do you insure that time variability is not introduced in the execution of the code, specifically due to the processor's caches (be it L1, L2 or L3)?
Note that we are concerned with cache behavior due to the huge effect it has on execution speed (sometimes more than 100:1 vs. accessing RAM). Variability introduced due to specific processor architecture are nowhere near the magnitude of cache.
If you can get your hands on the hardware, or work with someone who can, you can turn off the cache. Some CPUs have a pin that, if wired to ground instead of power (or maybe the other way), will disable all internal caches. That will give predictability but not speed!
Failing that, maybe in certain places in the software code could be written to deliberately fill the cache with junk, so whatever happens next can be guaranteed to be a cache miss. Done right, that can give predictability, and perhaps could be done only in certain places so speed may be better than totally disabling caches.
Finally, if speed does matter - carefully design the software and data as if in the old day of programming for an ancient 8-bit CPU - keep it small enough for it all to fit in L1 cache. I'm always amazed at how on-board caches these days are bigger than all of RAM on a minicomputer back in (mumble-decade). But this will be hard work and takes cleverness. Good luck!
Two possibilities:
Disable the cache entirely. The application will run slower, but without any variability.
Pre-load the code in the cache and "lock it in". Most processors provide a mechanism to do this.
It seems that you are referring to x86 processor family that is not built with real-time systems in mind, so there is no real guarantee for constant time execution (CPU may reorder micro-instructions, than there is branch prediction and instruction prefetch queue which is flushed each time when CPU wrongly predicts conditional jumps...)
This answer will sound snide, but it is intended to make you think:
Only run the code once.
The reason I say that is because so much will make it variable and you might not even have control over it. And what is your definition of time? Suppose the operating system decides to put your process in the wait queue.
Next you have unpredictability due to cache performance, memory latency, disk I/O, and so on. These all boil down to one thing; sometimes it takes time to get the information into the processor where your code can use it. Including the time it takes to fetch/decode your code itself.
Also, how much variance is acceptable to you? It could be that you're okay with 40 milliseconds, or you're okay with 10 nanoseconds.
Depending on the application domain you can even further just mask over or hide the variance. Computer graphics people have been rendering to off screen buffers for years to hide variance in the time to rendering each frame.
The traditional solutions just remove as many known variable rate things as possible. Load files into RAM, warm up the cache and avoid IO.
If you make all the function calls in the critical code 'inline', and minimize the number of variables you have, so that you can let them have the 'register' type.
This should improve the running time of your program. (You probably have to compile it in a special way since compilers these days tend to disregard your 'register' tags)
I'm assuming that you have enough memory not to cause page faults when you try to load something from memory. The page faults can take a lot of time.
You could also take a look at the generated assembly code, to see if there are lots of branches and memory instuctions that could change your running code.
If an interrupt happens in your code execution it WILL take longer time. Do you have interrupts/exceptions enabled?
Understand your worst case runtime for complex operations and use timers.

Resources