Cache miss in an Out of order processor

Cache miss in an Out of order processor - caching

Imagine an application is running on an Out of order processor and it has a lot of last level cache(LLC) misses (more than 70%). Do you think that if we decrease the frequency of the processor and set it to a smaller value then the execution time of the application will increase in a big way or doesn't effect much? Could you please explain your answer
Thanks and regards

As is the case with most micro-architectural features, the safe answer would be - "it might, and it might not - depends on the exact characteristics of your application".
Take for e.g. a workload that runs through a large graph that resides in the memory - each new node needs to be fetched and processed before the new node can be chosen. If you reduce the frequency you would harm the execution phase which is latency critical since the next set of memory accesses depends on it.
On the other hand, a workload that is bandwidth-bounded (i.e.- performs as fast as the system memory BW limits), is probably not fully utilizing the CPU and would therefore not be harmed as much.
Basically the question boils down to how well your application utilizes the CPU, or rather - where between the CPU and the memory, can you find the performance bottleneck.
By the way, note that even if reducing the frequency does impact the execution time, it could still be beneficial for you power/performance ratio, depends where along the power/perf curve you're located and on the exact values.

Related

maxed CPU performance - stack all tasks or aim for less than 100%?

I have 12 tasks to run on an octo-core machine. All tasks are CPU intensive and each will max out a core.
Is there a theoretical reason to avoid stacking tasks on a maxed out core (such as overhead, swapping across tasks) or is it faster to queue everything?

Task switching is a waste of CPU time. Avoid it if you can.
Whatever the scheduler timeslice is set to, the CPU will waste its time every time slice by going into the kernel, saving all the registers, swapping the memory mappings and starting the next task. Then it has to load in all its CPU cache, etc.
Much more efficient to just run one task at a time.
Things are different of course if the tasks use I/O and aren't purely compute bound.

Yes it's called queueing theory https://en.wikipedia.org/wiki/Queueing_theory. There are many different models https://en.wikipedia.org/wiki/Category:Queueing_theory for a range of different problems I'd suggest you scan them and pick the one most applicable to your workload then go and read up on how to avoid the worst outcomes for that model, or pick a different, better, model for dispatching your workload.
Although the graph at this link https://commons.wikimedia.org/wiki/File:StochasticQueueingQueueLength.png applies to Traffic it will give you an idea of what is happening to response times as your CPU utilisation increases. It shows that you'll reach an inflection point after which things get slower and slower.
More work is arriving than can be processed with subsequent work waiting longer and longer until it can be dispatched.
The more cores you have the further to the right you push the inflection point but the faster things go bad after you reach it.
I would also note that unless you've got some really serious cooling in place you are going to cook your CPU. Depending on it's design it will either slow itself down, making your problem worse, or you'll trigger it's thermal overload protection.
So a simplistic design for 8 cores would be, 1 thread to manage things and add tasks to the work queue and 7 threads that are pulling tasks from the work queue. If the tasks need to be performed within a certain time you can add a TimeToLive value so that they can be discarded rather than executed needlessly. As you are almost certainly running your application in an OS that uses a pre-emptive threading model consider things like using processor affinity where possible because as #Zan-Lynx says task/context switching hurts. Be careful not to try to build your OS'es thread management again as you'll probably wind up in conflict with it.

tl;dr: cache thrash is Bad
You have a dozen tasks. Each will have to do a certain amount of work.
At an app level they each processed a thousand customer records or whatever. That is fixed, it is a constant no matter what happens on the hardware.
At the the language level, again it is fixed, C++, java, or python will execute a fixed number of app instructions or bytecodes. We'll gloss over gc overhead here, and page fault and scheduling details.
At the assembly level, again it is fixed, some number of x86 instructions will execute as the app continues to issue new instructions.
But you don't care about how many instructions, you only care about how long it takes to execute those instructions. Many of the instructions are reads which MOV a value from RAM to a register. Think about how long that will take. Your computer has several components to implement the memory hierarchy - which ones will be involved? Will that read hit in L1 cache? In L2? Will it be a miss in last-level cache so you wait (for tens or hundreds of cycles) until RAM delivers that cache line? Did the virtual memory reference miss in RAM, so you wait (for milliseconds) until SSD or Winchester storage can page in the needed frame? You think of your app as issuing N reads, but you might more productively think of it as issuing 0.2 * N cache misses. Running at a different multi-programming level, where you issue 0.3 * N cache misses, could make elapsed time quite noticeably longer.
Every workload is different, and can place larger or smaller demands on memory storage. But every level of the memory hierarchy depends on caching to some extent, and higher multi-programming levels are guaranteed to impact cache hit rates. There are network- and I/O-heavy workloads where very high multi-programming levels absolutely make sense. But for CPU- and memory-intensive workloads, when you benchmark elapsed times you may find that less is more.

Upper bound on speedup

My MPI experience showed that the speedup as does not increase linearly with the number of nodes we use (because of the costs of communication). My experience is similar to this:.
Today a speaker said: "Magically (smiles), in some occasions we can get more speedup than the ideal one!".
He meant that ideally, when we use 4 nodes, we would get a speedup of 4. But in some occasions we can get a speedup greater than 4, with 4 nodes! The topic was related to MPI.
Is this true? If so, can anyone provide a simple example on that? Or maybe he was thinking about adding multithreading to the application (he went out of time and then had to leave ASAP, thus we could not discuss)?

Parallel efficiency (speed-up / number of parallel execution units) over unity is not at all uncommon.
The main reason for that is the total cache size available to the parallel program. With more CPUs (or cores), one has access to more cache memory. At some point, a large portion of the data fits inside the cache and this speeds up the computation considerably. Another way to look at it is that the more CPUs/cores you use, the smaller the portion of the data each one gets, until that portion could actually fit inside the cache of the individual CPU. This is sooner or later cancelled by the communication overhead though.
Also, your data shows the speed-up compared to the execution on a single node. Using OpenMP could remove some of the overhead when using MPI for intranode data exchange and therefore result in better speed-up compared to the pure MPI code.
The problem comes from the incorrectly used term ideal speed-up. Ideally, one would account for cache effects. I would rather use linear instead.

Not too sure this is on-topic here, but here goes nothing...
This super-linearity in speed-up can typically occur when you parallelise your code while distributing the data in memory with MPI. In some cases, by distributing the data across several nodes / processes, you end-up having sufficiently small chunks of data to deal with for each individual process that it fits in the cache of the processor. This cache effect might have a huge impact on the code's performance, leading to great speed-ups and compensating for the increased need of MPI communications... This can be observed in many situations, but this isn't something you can really count for for compensating a poor scalability.
Another case where you can observe this sort of super-linear scalability is when you have an algorithm where you distribute the task of finding a specific element in a large collection: by distributing your work, you can end up in one of the processes/threads finding almost immediately the results, just because it happened to be given range of indexes starting very close to the answer. But this case is even less reliable than the aforementioned cache effect.
Hope that gives you a flavour of what super-linearity is.

Cache has been mentioned, but it's not the only possible reason. For instance you could imagine a parallel program which does not have sufficient memory to store all its data structures at low node counts, but foes at high. Thus at low node counts the programmer may have been forced to write intermediate values to disk and then read them back in again, or alternatively re-calculate the data when required. However at high node counts these games are no longer required and the program can store all its data in memory. Thus super-linear speed-up is a possibility because at higher node counts the code is just doing less work by using the extra memory to avoid I/O or calculations.
Really this is the same as the cache effects noted in the other answers, using extra resources as they become available. And this is really the trick - more nodes doesn't just mean more cores, it also means more of all your resources, so as speed up really measures your core use if you can also use those other extra resources to good effect you can achieve super-linear speed up.

Which are the common causes for non scalability of shared memory programs?

Whenever someone paralelizes an application the expected outcome is a decent speedup, but is not always the case.
It is very usual that a program that runs in x seconds, parallelized to use 8 cores will not achieve x/8 seconds (optimal speedup). In some extreme cases, it even takes more time than the original sequential program.
Why? and most importantly, how do I improve scalability?

There are a few common causes of non scalability:
Too much synchronization: Some problems (and sometimes too much conservative programmers) require lots of synchronization between parallel tasks, this eliminates most of the parallelism in the algorithm, making it slower.
1.1. Make sure to use the minimum synchronization possible for your algorithm. With openmp for instance, a simple change from synchronized to atomic can result in a relevant difference.
1.2 Sometimes a worse sequential algorithm might offer better parallelism opportunities, if you have the chance to try something else it might be worth the shot.
Memory bandwidth limitation: it is very common that the most "trivial" implementation of an algorithm is not optimized for locality, which implies heavy communication costs between the processors and the main memory.
2.1 Optimize for locality: this means get to know where your application will run, what are the available cache memories and how to change your data structures to maximize cache usage.
Too much parallelization overhead: sometimes the parallel task is so "small" that the overhead for thread/process creation is too big compared to the parallel region total time, which causes a poor speedup or even speed-down.

All of RSFalcon7's suggestions can be combined into a "super rule": do as much as possible in unshared resources (L1 & L2 caches) - implying economizing on code and data requirements - and if you need to go to shared resources do as much as possible in L3 before going to RAM before using synchronization (the CPU cycles required to synchronize is variable but is slower - or much slower - than accessing RAM) before going to disks.
If you plan to utilize hyperthreading I have found that code compiled with gcc will utilize hyperthreading better with optimization level O1 than with, say, O2 or O3.

Is there a technique to predict performance impact of application

A customer is running a clustered web application server under considerable load. He wants to know if the upcoming application, which is not implemented yet, will still be manageable by his current setup.
Is there a established method to predict the performance impact of application in concept state, based on an existing requirement specification (or maybe a functional design specification).
First priority would be to predict the impact on CPU resource.
Is it possible to get fairly exact results at all?

I'd say the canonical answer is no. You always have to benchmark the actual application being deployed on its target architecture.
Why? Software and software development are not predictable. And systems are even more unpredictable.
Even if you know the requirements now and have done deep analysis what happens if:
The program has a performance bug (or two...) - which might even be a bug in a third-party library
New requirements are added or requirements change
The analysis and design don't spot all the hidden inter-relationships between components
There are non-linear effects of adding load and the new load might take the hardware over a critical threshold (a threshold that is not obvious now).
These concerns are not theoretical. If they were, SW development would be trivial and projects would always be delivered on time and to budget.
However there are some heuristics I personally used that you can apply. First you need a really good understanding of the current system:
Break the existing system's functions down into small, medium and large and benchmark those on your hardware
Perform a load test of these individual functions and capture thoughput in transactions/sec, CPU cost, network traffic and disk I/O figures for as many of these transactions as possible, making sure you have representation of small, medium and large. This load test should take the system up to the point where additional load will decrease transactions/sec
Get the figures for the max transactions/sec of the current system
Understand the rate of growth of this application and plan accordingly
Perform the analysis to get an 'average' small, medium and large 'cost' in terms of CPU, RAM, disk and network. This would be of the form:
Small transaction
CPU utilization: 10ms
RAM overhead 5MB (cache)
RAM working: 100kb (eg 10 concurrent threads = 1MB, 100 threads = 10MB)
Disk I/O: 5kb (database)
Network app<->DB: 10kb
Network app<->browser: 40kb
From this analysis you should understand how much headroom you have - CPU certainly, but check that there is sufficient RAM, network and disk capacity. Eg, the CPU required for small transactions is number of small transactions per second multiplied by the CPU cost of a small transaction. Add in the CPU cost of medium transactions and large ones, and you have your CPU budget.
Make sure the DBAs are involved. They need to do the same on the DB.
Now you need to analyse your upcoming application:
Assign each features into the same small, medium and large buckets, ensuring a like-for-like matching as far as possible
Ask deep, probing questions about how many transactions/sec each feature will experience at peak
Talk about the expected rate of growth of the application
Don't forget that the system may slow as the size of the database increases
On a personal note, you are being asked to predict the unpredictable - putting your name and reputation on the line. If you say it can fit, you are owning the risk for a large software development project. If you are being pressured to say yes, you need to ensure that there are many other people's names involved along with yours - and those names should all be visible on the go/no-go decision. Not only is this more likely to ensure that all factors are considered, and that the analysis is sound, but it will also ensure that the project has many involved individuals personally aligned to its success.

Software performance (MCPS and Power consumed) in a Embedded system

Assume an embedded environment which has either a DSP core(any other processor core).
If i have a code for some application/functionality which is optimized to be one of the best from point of view of Cycles consumed(MCPS) , will it also be a code, best from the point of view of Power consumed by that code in a real hardware system?
Can a code optimized for least MCPS be guaranteed to have least power consumption as well?
I know there are many aspects to be considered here like the architecture of the underlying processor and the hardware system(memory, bus, etc..).

Very difficult to tell without putting a sensitive ammeter between your board and power supply and logging the current drawn. My approach is to test assumptions for various real world scenarios rather than go with the supporting documentation.

No, lowest cycle count will not guarantee lowest power consumption.
It's a good indication, but you didn't take into account that memory bus activity consumes quite a lot of power as well.
Your code may for example have a higher cycle count but lower power consumption if you move often needed data into internal memory (on chip ram). That won't increase the cycle-count of your algorithms but moving the data in- and out the internal memory increases cycle-count.
If your system has a cache as well as internal memory, optimize for best cache utilization as well.

This isn't a direct answer, but I thought this paper (from this answer) was interesting: Real-Time Task Scheduling for Energy-Aware Embedded Systems.
As I understand it, it trying to run each task under the processor's low power state, unless it can't meet the deadline without high power. So in a scheme like that, more time efficient code (less cycles) should allow the processor to spend more time throttled back.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio