Confused about processing power and run time [closed] - performance

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center.
Closed 9 years ago.
I have been doing scientific computing on C, Python and Matlab. When I run a piece of code on a desktop PC, it might take hours to complete. However, during this time, less than 100% of CPU and less than 100% of memory is used.
Where is the bottleneck then? Naive question: Why can't the PC throw more processing power at the algorithm to make it run faster?
Edit
In particular, I am currently running a vectorized loop (that does not do any I/O) in Matlab that has been going on for 2 hours and task manager says 40-38% CPU usage (and 28% memory) all these time. Why doesn't the PC use 90% CPU instead and does this faster?

Are you doing any I/O? Are you running any other processes?
At any instant, the computer is either running your program (100%) or something else (0%), so what you see is a time-average.
If you do any I/O, your program has to wait while it happens, and that comes off of the 100%.
As far as memory, your program uses what it uses, which may or may not be all the RAM available.
BTW, just because it's using 100% of the CPU doesn't mean it's being fast. Your Python and Matlab code is likely to require 10-100 cycles to do the same thing C does in one cycle, simply because those languages are interpreted and/or do a lot more memory management.

Try changing the process's priority (or processes if there is more than one used). Depending upon the OS, you can practically cause the entire CPU to be dedicated to your process. Just be sure you understand the implications. I have done this for testing.

Now majority of CPUs feature multiple cores and the program may run faster if executed on multiple threads. However the languages you list do not support multi-threading very easily (while in C you may try with pthreads or MPI).
A fast solution may be simply run two or three instances of your program at the same time, if you need to try different input data or algorithm versions, for instance. It seems also you have enough memory for this.

Related

(Un)Limit CPU usage - Windows - Mathematica

I'm programming in Mathematica 8.
When I run my programme, I check with Win8 task manager that the CPU usage is at 35% as soon as it starts to run, and my memory usage also increases to 44%. Does Win8 limit the amount of CPU usage that a certain programme may have? I need to make my computer to use more of its resources to run the programme faster.
Any help would be appreciated.
What's happening here is a common misconception about how processors approach problems involving heavy computation.
Although you may indeed have a powerful 4-core processing machine, and you have a program which is capable of using all 4 processing cores (which mathematica definitely is!), unless the code is written in a parallel fashion, you will only be able to use 1 core at a time to do the calculations. As Mysticial mentioned in the comment, not all code is parallelizable, in fact, I'd say a great many problems are not inherently able to be parallelized.
Check here for some good examples of problems that can be split up in a parallel fashion well. Now, your memory usage will simply increase with the size of the data that's being stored temporarily. (ex: storing a 69X69 matrix takes up less memory (RAM) than a 4000X4000, being parallel has little to do with this, and more with the problem itself).
Anyway, tl;dr, your computer is acting normally. To use all 100% of that 4-core machine you're using, check out This mathematica reference guide to parallel operations.

Performance Gain with Specifically Built OS [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 9 years ago.
Let's take a trivial CPU bound program, such as brute forcing prime numbers, which perhaps occasionally saves them to an SD card.
Inefficiencies in today's programs include interpretation and virtual machines etc. So in the interest of speed, let's throw those away, and use a compiled language.
Now while we now have code that can run directly on the processor, we still have the operating system, which will multiplex between different processes, run its own code, manage memory and do other things that will slow down the execution of our program.
If we were to write our own operating system which solely runs our program, what factor of speedup could we expect to see?
I'm sure there might be a number of variables, so please elaborate if you want.
Take a look at products by Return Infinity http://www.returninfinity.com/ (I'm not affiliated in any way), and experiment.
My own supercomputing experience demonstrates, that skipping the TLB (almost entirely), by running a flat memory model, combined with lack of context switching between kernel and userland, can and does accelerate some tasks - especially those related to message passing in networking (MAC level, not even TCP, why bother), as well as brute force computation (due to lack of memory management).
On brute-force computation that exceeds the TLB or cache size, you can expect approx 5-15% performance gain compared to having to do RAM-based translation table lookups - the penalty is that each software error is entirely unguarded (you can lock some pages statically with monolithic linking, thou).
On high-bandwidth work, especially with a lot of small message-passing, you can easily obtain even 500% acceleration by going kernel-space, either by completely removing the (multi-tasking) OS, or by loading your application as a kernel driver, circumventing the entire abstraction as well. We've been able to push the network latency on MAC-layer pings from 18us down to 1.3us.
On computation that does fit inside L1 cache, I'd expect minimal improvement (around 1%).
Does it all matter? Yes and no. If your hardware costs vastly exceed your engineering costs, and you have done all the algorithmic improvements you can think of (better yet, proved that the computation done is exactly the computation required for the result!) - this can give meaningful perfomance benefits. Extra 3% (overall average success) on a supercomputer costing approx $8M/y in electricity, not including hardware amortization, is worth $24k/y. Enough to pay an engineer for a month to optimize the most common task it runs :).
Assuming you're running a decent machine and the OS is not doing much else: Not a large factor, I'd expect less than a 10% improvement.
Just the OS 'idling' doesn't (shouldn't) take up much of the processing power of the CPU. If it is, you need a better machine, a better OS, a format or some combination of these.
If, on the other hand, you're running a bunch of other resource-intensive things, obviously expect that this can be sped up a lot by just not running those other things.
If you're not a super-user, you may be surprised to find that there are a ton of (non-OS) processes running in the background, these are more likely to take up CPU processing power that the OS.
Slightly off topic but related, keep in mind that, if you're running 8 cores, you can, in a perfect world, speed up the process by 8x by multi-threading.
Expect a way bigger improvement from known solutions to problems and making better use of data structures and algorithms, and, to a lesser extent, the choice of language and micro-optimizations.
From my experience:
Not the most scientific or trustable result but, most of the time when I open up Task Manager on Windows, all the OS processes are below 1% of the CPU.
There is a super-computer answer, and a multi-cores answer already, so here is the GPGPU answer.
When a super-computer is overkill, but a multi-core CPU is under-powered, and your algorithm is sensibly parallelizable, consider adapting it to a GPGPU. Many of the benefits of a super-computer solution are available, in reduced form at reduced cost, by performing CPU-intensive tasks on a GPGPU.
Here is a link to an analysis I performed last year on implementing, and tuning, a brute-force solution to the Travelling Salesman Problem using a compute capability 2.0 NVIDIA Graphics card, CUDAfy, and C#.

"Well-parallelized" algorithm not sped up by multiple threads

I'm sorry to ask a question one a topic that I know so little about, but this idea has really been bugging me and I haven't been able to find any answers on the internet.
Background:
I was talking to one of my friends who is in computer science research. I'm in mostly ad-hoc development, so my understanding of a majority of CS concepts is at a functional level (I know how to use them rather than how they work). He was saying that converting a "well-parallelized" algorithm that had been running on a single thread into one that ran on multiple threads didn't result in the processing speed increase that he was expecting.
Reasoning:
I asked him what the architecture of the computer he was running this algorithm on was, and he said 16-core (non-virtualized). According to what I know about multi-core processors, the processing speed increase of an algorithm running on multiple cores should be roughly proportional to how well it is parallelized.
Question:
How can an algorithm that is "well-parallelized" and programmed correctly to run on a true multi-core processor not run several times more quickly? Is there some information that I'm missing here, or is it more likely a problem with the implementation?
Other stuff: I asked if the threads were possibly taking up more power than any individual core had available and apparently each core runs at 3.4 GHz. This is much more than the algorithm should need, and when diagnostics are run the cores aren't maxed out during runtime.
It is likely sharing something. What is being shared may not be obvious.
One of the most common non-obvious shared resources is CPU cache. If the threads are updating the same cache line that cache line has to bounce between CPUs, slowing everything down.
That can happen because of accessing (even read-only) variables which are near to each other in memory. If all accesses are read-only it is OK, but if even one CPU is writing to that cache line it will force a bounce.
A brute-force method of fixing this is to put shared variables into structures that look like:
struct var_struct {
int value;
char padding[128];
};
Instead of hard-coding 128 you could research what system parameter or preprocessor macros define the cache-line size for your system type.
Another place that sharing can take place is inside system calls. Even seemingly innocent functions might be taking global locks. I seem to recall reading about Linux fixing an issue like this a while back with locks on the functions that return process and thread identifiers and parent identifiers.
Performance versus number of cores is often a S-like curve - first it obviously increases but as locking, shared cache and the like take they debt the further cores do not add so much and even may degrade. Hence nothing mysterious. If we would know more details about the algorithm it may be possible to find an idea to speed it up.

Are there practical limits to the number of cores accessing the same memory?

Will the current trend of adding cores to computers continue? Or is there some theoretical or practical limit to the number of cores that can be served by one set of memory?
Put another way: is the high powered desktop computer of the future apt to have 1024 cores using one set of memory, or is it apt to have 32 sets of memory, each accessed by 32 cores?
Or still another way: I have a multi-threaded program that runs well on a 4-core machine, using a significant amount of the total CPU. As this program grows in size and does more work, can I be reasonably confident more powerful machines will be available to run it? Or should I be thinking seriously about running multiple sessions on multiple machines (or at any rate multiple sets of memory) to get the work done?
In other words, is a purely multithreaded approach to design going to leave me in a dead end? (As using a single threaded approach and depending on continued improvements in CPU speed years back would have done?) The program is unlikely to be run on a machine costing more than, say $3,000. If that machine cannot do the work, the work won't get done. But if that $3,000 machine is actually a network of 32 independent computers (though they may share the same cooling fan) and I've continued my massively multithreaded approach, the machine will be able to do the work, but the program won't, and I'm going to be in an awkward spot.
Distributed processing looks like a bigger pain than multithreading was, but if that might be in my future, I'd like some warning.
Will the current trend of adding cores to computers continue?
Yes, the GHz race is over. It's not practical to ramp the speed any more on the current technology. Physics has gotten in the way. There may be a dramatic shift in the technology of fabricating chips that allows us to get round this, but it's not obviously 'just around the corner'.
If we can't have faster cores, the only way to get more power is to have more cores.
Or is there some theoretical or practical limit to the number of cores that can be served by one set of memory?
Absolutely there's a limit. In a shared memory system the memory is a shared resource and has a limited amount of bandwidth.
Max processes = (Memory Bandwidth) / (Bandwidth required per process)
Now - that 'Bandwidth per process' figure will be reduced by caches, but caches become less efficient if they have to be coherent with one another because everyone is accessing the same area of memory. (You can't cache a memory write if another CPU may need what you've written)
When you start talking about huge systems, shared resources like this become the main problem. It might be memory bandwidth, CPU cycles, hard drive access, network bandwidth. It comes down to how the system as a whole is structured.
You seem to be really asking for a vision of the future so you can prepare. Here's my take.
I think we're going to see a change in the way software developers see parallelism in their programs. At the moment, I would say that a lot of software developers see the only way of using multiple threads is to have lots of them doing the same thing. The trouble is they're all contesting for the same resources. This then means lots of locking needs to be introduced, which causes performance issues, and subtle bugs which are infuriating and time consuming to solve.
This isn't sustainable.
Manufacturing worked out at the beginning of the 20th Century, the fastest way to build lots of cars wasn't to have lots of people working on one car, and then, when that one's done, move them all on to the next car. It was to split the process of building the car down into lots of small jobs, with the output of one job feeding the next. They called it assembly lines. In hardware design it's called pipe-lining, and I'll think we'll see software designs move to it more and more, as it minimizes the problem of shared resources.
Sure - There's still a shared resource on the output of one stage and the input of the next, but this is only between two threads/processes and is much easier to handle. Standard methods can also be adopted on how these interfaces are made, and message queueing libraries seem to be making big strides here.
There's not one solution for all problems though. This type of pipe-line works great for high throughput applications that can absorb some latency. If you can't live with the latency, you have no option but to go the 'many workers on a single task' route. Those are the ones you ideally want to be throwing at SIMD machines/Array processors like GPUs, but it only really excels with a certain type of problem. Those problems are ones where there's lots of data to process in the same way, and there's very little or no dependency between data items.
Having a good grasp of message queuing techniques and similar for pipelined systems, and utilising fine grained parallelism on GPUs through libraries such as OpenCL, will give you insight at both ends of the spectrum.
Update: Multi-threaded code may run on clustered machines, so this issue may not be as critical as I thought.
I was carefully checking out the Java Memory Model in the JLS, chapter 17, and found it does not mirror the typical register-cache-main memory model of most computers. There were opportunities there for a multi-memory machine to cleanly shift data from one memory to another (and from one thread running on one machine to another running on a different one). So I started searching for JVMs that would run across multiple machines. I found several old references--the idea has been out there, but not followed through. However, one company, Terracotta, seems to have something, if I'm reading their PR right.
At any rate, it rather seems that when PC's typically contain several clustered machines, there's likely to be a multi-machine JVM for them.
I could find nothing outside the Java world, but Microsoft's CLR ought to provide the same opportunities. C and C++ and all the other .exe languages might be more difficult. However, Terracotta's websites talk more about linking JVM's rather than one JVM on multiple machines, so their tricks might work for executable langauges also (and maybe the CLR, if needed).

cpu vs gpu - when cpu is better [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
I know many examples when GPU is much faster than CPU. But exists algorithms (problems) which are very hard to parallelise. Could you give me some examples or tests when CPU can overcome GPU ?
Edit:
Thanks for suggestions! We can make a comparison between the most popular and the newest cpu's and gpu's, for example Core i5 2500k vs GeForce GTX 560 Ti.
I wonder how to compare SIMD model between them. For example: Cuda calls a SIMD model more precisely SIMT. But SIMT should be compared to the multhitreading on CPU's which is distributing threads (tasks) between MIMD cores (Core i5 2500k give as 4 MIMD cores). On the other hand each of these MIMD cores can implement SIMD model, but this is something else than SIMT and I don't know how to compare them. Finally a fermi architecture with concurrent kernel execution might be consider as MIMD cores with SIMT.
Based on my experience, I will summarize the key differences in terms of performance between parallel programs in CPUs and GPUs. Trust me, a comparison can be changed from generation to generation. So I will just point out what is good and is bad for CPUs and GPUs. Of course, if you make a program at the extreme, i.e., having only bad or good sides, it will run definitely faster on one platform. But a mixture of those requires very complicated reasoning.
Host program level
One key difference is memory transfer cost. GPU devices requires some memory transfers. This cost is non-trivial in some cases, for example when you have to frequently transfer some big arrays. In my experience, this cost can be minimized but pushing most of host code to device code. The only cases you can do so are when you have to interact with the host operating system in program, such as outputting to monitor.
Device program level
Now we come to see a complex picture that hasn't been fully revealed yet. What I mean is there are many mysterious scenes in GPUs that haven't been disclosed. But still, we have a lot of distinguish CPU and GPU (kernel code) in terms of performance.
There are few factors that I noticed those dramatically contribute to the difference.
Workload distribution
GPUs, which consist of many execution units, are designed to handle massively parallel programs. If you have little of work, say a few sequential tasks, and put these tasks on a GPU, only a few of those many execution units are busy, thus will be slower than CPU. Because CPUs are, in other hand, better to handle short and sequential tasks. The reason is simple, CPUs are much more complicated and able to exploit instruction level parallelism, whereas GPUs exploit thread level parallelism. Well, I heard NVIDIA GF104 can do Superscalar, but I had no chance to experience with it though.
It is worth noting that, in GPUs, workload are divided into small blocks (or workgroups in OpenCL), and blocks are arranged in chunks, each of which is executed in one Streaming processor (I am using terminologies from NVIDIA). But in CPUs, those blocks are executed sequentially - I can't think of anything else than a single loop.
Thus, for programs that have small number of blocks, it will be likely to run faster on CPUs.
Control flow instructions
Branches are bad things to GPUs, always. Please bear in mind that GPUs prefer equal things. Equal blocks, equal threads within a blocks, and equal threads within a warp. But what matters the most?
***Branch divergences.***
Cuda/OpenCL programmers hate branch divergences. Since all the threads somehow are divided into sets of 32 threads, called a warp, and all threads within a warp execute in lockstep, a branch divergence will cause some threads in the warp to be serialized. Thus, the execution time of the warp will be accordingly multiplied.
Unlike GPUs, each cores in CPUs can follow their own path. Furthermore, branches can be efficiently executed because CPUs have branch prediction.
Thus, programs that have more warp divergences are likely to run faster on CPUs.
Memory access instructions
This REALLY is complicated enough so let's make it brief.
Remember that global memory accesses have very high latency (400-800 cycles). So in old generations of GPUs, whether memory accesses are coalesced was a critical matter. Now your GTX560 (Fermi) has more 2 level of caches. So global memory access cost can be reduced in many cases. However, caches in CPUs and GPUs are different, so their effects are also different.
What I can say is that it really really depends on your memory access pattern, your kernel code pattern (how memory accesses are interleaved with computation, the types of operations, etc., ) to tell if one runs faster on GPUs or CPUs.
But somehow you can expect a huge number of cache misses (in GPUs) has a very bad effect on GPUs (how bad? - it depends on your code).
Additionally, shared memory is an important feature of GPUs. Accessing to shared memory is as fast as accessing to GPU L1 cache. So kernels that make use of shared memory will have pretty much benefit.
Some other factors I haven't really mentioned but those can have big impact on the performance in many cases such as bank conflicts, size of memory transaction, GPU occupancy...

Resources