Let's take a trivial CPU-bound program, such as brute-forcing prime numbers, which perhaps occasionally saves them to an SD card.
Inefficiencies in today's programs include interpretation, virtual machines, and so on. In the interest of speed, let's throw those away and use a compiled language.
While we now have code that can run directly on the processor, we still have the operating system, which will multiplex between different processes, run its own code, manage memory and do other things that slow down the execution of our program.
If we were to write our own operating system which solely runs our program, what factor of speedup could we expect to see?
I'm sure there might be a number of variables, so please elaborate if you want.
Take a look at products by Return Infinity http://www.returninfinity.com/ (I'm not affiliated in any way), and experiment.
My own supercomputing experience demonstrates that skipping the TLB (almost entirely) by running a flat memory model, combined with the lack of context switching between kernel and userland, can and does accelerate some tasks - especially those related to message passing in networking (at the MAC level, not even TCP, why bother), as well as brute-force computation (due to the lack of memory management).
On brute-force computation that exceeds the TLB or cache size, you can expect roughly a 5-15% performance gain compared to having to do RAM-based translation table lookups - the penalty is that every software error is entirely unguarded (you can lock some pages statically with monolithic linking, though).
On high-bandwidth work, especially with a lot of small message-passing, you can easily obtain even 500% acceleration by going kernel-space, either by completely removing the (multi-tasking) OS, or by loading your application as a kernel driver, circumventing the entire abstraction as well. We've been able to push the network latency on MAC-layer pings from 18us down to 1.3us.
On computation that does fit inside L1 cache, I'd expect minimal improvement (around 1%).
Does it all matter? Yes and no. If your hardware costs vastly exceed your engineering costs, and you have done all the algorithmic improvements you can think of (better yet, proved that the computation done is exactly the computation required for the result!), this can give meaningful performance benefits. An extra 3% (overall average gain) on a supercomputer costing approximately $8M/y in electricity, not including hardware amortization, is worth roughly $240k/y - more than enough to pay an engineer for a month to optimize the most common tasks it runs :).
Assuming you're running a decent machine and the OS is not doing much else: Not a large factor, I'd expect less than a 10% improvement.
The OS just 'idling' doesn't (and shouldn't) take up much of the CPU's processing power. If it does, you need a better machine, a better OS, a reinstall, or some combination of these.
If, on the other hand, you're running a bunch of other resource-intensive things, obviously expect that this can be sped up a lot by just not running those other things.
If you're not a super-user, you may be surprised to find that there are a ton of (non-OS) processes running in the background; these are more likely than the OS to take up CPU processing power.
Slightly off topic but related: keep in mind that if you're running on 8 cores you can, in a perfect world, speed up the process by 8x through multi-threading (a minimal sketch follows).
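As a minimal illustration of that 8x idea (and tying it back to the prime brute-forcing example in the question), here is a C++ sketch that splits the candidate range across all available cores with std::thread; the limit and the trial-division test are placeholders, not anyone's actual code:

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

// Naive trial-division primality test - deliberately simple, purely CPU-bound work.
static bool is_prime(std::uint64_t n) {
    if (n < 2) return false;
    for (std::uint64_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

int main() {
    const std::uint64_t limit = 5000000;   // placeholder search range
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::uint64_t> counts(nthreads, 0);
    std::vector<std::thread> workers;

    // Interleave candidates across threads so the work stays roughly balanced.
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([t, nthreads, limit, &counts] {
            for (std::uint64_t n = 2 + t; n < limit; n += nthreads)
                counts[t] += is_prime(n) ? 1 : 0;
        });
    }
    for (auto& w : workers) w.join();

    std::uint64_t total = 0;
    for (auto c : counts) total += c;
    std::cout << "primes below " << limit << ": " << total << "\n";
}
```

In a perfect world this scales with the core count; in practice memory bandwidth, uneven work per candidate and turbo/thermal behaviour usually keep it somewhat below 8x on 8 cores.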
Expect a much bigger improvement from applying known solutions to your problems and making better use of data structures and algorithms, and, to a lesser extent, from the choice of language and from micro-optimizations.
From my experience:
Not the most scientific or trustworthy result, but most of the time when I open Task Manager on Windows, all the OS processes are below 1% of the CPU.
There is a super-computer answer, and a multi-cores answer already, so here is the GPGPU answer.
When a super-computer is overkill, but a multi-core CPU is under-powered, and your algorithm is sensibly parallelizable, consider adapting it to a GPGPU. Many of the benefits of a super-computer solution are available, in reduced form at reduced cost, by performing CPU-intensive tasks on a GPGPU.
Here is a link to an analysis I performed last year on implementing, and tuning, a brute-force solution to the Travelling Salesman Problem using a compute capability 2.0 NVIDIA Graphics card, CUDAfy, and C#.
I am running only one program on my computer to crunch numbers, and it takes up about 25% CPU (all other built-in applications are at less than 4% CPU). Since this is the only program I am running, how do I raise the CPU percentage from 25% to 40%? I know changing the priority or the affinity doesn't really help that much. I am using Windows 10. Thanks for the help!
Distributing a demanding computational task (aka number crunching) among multiple processors or cores is usually not a trivial challenge. The likelihood of success depends on how easy it is to divide the problem into sub-problems, each of which either doesn't need to communicate with the others or needs a sufficiently small amount of communication that the overhead does not spoil all the speed gain you could theoretically get from using multiple processors.
As it stands, this is usually a case-by-case decision. If you are lucky, there is a special library for your problem domain ready for you to use.
Examples of problems that lend themselves to parallelization (quite) well are
video encoding (different time sections are practically independent of each other and can be encoded separately)
fractals like Mandelbrot (any area of the fractal is completely independent of the others; see the sketch after this list)
explicit structural mechanics equations like crash solvers (solution volumes interact only across surface boundaries, so some communication between processors is necessary, but not much)
Examples of things that don't go so well with parallelization:
dense matrix inversion (maximum dependence of each component on every other component)
implicit structural mechanics equations, like nonlinear equilibrium solution (requires matrix inversion to solve, so same problem as before)
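To make the Mandelbrot example above concrete, here is a minimal C++ sketch of the embarrassingly parallel case (my own illustration, with a placeholder image size and any output left out): each thread computes its own subset of rows and never needs to talk to the others until the join at the end.

```cpp
#include <algorithm>
#include <complex>
#include <thread>
#include <vector>

// Escape-time iteration count for one point of the Mandelbrot set.
static int mandel(std::complex<double> c, int max_iter = 256) {
    std::complex<double> z = 0.0;
    int i = 0;
    while (i < max_iter && std::norm(z) <= 4.0) { z = z * z + c; ++i; }
    return i;
}

int main() {
    const int width = 1024, height = 768;   // placeholder image size
    std::vector<int> image(width * height);
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        // Each thread handles every nthreads-th row; rows never interact.
        workers.emplace_back([=, &image] {
            for (int y = static_cast<int>(t); y < height; y += static_cast<int>(nthreads))
                for (int x = 0; x < width; ++x) {
                    std::complex<double> c(-2.0 + 3.0 * x / width,
                                           -1.2 + 2.4 * y / height);
                    image[y * width + x] = mandel(c);
                }
        });
    }
    for (auto& w : workers) w.join();
    // image[] now holds the iteration counts; writing them out is omitted here.
}
```

Because no row depends on any other, this scales close to linearly with the core count, which is exactly why it sits in the 'parallelizes well' list; the crash-solver case would additionally need boundary exchanges between threads.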
You (probably) can't do that.
The reason you cannot do that is probably that the program you are running is single-threaded and you have a quad-core processor. 25% is a quarter of the whole (processor). This means that one core of your four-core processor is fully used, resulting in 25% usage.
Unless you can make your software multi-threaded (that is, make it use multiple cores in parallel), you are stuck with this limit.
Preface: I'm sorry that this is a very open-ended question, since it would be quite complex to go into the exact problem I am working on, and I think an abstract formulation also contains the necessary detail. If more details are needed, though, feel free to ask.
Efficiency in GPU computing comes from being able to parallelize calculations over thousands of cores, even though these run more slowly than traditional CPU cores. I am wondering if this idea can be applied to the problem I am working on.
The problem I am working on is an optimisation problem, where a potential solution is generated, the quality of this solution is calculated, and the result is compared to the current best solution, in order to approach the best solution possible.
In the current algorithm, a variation of gradient descent, the calculation of this penalty is what takes by far the most processor time (profiling suggests around 5% of the time is used to generate a new valid possibility, and 95% of the time is used to calculate the penalty). However, calculating this penalty is quite a complex process, where different parts of the (potential) solution depend on each other and are subject to multiple different constraints for which a penalty may be given to the solution - the data model for this problem currently takes over 200 MB of RAM to store.
Are there strategies for writing an algorithm for such a problem on the GPU? My problem is currently that the data model needs to be loaded for each processor core/thread working on the problem; since generating a new solution takes so little time, it would be inefficient to start using locks and have to wait for a processor to be done with its penalty calculation.
A GPU obviously doesn't have this amount of memory available for each of its cores. However, my understanding is that if the model were stored in RAM, the overhead of communication between the GPU and the CPU would greatly slow down the algorithm (currently around 1 million of these penalty calculations are performed every second on a single core of a fairly modern CPU, and I'm guessing a million transfers of data to the GPU every second would quickly become a bottleneck).
If anyone has any insights, or even a reference to a similar problem, I would be most grateful, since my own searches have not yet turned up much.
I have been doing scientific computing in C, Python and Matlab. When I run a piece of code on a desktop PC, it might take hours to complete. However, during this time, less than 100% of the CPU and less than 100% of the memory are used.
Where is the bottleneck then? Naive question: Why can't the PC throw more processing power at the algorithm to make it run faster?
Edit
In particular, I am currently running a vectorized loop (that does not do any I/O) in Matlab that has been going for 2 hours, and Task Manager says 38-40% CPU usage (and 28% memory) all this time. Why doesn't the PC use 90% CPU instead and finish faster?
Are you doing any I/O? Are you running any other processes?
At any instant, the computer is either running your program (100%) or something else (0%), so what you see is a time-average.
If you do any I/O, your program has to wait while it happens, and that comes off of the 100%.
As far as memory, your program uses what it uses, which may or may not be all the RAM available.
BTW, just because it's using 100% of the CPU doesn't mean it's being fast. Your Python and Matlab code is likely to require 10-100 cycles to do the same thing C does in one cycle, simply because those languages are interpreted and/or do a lot more memory management.
Try changing the process's priority (or processes' priorities, if more than one is used). Depending upon the OS, you can practically cause the entire CPU to be dedicated to your process. Just be sure you understand the implications. I have done this for testing.
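For what it's worth, here is a minimal sketch of what 'changing the priority' looks like from inside your own C/C++ code (for Matlab you would do the same thing from Task Manager); the APIs shown are the standard Win32 and POSIX calls, and the exact effect of each level is OS-specific:

```cpp
#include <iostream>

#ifdef _WIN32
#include <windows.h>
#else
#include <sys/resource.h>
#endif

// Raise the priority of the current process before starting the heavy work.
// Note: this only helps when something else is competing for the CPU; it does
// not make a single-threaded program use more than one core.
int main() {
#ifdef _WIN32
    if (!SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS))
        std::cerr << "SetPriorityClass failed\n";
#else
    // Negative nice values usually require elevated privileges.
    if (setpriority(PRIO_PROCESS, 0, -10) != 0)
        std::cerr << "setpriority failed (insufficient rights?)\n";
#endif
    // ... run the number crunching here ...
}
```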
Nowadays the majority of CPUs feature multiple cores, and the program may run faster if executed on multiple threads. However, the languages you list do not support multi-threading very easily (while in C you may try pthreads or MPI).
A quick solution may be to simply run two or three instances of your program at the same time, if you need to try different input data or algorithm versions, for instance. It also seems you have enough memory for this.
I'm very interested in learning FPGA development. I've found a bunch of "getting started with FPGA" questions here, and other tutorials and resources on the internet. But I'm primarily interested in using FPGAs as an accelerator, and I can't figure out what devices will actually offer a speed up over a desktop CPU (say a recent i7).
My particular interest at the moment is cellular automata (and other parallel environments like neural networks and agent-based modeling). I'd like to experiment with 3D or higher-dimensional cellular automata. My question is: will the low-cost $100-$200 starter kits provide something that has the potential to produce a significant speed-up over a desktop CPU? Or would I need to spend more and get a higher-end FPGA?
The FPGA can be a very good accelerator, but (and this is a big BUT) it is usually very expensive. We have machines here like the BEEcube, a Convey, or Dini's 'Godzilla's Part-Time Nanny', and they are all very expensive (>$10k), and even with these machines many applications can be accelerated better with a standard CPU cluster or GPUs. The FPGA looks a bit better when the total cost of ownership is considered, as you usually get better energy efficiency.
But there are applications that you can accelerate. On the lower scale you can/should do a rough estimate of whether it is worth it, but for that you need more concrete numbers about your application. Consider a standard desktop CPU: it usually has at least 4 cores (or two with hyperthreading, not to mention the vector units) and clocks at, say, 3 GHz. This results in 12 Gcycles per second of computation power. The (cheap) FPGAs you can get run at around 250 MHz (better ones can reach up to 500 MHz, but that requires very friendly designs and very good speed grades), so you need approximately 50 operations in parallel to compete with the CPU (actually it's a bit better than that, because the CPU usually doesn't have single-cycle ops, but then it also has vector operations, so call it roughly equal).
50 operations in parallel sounds like a lot, and it is hard, but it is doable (the magic word here is pipelining). So you should know exactly how you are going to implement your design in hardware and what degree of parallelism you can use.
Even if you solve that parallelism problem, we now come to the real problem: the memory.
The above-mentioned accelerators have so much compute capacity that they could do thousands of things in parallel, but the real problem with such computation power is how to get the data into and out of them. And you have this problem at your small scale too. In your desktop PC the CPU transfers more than 20 GB/s to/from memory (good GPU cards manage 100 GB/s and more), while your small $100-$200 accelerator gets at most (if you are lucky) 1-2 GB/s over PCI Express.
Whether it is worth it for you depends completely on your application (and here you need far more detail than '3D cellular automata'): you must know the neighbourhoods, the required precision (double, single float, integer or fixed-point?), and your use case (do you transfer the initial cell values, let the machine compute for two days and then transfer the cell values back, or do you need the cell values after every step? This makes a huge difference in the bandwidth required during computation).
But overall, without knowing more, I would say: it is worth the $100-$200.
Not because you will compute your cellular automata faster (which I don't believe), but because you will learn. And you will not only learn hardware design and FPGA development; I always see with our students here that, along with the hardware-design knowledge, they also gain a far better understanding of how hardware actually looks and behaves. Sure, nothing you do on your FPGA is directly related to the interior of a CPU, but many get a better feeling for what hardware in general is capable of, which in turn makes them even more effective software developers.
But I also have to admit: you are going to pay a much higher price than just the $100-$200 - you have to spend a great deal of time on it.
Disclaimer: I work for a reconfigurable system developer/manufacturer.
A short answer to your question "will the low-cost $100-$200 starter kits provide something that has the potential to produce a significant speed-up over a desktop CPU" is: probably not.
A longer answer:
A microprocessor is a set of fixed, shared functional units tuned to perform reasonably well across a broad range of applications. The operating system and compilers do a good job of making sure that these fixed, shared functional units are utilized appropriately.
FPGA based systems get their performance from dedicated, dense, computational efficiency. You create exactly what you need to execute your application, no more, no less - and whatever you create is not shared with any other user, process, operating system, whatever. If you need 80 floating point units, you create 80 dedicated floating point units that run in parallel. Compare that to a microprocessor scheduling floating point operations across some smaller number of floating point units. To get performance faster than a microprocessor, you have to instantiate enough dedicated FPGA-based functional units to make a performance difference vs. a microprocessor. This often requires the resources in the larger FPGA devices.
An FPGA alone is not enough. If you create a large number of efficient computational engines in an FPGA you -have- to keep these engines fed with data. This requires some number of high-bandwidth connections to large amounts of data memory around the FPGA. What you often see with I/O-based FPGA cards is that some of the potential performance gain is diminished by moving data back and forth across the I/O bus.
As a data point, my company uses the '530 Stratix IV FPGA from Altera. We surround it with several directly coupled memories and tie this subsystem directly into the microprocessor memory. We get several advantages over microprocessor systems for many applications, but this is not a $100-$200 starter kit, this is a full-blown integrated system.
I'm wondering what kind of performance hit numerical calculations take in a virtualized setting. More specifically, what kind of performance loss can I expect from running CPU-bound C++ code in a virtualized Windows OS as opposed to a native Linux one, on rather fast x86_64 multi-core machines?
I'll be happy to add details as needed, but as I don't know much about virtualization, I don't know what information is relevant.
Processes are just bunches of threads, which are streams of instructions executing in a sequential fashion. In modern virtualisation solutions, as far as the CPU is concerned, the host and the guest processes execute together and differ only in that the I/O of the latter is trapped and virtualised. Memory is also virtualised, but that occurs more or less in the hardware MMU. Guest instructions are executed directly by the CPU (otherwise it would not be virtualisation but rather emulation), and as long as they do not access any virtualised resources they execute just as fast as host instructions. In the end it all depends on how well the CPU can cope with the increased number of running processes.
There are lightweight virtualisation solutions like zones in Solaris that partition the process space in order to give the appearance of multiple copies of the OS but it all happens under the umbrella of a single OS kernel.
The performance hit for pure computational codes is very small, often under 1-2%. The catch is that in reality all programs read and write data and computational codes usually read and write lots of data. Virtualised I/O is usually much slower than direct I/O even with solutions like Intel VT-* or AMD-V.
Exact numbers depend heavily on the specific hardware.
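Since the exact numbers depend on your hardware, the most reliable approach is simply to measure: build the same CPU-bound kernel and time it both natively and inside the guest. A minimal sketch (the workload below is just a placeholder, not a rigorous benchmark):

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>

// A purely CPU-bound placeholder workload: repeated floating-point updates.
// Run the identical binary natively and inside the VM and compare the times.
int main() {
    const std::int64_t iterations = 500000000;
    volatile double x = 1.0;   // volatile keeps the loop from being optimized away

    const auto start = std::chrono::steady_clock::now();
    for (std::int64_t i = 0; i < iterations; ++i)
        x = x * 1.0000001 + 0.0000001;
    const auto stop = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(stop - start).count();
    std::cout << "result " << x << ", elapsed " << seconds << " s\n";
}
```

For a pure compute loop like this, the native and virtualised times typically end up within a few percent of each other, consistent with the 1-2% figure above; the gap widens as soon as I/O or frequent system calls enter the picture.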
Goaded by @Mitch Wheat's unarguable assertion that my original post here was not an answer, here's an attempt to recast it as one:
I work mostly on HPC in the energy sector. Some of the computations that my scientist colleagues run take O(10^5) CPU-hours, and we're seriously thinking about O(10^6) CPU-hour jobs in the near future.
I get paid well to squeeze every last drop of performance out of our codes, and I'd think it was a good day's work if I could knock 1% off the run-time of some of our programs. Sometimes it has taken me a month to get that sort of performance improvement; sure, I may be slow, but it's still cost-effective for our scientists.
I shudder, therefore, when bright salespeople come offering the latest and best in data-center software (of which virtualization is one aspect), which will only, as I see it, shackle my codes to a pile of anchor chain from a 250,000 dwt tanker (that was a metaphor).
I have read the question carefully and understand that the OP is not proposing that virtualization would help; I'm offering the perspective of a practitioner. If this is still too much of a comment, do the SO thing and vote to close; I promise I won't be offended!