Related
I am planning to write something to take advantages of the many devices that I have at home.
Basically my aim is to use the laptop to execute calculations, and also to use my main desktop computer to add more power (and finish the task quicker). I work with cellular simulation and chemical interactions, so to me would be great to take advantage of all that I have available at home.
I am using mainly OSX, so I need something that may work with that OS. I can code in objective-C, C and C++.
I am aware of GCD, OpenCL and MPI, but I am not sure which way to go.
I was planning to not use the full power of my desktop but only some of the available cores (in this way I can continue to work on the desktop doing other tasks that are not so resource intensive). In particular I would love to use the graphic card power (it is an ATI card, so no CUDA), since all that I do mainly is spreadsheet, word and coding with Xcode, and the graphic card resources are basically unused in that scenario.
Is there a specific set of libraries or API, among the aforementioned 3, that would allow me to selectively route tasks, and use resources on another machine without leaving the control totally to the compiler? I've heard that GCD is great but it has very limited control on where the blocks are executed, while MPI is on the other side of the spectrum....OpenCL seems to be in the middle.
Before diving in one of these technologies I would like to know which one would most likely suit my needs; I am sure that some other researcher has already used successfully parallel computing to achieve what I am trying to achieve.
Thanks in advance.
MPI is more for scientific computing large scale many processors many nodes exc not for a weekend project, for what you describe I would suggest using OpenCl or any one the more distributed framework of AMQP protocol families, such as zeromq or rabbitMQ, or a combination of OpenCl and AMQP , or even simpler consider multithreading , i would suggest OpenMP for that. I'm not sure if you are looking for direct solvers or parallel functions but there are many that exist as well for gpu's and cpu's which you can find on the web
Sorry, but this question simply cannot be meaningfully answered as posed. To be sure, I could toss out a collection of buzzwords describing various technologies to look at like GCD, OpenMPI, OpenCL, CUDA and any number of other technologies which allow one to run a single program on multiple cores, multiple programs on different cooperating computers, or a single program distributed across CPU and GPU, and it sounds like you know about a number of those already so I wouldn't even be adding much value in listing the buzzwords.
To simply toss out such terms without knowing the full specifics of the problem you're trying to solve, however, is a bit like saying that you know English, French and a little German so sure, by all means - mix them all together in a single paragraph without knowing anything about the target audience! Similarly, you can parallelize a given computation in any number of ways, across any number of different processing elements, but whether that parallelization is actually a win or not is going to be entirely dependent on the nature of the algorithm, its data dependencies, how much computation is expected for each reasonable "work chunk", and whether it can be executed on a GPU with sufficient numeric precision, among many other factors. The more complex the technology you choose, the more those factors matter and the greater the possibility that the resulting code will actually be slower than its single-threaded, single machine counterpart. IPC overhead and data copying can, and frequently do, swamp all of the gains one might realize from trying to naively parallelize something and then add additional overhead on top of that, resulting in a net loss. This is why engineers who can do this kind of work meaningfully and well are in such high demand. :)
Without knowing anything about your calculations, I would move in baby steps. First try a simple multi-processor framework like GCD (which is already built in to OS X and requires no additional dependencies to use) and figure out how to factor your code such that it can effectively use all of the available cores on a single machine. Once you've learned where the wins are (and if there even are any - if multi-threading isn't helping, multi-machine parallelization almost certainly won't either), try setting up several instances of the calculation on several machines with a simple IPC model that allows for distributing the work. Having already factored your algorithm(s) for multiple threads, it should be comparatively straight-forward to further generalize the approach across multiple machines (though it bears noting that the two are NOT the same problem and either way you still want to use all the cores available on any of the given target machines, so the two challenges are both complimentary and orthogonal).
This is actually a 2 part question:
For people who want to squeeze every clock cycle, people talk about pipelines, cache locality, etc.
I have seen these low level performance techniques mentioned here and there but I have not seen a good introduction to the subject, from start to finish. Any resource recommendations? (Google gave me definitions and papers, where I'd really appreciate some kind of worked examples/tutorials real-life hands-on kind of materials)
How does one actually measure this kind of things? Like, as in a profiler of some sort? I know we can always change the code, see the improvement and theorize in retrospect, I am just wondering if there are established tools for the job.
(I know algorithm optimization is where the orders of magnitudes are. I am interested in the metal here)
The chorus of replies is, "Don't optimize prematurely." As you mention, you will get a lot more performance out of a better design than a better loop, and your maintainers will appreciate it, as well.
That said, to answer your question:
Learn assembly. Lots and lots of assembly. Don't MUL by a power of two when you can shift. Learn the weird uses of xor to copy and clear registers. For specific references,
http://www.mark.masmcode.com/ and http://www.agner.org/optimize/
Yes, you need to time your code. On *nix, it can be as easy as time { commands ; } but you'll probably want to use a full-features profiler. GNU gprof is open source http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html
If this really is your thing, go for it, have fun, and remember, lots and lots of bit-level math. And your maintainers will hate you ;)
EDIT/REWRITE:
If it is books you need Michael Abrash did a good job in this area, Zen of Assembly language, a number of magazine articles, big black book of graphics programming, etc. Much of what he was tuning for is no longer a problem, the problems have changed. What you will get out of this is the ideas of the kinds of things that can cause bottle necks and the kinds of ways to solve. Most important is to time everything, and understand how your timing measurements work so that you are not fooling yourself by measuring incorrectly. Time the different solutions and try crazy, weird solutions, you may find an optimization that you were not aware of and didnt realize until you exposed it.
I have only just started reading but See MIPS Run (early/first edition) looks good so far (note that ARM took over MIPS as the leader in the processor market, so the MIPS and RISC hype is a bit dated). There are a number of text books old and new to be had about MIPS. Mips being designed for performance (At the cost of the software engineer in some ways).
The bottlenecks today fall into the categories of the processor itself and the I/O around it and what is connected to that I/O. The insides of the processor chips themselves (for higher end systems) run much faster than the I/O can handle, so you can only tune so far before you have to go off chip and wait forever. Getting off the train, from the train to your destination half a minute faster when the train ride was 3 hours is not necessarily a worthwhile optimization.
It is all about learning the hardware, you can probably stay within the ones and zeros world and not have to get into the actual electronics. But without really knowing the interfaces and internals you really cannot do much performance tuning. You might re-arrange or change a few instructions and get a little boost, but to make something several hundred times faster you need more than that. Learning a lot of different instruction sets (assembly languages) helps get into the processors. I would recommend simulating HDL, for example processors at opencores, to get a feel for how some folks do their designs and getting a solid handle on how to really squeeze clocks out of a task. Processor knowledge is big, memory interfaces are a huge deal and need to be learned, media (flash, hard disks, etc) and displays and graphics, networking, and all the types of interfaces between all of those things. And understanding at the clock level or as close to it as you can get, is what it takes.
Intel and AMD provide optimization manuals for x86 and x86-64.
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html/
http://developer.amd.com/documentation/guides/pages/default.aspx
Another excellent resource is agner.
http://www.agner.org/optimize/
Some of the key points (in no particular order):
Alignment; memory, loop/function labels/addresses
Cache; non-temporal hints, page and cache misses
Branches; branch prediction and avoiding branching with compare&move op-codes
Vectorization; using SSE and AVX instructions
Op-codes; avoiding slow running op-codes, taking advantage of op-code fusion
Throughput / pipeline; re-ordering or interleaving op-codes to perform separate tasks avoiding partial stales and saturating the processor's ALUs and FPUs
Loop unrolling; performing multiple iterations for a single "loop comparison, branch"
Synchronization; using atomic op-code (or LOCK prefix) to avoid high level synchronization constructs
Yes, measure, and yes, know all those techniques.
Experienced people will tell you "don't optimize prematurely", which I relate as simply "don't guess".
They will also say "use a profiler to find the bottleneck", but I have a problem with that. I hear lots of stories of people using profilers and either liking them a lot or being confused with their output.
SO is full of them.
What I don't hear a lot of is success stories, with speedup factors achieved.
The method I use is very simple, and I've tried to give lots of examples, including this case.
I'd suggest Optimizing subroutines in assembly
language
An optimization guide for x86 platforms.
It's quite heavy stuff though ;)
This is a follow-up to this question. I'm currently writing a simple game and am looking for the fastest way to (repeatedly) display an array of RGB data in a Win32 window, without flickering or other artifacts.
Several different approaches were recommended in the answers to the previous question, but there was no consensus on which would be the fastest. So, I threw together a test program. The code simply displays a framebuffer on the screen repeatedly, as fast as possible.
These are the results I obtained, for 32-bit data running in a 32-bit video mode - they may surprise some people:
- Direct3D (1): 500 fps
- Direct3D (2): 650 fps
- DirectDraw (3): 1100 fps
- DirectDraw (4): 800 fps
- GDI (SetDIBitsToDevice): 2000 fps
Given these figures:
Why are many people adamant that GDI is simply too slow for this operation?
Is there any reason to prefer DirectDraw or Direct3D over SetDIBitsToDevice?
Here is a brief summary of the calls made by each of the Direct* codepaths. If anyone knows a more efficient way to use DirectDraw/Direct3D, please comment.
1. CreateTexture(D3DUSAGE_DYNAMIC, D3DPOOL_DEFAULT);
LockRect(); memcpy(); UnlockRect(); DrawPrimitive()
2. CreateTexture(0, D3DPOOL_SYSTEMMEM); CreateTexture(0, D3DPOOL_DEFAULT);
LockRect(); memcpy(); UnlockRect(); UpdateTexture(); DrawPrimitive()
3. CreateSurface(); SetSurfaceDesc(lpSurface = &frameBuffer[0]);
memcpy(); primarySurface->Blt();
4. CreateSurface();
Lock(); memcpy(); Unlock(); primarySurface->Blt();
There are a couple of things to keep in mind here. First of all, a lot of "common knowledge" is based on some facts that no longer really apply.
In the days of AGP, when the CPU talked directly to the GPU, it always used the base PCI protocol, which happened at the "1x" rate (always and inevitably). AGX 2x/4x/8x only applied when the GPU was taking to the memory controller directly. In other words, depending on when you looked, it was up to 8 times as fast to have the GPU load a texture from memory as it was for the CPU to send the same data directly to the GPU. Of course, the CPU also had a great deal more bandwidth to memory than the PCI bus supported.
When things switched to PCI-E, however, that changed completely. While there can be differences in bandwidth depending on path, there's no general rule that memory->GPU will be faster than CPU->GPU. The one generalization that's (mostly) safe is that if you have a dedicated graphics card, then the GPU will almost always have more bandwidth to the memory on the graphics card than it does to main memory on the motherboard.
In your case, that doesn't matter much though -- you're talking about moving data from CPU space to GPU space regardless. The main speed difference with using DirectX (or OpenGL) happens when you keep all (or most) of the computation on the GPU, and avoid using the CPU (or main memory) at all. They don't (now that AGP is history) provide any substantial improvement in memory->display bandwidth.
Jerry Coffin makes some good points. The thing to bear in mind is what the DI stands for in SetDIBitsToDevice. It stands for Device Independent. Which means you were ALWAYS at the mercy of drivers. Some drivers used to be complete rubbish and it affected the performance massively. DirectDraw suffered from similar issues as well ... but you also had access to the hardware blitters so it was generally more useful. IHVs also tended to put more time in to writing proper drivers for DirectDraw because of its gaming association. Who wants to be the bottom of the performance pile when the hardware is quite capable of doing better?
These days many graphics cards can accept the bit data directly so no conversion happens. If it does need to be swizzled this is also INCREDIBLY quick in this day and age.
The reason your Direct3D performance is so terrible, by comparison, is that Direct3D, by nature of the fact it is meant to be used totally internally to the GPU, uses odd and complex formats to improve cache performance and so forth.
Couple that with the fact that you aren't testing like for like (with DDraw and D3D) by creating a texture/surface, locking it, copying, unlocking and then drawing over the back buffer (via various methods). To get best performance you'd be best off directly locking the backbuffer using a DISCARD lock then memcpy'ing directly into the returned buffer before unlocking. This will bring your performance much closer to the SetDIBitsToDevice. I still would expect D3D to be slower than DDraw, however, for the reasons outlined above.
The reason you will hear people trounce on GDI is that it used to just be old windows API calls. The newer versions of it (that were called GDI+ when I last looked at em) are actually just an API placed on top of DirectX calls. So using GDI may seem fairly simple programming wise at times, but adding a layer between things always slows things down. As mentioned in the response from Jerry Coffin, your examples are about moving the data, and that is the slow time. I am a bit surprised that DirectX is that much slower though but I can not be much more help with out digging through the DirectX documentation (which has been pretty awesome for quite some time really.. Might want to check out www.codesampler.com. I have always found good starting places from him and actually, while I may be insane for saying this, I would swear the improvements to the DirectX SDK in doc and examples were done based on this guys work!)
As for the DirectDraw vs Direct3D (and not the GDI calls) discussion. I would say go to Direct3D. I believe DirectDraw has been deprecated since 8.0 or so, and 9.0 has been around for quite a long while. And at the end of the day all of DirectX is 3D, it just varies on the levels of helpful 2D apis that are around, but you may find you can do some very interesting things in a 2D environment when you are actually using 3D space. (I had a pretty neat randomly generated lightning weapon for a space invaders clone at one time :))
Anywho, hope this helped!
PS: It should be noted that DirectX is not always the fastest. For keyboard input (unless this has changed in 10 or 11) it has pretty much always been recommended to use the windows events.. as DirectInput was actually just a wrapper for that system!.. XInput however is -awesome-!!
As a software developer dealing mostly with high-level programming languages I'm not sure what I can do to appropriately pay attention to the upcoming omni-presence of multicore computers. I write mostly ordinary and non-demanding applications, nevertheless I think it is important to know if I need to change any programming paradigms or even language to master the future.
My question therefore:
How to deal with increasing multicore presence in day-by-day hacking?
Herb Sutter wrote about it in 2005: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software
Most problems do not require a lot of CPU time. Really, single cores are quite fast enough for many purposes. When you do find your program is too slow, first profile it and look at your choice of algorithms, architecture, and caching. If that doesn't get you enough, try to divide the problem up into separate processes. Often this is worth doing simply for fault isolation and so that you can understand the CPU and memory usage of each process. Also, normally each process will run on a specific core and make good use of the processor caches, so you won't have to suffer the substantial performance overhead of keeping cache lines consistent. If you go for a multi process design and still find problem needs more CPU time than you get with the machine you have, you are well placed to extend it run over a cluster.
There are situations where you need multiple threads within the same address space, but beware that threads are really hard to get right. Race conditions, especially in non-safe languages, sometimes take weeks to debug; often, simply adding tracing or running under a debugger will change the timings enough to hide the problem. Simply putting locks everywhere often means you get a lot of locking overhead and sometimes so much lock contention that you don't really get the concurrency advantage you were hoping for. Even when you've got the locking right, you then need to profile to tune for cache coherency. Ultimately, if you want to really tune some highly concurrent code, you'll probably end up looking at lock-free constructs and more complex locking schemes than those in current multi-threading libraries.
Learn the benefits of concurrency, and the limits (e.g. Amdahl's law).
So you can, where possible, exploit the only route for higher performance that is going to be open. There is a lot of innovative work happening on easier approaches (futures and task libraries), and old work being rediscovered (functional languages and immutable data).
The free lunch is over, but that does not mean that there is nothing to exploit.
In general, become very friendly with threading. It's a terrible mechanism for parallelization, but it's what we have.
If you do work with .NET, look at the Parallel Extensions. They allow you to easily accomplish many parallel programming tasks.
To benefit from more that just one core you should consider parallelizing your code. Multiple threads, immutable types, and a minimum of synchronization are your new friends.
I think it will depend on what kind of applications you're writing.
Some kind of apps benefit more of the fact that they're run on a mutli-core cpu then others.
If your application can benefit from the multi-core fact, then you should be ready to go parallel.
The free lunch is over; that is: in the past, your application became faster when a new cpu was released and you didn't have to put any effort in your application to get that extra speed.
Now, to take advantage of the capabilities a multi-core cpu offers, you've to make sure that your application can take advantage of it. That is: you've to see which tasks can be executed multithreaded / concurrently, and this brings some issues to the table ...
Learn Erlang/F# (depending on your platform)
Prefer immutable data structures, their use makes software easier to understand not only in concurrent programs.
Learn the tools for concurrency in your language (e.g. java.util.concurrent, JCIP).
Learn a functional language (e.g Haskell).
I've been asked the same question, and the answer is, "it depends". If your Joe Winforms, maybe not so much. If your writing code that must be performant, yes. One of the biggest problem I can see with parallel programming is this: if something can't be parallized, and you lie and tell the run-time to do in parallel anyways, it's not going to crash, it's just going to do things wrong, and you'll get crap results and blame the framework.
Learn OpenMP and MPI for C and C++ code.
OpenMP also applies to other languages as well like Fortran I suppose.
Write smaller programs.
Other code languages/styles will let you do multithreading better (though multithreading is still really hard in any language) but the big benefit for regular developers, IMHO, is the ability to execute lots of smaller programs concurrently to accomplish some much larger task.
So, get in the habit of breaking your problems down into independent components that can be run whenever you want.
You'll build more maintainable software too.
I'd like to do some work with the nitty-gritties of computer imaging. I'm looking for a way to read single pixels of data, analyze them programatically, and change them. What is the best language to use for this (Python, c++, Java...)? What is the best fileformat?
I don't want any super fancy software/APIs... I'm looking for the bare basics.
If you need speed (you'll probably always want speed with image processing) you definitely have to work with raw pixel data.
Java has some real disadvantages as you cannot access memory directly which makes pixel access quite slow compared to accessing the memory directly.
C++ is definitely the language of choice for production use image processing. But you can, for example, also use C# as it allows for unsafe code in specific areas. (Take a look at the scan0 pointer property of the bitmapdata class.)
I've used C# successfully for image processing applications and they are definitely much faster than their java counterparts.
I would not use any scripting language or java for such a purpose.
It's very east to manipulate the large multi-dimensional or complex arrays of pixel information that are pictures using high-level languages such as Python. There's a library called PIL (the Python Imaging Library) that is quite useful and will let you do general filters and transformations (change the brightness, soften, desaturate, crop, etc) as well as manipulate the raw pixel data.
It is the easiest and simplest image library I've used to date and can be extended to do whatever it is you're interested in (edge detection in very little code, for example).
I studied Artificial Intelligence and Computer Vision, thus I know pretty well the kind of tools that are used in this field.
Basically: you can use whatever you want as long as you know how it works behind the scene.
Now depending on what you want to achieve, you can either use:
C language, but you will lose a lot of time in bugs checking and memory management when implementing your algorithms. So theoretically, this is the fastest language to do that kind of job, but if your algorithms are not computationnally efficient (in terms of complexity) or if you lose too much time in bugs checking, this is clearly not worth it. So I would advise to first implement your application in another language, and then later you can always optimize small parts of your code with C bindings.
Octave/MatLab: very efficient language, almost as much as C, and you can make very elegant and succinct algorithms. If you are into vectorization, matrix and linear operations, you should go with that. However, you won't be able to develop a whole application with this language, it's more focused on algorithms, but then you can always develop an interface using another language later.
Python: all-in-one elegant and accessible language, used in gigantically large scale applications such as Google and Facebook. You can do pretty much everything you want with Python, any kind of application. It will be perfectly adapted if you want to make a full application (with client interaction and all, not only algorithms), or if you want to quickly draft a prototype using existent libraries since Python has a very large set of high quality libraries, like OpenCV. However if you only want to make algorithms, you should better use Octave/MatLab.
The answer that was selected as a solution is very biaised, and you should be careful about this kind of archaic comment.
Nowadays, hardware is cheaper than wetware (humans), and thus, you should use languages where you will be able to produce results faster, even if it's at the cost of a few CPU cycles or memory space.
Also, a lot of people tends to think that as long as you implement your software in C/C++, you are making the Saint Graal of speedness: this is just not true. First, because algorithms complexity matters a lot more than the language you are using (a bad algorithm will never beat a better algorithm, even if implemented in the slowest language in the universe), and secondly because high-level languages are nowadays doing a lot of caching and speed optimization for you, and this can make your program run even faster than in C/C++.
Of course, you can always do everything of the above in C/C++, but how much of your time are you willing to waste to reinvent the wheel?
Not only will C/C++ be faster, but most of the image processing sample code you find out there will be in C as well, so it will be easier to incorporate things you find.
if you are looking to numerical work on your images (think matrix) and you into Python check out http://www.scipy.org/PyLab - this is basically the ability to do matlab in python, buddy of mine swears by it.
(This might not apply for the OP who only wanted the bare basics -- but now that the speed issue was brought up, I do need to write this, just for the record.)
If you really need speed, it's better to forget about working on the pixel-by-pixel level, and rather see whether the operations that you need to perform could be vectorized. For example, for your C/C++ code you could use the excellent Intel IPP library (no, I don't work for Intel).
It depends a little on what you're trying to do.
If runtime speed is your issue then c++ is the best way to go.
If speed of development is an issue, though, I would suggest looking at java. You said that you wanted low level manipulation of pixels, which java will do for you. But the other thing that might be an issue is the handling of the various file formats. Java does have some very nice APIs to deal with the reading and writing of various image formats to file (in particular the java2d library. You choose to ignore the higher levels of the API)
If you do go for the c++ option (or python come to think of it) I would again suggest the use of a library to get you over the startup issues of reading and writing files. I've previously had success with libgd
What language do you know the best? To me, this is the real question.
If you're going to spend months and months learning one particular language, then there's no real advantage in using Python or Java just for their (to be proven) development speed.
I'm particularly proficient in C++ and I think that for this particular task I can be as speedy as a Java programmer, for example. With the aid of some good library (OpenCV comes to mind) you can create anything you need in a matter of a couple of lines of C++ code, really.
Short answer: C++ and OpenCV
Short answer? I'd say C++, you have far more flexibility in manipulating raw chunks of memory than Python or Java.