I'm currently researching compute shaders. It is said that, under certain circumstances, compute shaders can produce vastly faster rendering results than the standard (say, OpenGL) hardware pipeline.
I understand that it has something to do with work groups and work-group invocations, and therefore with thread utilization and parallel computation.
But I still haven't come across a good source that explains the parallels and differences between the standard rendering pipeline and the compute pipeline in terms of actual performance.
Any help is appreciated :)
I am very new to the CUDA programming model, and to programming in general, I suppose. I'm attempting to parallelize an expectation-maximization (EM) algorithm. I am working on a GTX 480, which has compute capability 2.0. At first I sort of assumed that there's no reason for the device to launch its own threads, but of course I was sadly mistaken. I came across this PDF:
http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
Unfortunately, dynamic parallelism only works on the latest and greatest GPUs, with compute capability 3.5 or higher. Without diving into too many specifics, what is the alternative to dynamic parallelism? The loops in the CPU EM algorithm have many dependencies and are highly nested, which seems to make dynamic parallelism an attractive capability. I'm not sure if my question makes sense, so please ask if you need clarification.
Thank you!
As indicated by @JackOLantern, dynamic parallelism can be described in a nutshell as the ability to call a kernel (i.e. a __global__ function) from device code (a __global__ or __device__ function).
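To make that concrete, here is a minimal, hypothetical sketch of what dynamic parallelism itself looks like. The kernels are placeholders, and on a supporting device this needs relocatable device code (e.g. nvcc -arch=sm_35 -rdc=true):

    // Hypothetical placeholder kernels, only to illustrate a device-side launch.
    __global__ void childKernel(int *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2;                              // stand-in for real work
    }

    __global__ void parentKernel(int *data, int n)
    {
        if (threadIdx.x == 0 && blockIdx.x == 0)
            childKernel<<<(n + 255) / 256, 256>>>(data, n);   // a kernel call made from device code
    }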
Since the kernel call is the principal method by which the machine spins up multiple threads in response to a single function call, there is really no direct alternative that provides all the capability of dynamic parallelism on a device that does not support it (i.e. pre-cc-3.5 devices).
Without dynamic parallelism, your overall code will almost certainly involve more synchronization and communication between CPU code and GPU code.
The principal method would be to recognize some unit of your code as parallelizable, convert it to a kernel, and work through your code in an essentially non-nested fashion. Repetitive functions might be handled via looping in the kernel, or else via looping in the host code that calls the kernel.
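A rough sketch of that host-driven pattern, assuming each EM phase has been turned into its own kernel (the kernel names and bodies below are placeholders, not your actual E and M steps):

    #include <cuda_runtime.h>

    // Placeholder phase kernels -- the real expectation and maximization math would go here.
    __global__ void expectationKernel(const float *data, float *resp, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) resp[i] = data[i];                      // stand-in for the E step
    }

    __global__ void maximizationKernel(const float *resp, float *params, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) params[i] = resp[i];                    // stand-in for the M step
    }

    // The host loop takes the place of nested, device-side launches:
    // one kernel launch per phase, per iteration.
    void runEM(const float *d_data, float *d_resp, float *d_params, int n, int maxIters)
    {
        const int block = 256;
        const int grid  = (n + block - 1) / block;
        for (int iter = 0; iter < maxIters; ++iter)
        {
            expectationKernel<<<grid, block>>>(d_data, d_resp, n);
            maximizationKernel<<<grid, block>>>(d_resp, d_params, n);
            cudaDeviceSynchronize();                       // so the host can check errors or convergence
        }
    }

The extra cost relative to dynamic parallelism is exactly the synchronization and launch traffic between host and device mentioned above.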
For a pictorial example of what I am trying to describe, please refer to slide 14 of this deck, which introduces some of the new features of CUDA 5, including dynamic parallelism. The code architecture on the right is an algorithm realized with dynamic parallelism. The architecture on the left is the same function realized without it.
I have checked your algorithm on Wikipedia, and I'm not sure you need dynamic parallelism at all.
You do the expectation step in your kernel, call __syncthreads(), do the maximization step, and call __syncthreads() again. From this distance, the expectation step looks like a reduction primitive and the maximization step looks like a filter.
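A minimal sketch of that shape, assuming the per-iteration work can be handled by a single block so that __syncthreads() really acts as a barrier for all participating threads (the reduction and the update rule below are placeholders, not the real EM math):

    // Hypothetical single-block kernel: a shared-memory reduction stands in for the
    // expectation step, a per-element update for the maximization step.
    __global__ void emIterationKernel(float *data, int n, int iters)
    {
        __shared__ float partial[256];                     // blockDim.x assumed to be 256 (power of two)
        int tid = threadIdx.x;

        for (int it = 0; it < iters; ++it)
        {
            // "Expectation": reduce the data to a single statistic.
            float sum = 0.0f;
            for (int i = tid; i < n; i += blockDim.x)
                sum += data[i];
            partial[tid] = sum;
            __syncthreads();

            for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
            {
                if (tid < stride) partial[tid] += partial[tid + stride];
                __syncthreads();
            }
            float mean = partial[0] / n;                   // every thread reads the reduced value
            __syncthreads();                               // everyone has read partial[0] before it is reused

            // "Maximization": update each element from the reduced statistic.
            for (int i = tid; i < n; i += blockDim.x)
                data[i] = 0.5f * (data[i] + mean);         // placeholder update rule
            __syncthreads();
        }
    }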
If that doesn't work and you need real task parallelism, a GPU may not be the best choice. While Kepler GPUs can do this to some degree, it is not what the architecture is designed for. In that case you might be better off using a multi-CPU system, such as an office grid, a supercomputer, or a Xeon Phi accelerator. You should also check out OpenMP and MPI; these are the standard tools for task-parallel programming (and in most cases OpenMP amounts to just a handful of pragmas).
I am going to give a lecture on OpenMP, and I want to write a program using OpenMP live. What program do you suggest that demonstrates the most important concepts of OpenMP and shows a noticeable speedup? I want a compelling example program; please help, all of you who are experts in OpenMP.
I am looking for a technical and interesting example with nice output.
I want to write two programs live: the first one to illustrate the most important OpenMP concepts and show an impressive speedup, and the second one as a hands-on exercise that everyone writes along with me at the same time.
My audience may be quite inexperienced.
Personally I wouldn't say that the most impressive aspect of OpenMP is the scalability of the codes you can write with it. I'd say that a more impressive aspect is the ease with which one can take an existing serial program and, with only a few OpenMP directives, turn it into a parallel program with satisfactory scalability.
So I'd suggest that you take any program (or part of any program) of interest to your audience, better yet a program your audience is familiar with, and parallelise it right there and then in your lecture, "live" as you put it. I'd be impressed if a lecturer could show me, say, a 4x speedup on 8 cores with 5 minutes of coding and a recompilation. And that leads on to all sorts of interesting topics about why you don't (always, easily) get an 8x speedup on 8 cores.
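To illustrate just how small the edit is, here is a toy sketch (a numerical integration of 4/(1+x^2) to approximate pi, chosen only because the change is a single directive; it is of course embarrassingly parallel, so treat it as a demonstration of the mechanics rather than of the interesting cases):

    #include <omp.h>
    #include <stdio.h>

    // Serial loop plus one directive: integrate 4/(1+x^2) over [0,1] to estimate pi.
    int main(void)
    {
        const long n = 100000000;
        const double h = 1.0 / n;
        double sum = 0.0;

        double t0 = omp_get_wtime();
        #pragma omp parallel for reduction(+:sum)   // the one-line change
        for (long i = 0; i < n; ++i)
        {
            double x = (i + 0.5) * h;
            sum += 4.0 / (1.0 + x * x);
        }
        printf("pi ~ %.10f, %.3f s\n", sum * h, omp_get_wtime() - t0);
        return 0;
    }

Compile with and without the OpenMP flag (e.g. -fopenmp) and time both runs; the reduction clause is also a natural springboard for discussing race conditions.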
Of course, like all stage illusionists, you'll have to choose your example carefully and rehearse to ensure that you do get an impressive-enough speedup to support your argument.
Personally I'd be embarrassed to use an embarrassingly parallel program for such a demo; the more perceptive members of the audience might be provoked into a response such as meh.
(1) Matrix multiply
Perhaps the most straightforward example (though matrix addition would be even simpler).
(2) Mandelbrot
http://en.wikipedia.org/wiki/Mandelbrot_set
Mandelbrot is also embarrassingly parallel, and OpenMP can achieve decent speedups on it. You can even use graphics to visualize the result. Mandelbrot is an interesting example because it has workload imbalance: you may see different speedups depending on the scheduling policy (e.g., schedule(dynamic,1) vs. schedule(static)) and the threading library (e.g., Cilk Plus or TBB); see the sketch after this list.
(3) A couple of mathematical kernels
For example, FFT (the non-recursive version) is also embarrassingly parallel.
Take a look at "OmpSCR" benchmarks: http://sourceforge.net/projects/ompscr/ This suite has simple OpenMP examples.
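To make the scheduling point from (2) concrete, here is a rough escape-time Mandelbrot sketch (no graphics; the image size and iteration limit are arbitrary). Rows near the set take far longer than rows outside it, so switching between schedule(static) and schedule(dynamic,1) and comparing the timings shows the effect of load imbalance:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        const int W = 2048, H = 2048, MAXIT = 2000;
        static int counts[2048 * 2048];                  // iteration count per pixel

        double t0 = omp_get_wtime();
        #pragma omp parallel for schedule(dynamic, 1)    // try schedule(static) for comparison
        for (int py = 0; py < H; ++py)
        {
            for (int px = 0; px < W; ++px)
            {
                double cx = -2.0 + 3.0 * px / W;         // map pixel to the complex plane
                double cy = -1.5 + 3.0 * py / H;
                double x = 0.0, y = 0.0;
                int it = 0;
                while (x * x + y * y <= 4.0 && it < MAXIT)
                {
                    double xt = x * x - y * y + cx;
                    y = 2.0 * x * y + cy;
                    x = xt;
                    ++it;
                }
                counts[py * W + px] = it;
            }
        }
        printf("%.3f s (corner count: %d)\n", omp_get_wtime() - t0, counts[0]);
        return 0;
    }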
I am searching for good (preferably plug-and-play) solutions for performing diagnostics on software I am developing. The software I am working on has several components that require extensive computing resources, and so we're attempting to capture the performance of these components for two reasons: 1) estimate required computing resources and thus the costs of running the software, and 2) quantify what an "improvement" is for the component (i.e. if we modify the code and speed increases, then it's an improvement). Our application is composed of a search engine plus many other components, and understanding the speed of the search engine is also critical to the end-user.
It seems to be hard to search for a solution since I'm not sure how to properly define my problem, and what I've found so far seems to be limited to basic error-logging techniques. A solution whose purpose is to run statistics (e.g. statistical regressions) on the captured data would be best. Unit-testing frameworks may have built-in test timers, but we need to capture data from live runs of our application to account for the numerous different scenarios.
So really there are two questions:
1) Is there a predefined solution for these sorts of tests?
2) Is there any good reference on running statistical regressions on this kind of data? Let's say we captured the execution time of the script and the size of the input data (e.g. the query). We can regress time on data size to understand the effect of changing the data size on the execution time. But these sorts of regressions are tricky, since it's not clear what all of the relevant variables are. Any reference on analyzing performance data would be excellent, and would benefit many people, I believe!
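For the second question, even a very small ordinary-least-squares fit can be informative. A minimal sketch (the data points below are made up purely for illustration): regressing log(time) on log(size) gives a slope that estimates how execution time scales with input size (a slope near 1 suggests linear scaling, near 2 quadratic):

    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main()
    {
        std::vector<double> size = {1e3, 1e4, 1e5, 1e6};       // hypothetical input sizes
        std::vector<double> time = {0.002, 0.021, 0.25, 2.9};  // hypothetical seconds

        // Ordinary least squares of log(time) on log(size).
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        const int n = (int)size.size();
        for (int i = 0; i < n; ++i)
        {
            double x = std::log(size[i]), y = std::log(time[i]);
            sx += x; sy += y; sxx += x * x; sxy += x * y;
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double intercept = (sy - slope * sx) / n;
        std::printf("time ~ exp(%.3f) * size^%.3f\n", intercept, slope);
        return 0;
    }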
Thanks
Matt
Big apps like these are going to be doing a lot of non-CPU processing, so to find optimization points you're going to need wall-clock-based, not CPU-based, sampling.
gprof and some others only sample on CPU time, so they cannot see needless I/O or other system calls.
If you do manage to find and remove CPU-intensive performance problems, the I/O-intensive ones will only become a larger fraction of the time.
Take a look at Zoom.
It's a stack sampler that reports, by line of code, the percent of wall-clock time that line is on the stack.
Any code point worth optimizing will probably be such a line.
It also has a nice butterfly view for browsing the call graph.
(You don't want the call graph as a whole. It will be a meaningless rat's nest.)
I'm working on Windows Phone 7, which does not support features like CUDA or OpenCL. I'm new to the GPU side of things. Is there anything on the GPU that I can use to help speed up raytracing? Like triangle intersection tests? Or selecting the correct colour from a texture?
CUDA and the like are really just higher-level languages for programming shaders, so any platform that supports programmable shaders gives you some capability to run general-purpose calculations on the GPU.
Unfortunately, it looks like Windows Phone 7 does not support custom programmable shaders, so GPU acceleration for a ray tracer is not really possible at this time. Even if it were, it is very difficult to use a GPU effectively for raytracing because of several very anti-GPU characteristics:
Poor memory coherency (each ray can easily interact with completely different geometry)
High branching factor (shaders work best with code that consistently follows a single path)
Large working set (a lot of geometry has to be accessible in memory at any one time to compute the outcome of even a single ray)
If your goal is to write a raytracer, it would probably be far easier to do completely on the CPU, and only then consider optimizations that are more esoteric.
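For the triangle-intersection part you mention, a plain CPU version is short anyway. Here is a rough Moller-Trumbore sketch (the vector type and epsilon are arbitrary choices):

    #include <cmath>

    struct Vec3 { float x, y, z; };

    static Vec3 sub(Vec3 a, Vec3 b)   { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
    static Vec3 cross(Vec3 a, Vec3 b) { return {a.y * b.z - a.z * b.y,
                                                a.z * b.x - a.x * b.z,
                                                a.x * b.y - a.y * b.x}; }
    static float dot(Vec3 a, Vec3 b)  { return a.x * b.x + a.y * b.y + a.z * b.z; }

    // Moller-Trumbore ray/triangle test: returns true and the distance t along the ray
    // if the ray (orig + t*dir) hits the triangle (v0, v1, v2).
    bool intersectTriangle(Vec3 orig, Vec3 dir, Vec3 v0, Vec3 v1, Vec3 v2, float &t)
    {
        const float EPS = 1e-7f;
        Vec3 e1 = sub(v1, v0), e2 = sub(v2, v0);
        Vec3 p  = cross(dir, e2);
        float det = dot(e1, p);
        if (std::fabs(det) < EPS) return false;        // ray parallel to the triangle plane
        float inv = 1.0f / det;
        Vec3 s = sub(orig, v0);
        float u = dot(s, p) * inv;
        if (u < 0.0f || u > 1.0f) return false;        // outside the triangle in u
        Vec3 q = cross(s, e1);
        float v = dot(dir, q) * inv;
        if (v < 0.0f || u + v > 1.0f) return false;    // outside the triangle in v
        t = dot(e2, q) * inv;
        return t > EPS;                                // hit in front of the ray origin
    }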
Raytracing is still a bit slow, even on an average modern desktop PC. You can speed it up by shooting just primary rays, but then rasterisation methods will actually be better and faster.
Are you certain you want to do raytracing on a phone, which has even less compute power than a PC? Phones are not designed for that kind of work.
I'm searching for reliable data on the performance of OpenGL's functions. A site that could, for example:
...tell me how much more efficient using glInterleavedArrays is compared to a gl*Pointer-based implementation with strides, or without them. If applicable, show the comparisons on nVidia vs. ATI cards vs. embedded systems.
...tell me how much of a boost is gained by using VBOs vs. non-buffered data in the cases of static, dynamic, and streamed data.
I'd like to find a site that has "no-bullshit" performance data, not just vague statements like "glInterleavedArrays are usually faster than direct gl*Pointer usage".
Is there such a dream site? Or at least somewhere I can get answers to the aforementioned questions?
(Yes, I know that nothing beats hand-profiling, but the fact that something works faster on my machine doesn't mean it's generally faster on all cards...)
It's more about application-level benchmarking than measuring the performance of individual features, but it might be possible to learn something from SPECviewperf, especially if you can discover more about what OpenGL mode each benchmark uses to perform its rendering. The benchmarks seem to include some options to tweak the usage of display lists, vertex arrays, etc., but I don't think SPEC's published results go into any analysis of the effects of changing these from the defaults. They don't seem to have any VBO coverage yet.